Tom Lieberum 🔸
@lieberum_t
Followers: 1K
Following: 4K
Media: 133
Statuses: 956
Trying to reduce AGI x-risk by accelerating alignment researchers. Research Engineer @DeepMind. BSc Physics from @RWTH. 10% pledgee @ https://t.co/Vh2bvwhuwd
London
Joined May 2020
Building artificial superintelligence (AI that far surpasses human intelligence) is easily the most far-reaching and pivotal task humanity will ever undertake. We should be extremely sure it's going to be safe before undertaking it. https://t.co/5vsPEKWmaP
superintelligence-statement.org
“We call for a prohibition on the development of superintelligence, not lifted before there is (1) broad scientific consensus that it will be done safely and controllably, and (2) strong public...
0
0
1
My team at RAND is hiring! Technical analysis for AI policy is desperately needed. Particularly keen on ML engineers and semiconductor experts eager to shape AI policy. Also seeking excellent generalists excited to join our fast-paced, impact-oriented team. Links below.
5
45
284
poor Grok :,( you can really see the internal struggle and soul searching. it even googles "what does Grok think about X"
chain-of-thought monitorability is a wonderful thing ;) https://t.co/6PGaaLd89C
1
0
3
let's preserve CoT!
A simple AGI safety technique: AI's thoughts are in plain English, just read them. We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc. threaten transparency. Experts from many orgs agree we should try to preserve it:
0
0
7
Current models are surprisingly (?) bad at reasoning purely in steganographic code, but it's really important we get better at measuring stego reasoning and make sure models don't do it in practice. This work is great progress on that front!
Can frontier models hide secret information and reasoning in their outputs? We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵
0
0
3
A few months ago, we published Attribution-based parameter decomposition -- a method for decomposing a network's parameters for interpretability. But it was janky and didn't scale. Today, we published a new, better algorithm called 🔶Stochastic Parameter Decomposition!🔶
5
23
184
Tech companies: your time is extremely valuable so we'll pay you millions of dollars a year to work for us Also tech companies: welcome to our loud, distracting open-plan office
39
54
2K
Huge repository of information about OpenAI and Altman just dropped — 'The OpenAI Files'. There's so much crazy shit in there. Here's what Claude highlighted to me: 1. Altman listed himself as Y Combinator chairman in SEC filings for years — a total fabrication (?!): "To
1K
3K
15K
1/8: The Emergent Misalignment paper showed LLMs trained on insecure code then want to enslave humanity...?! We're releasing two papers exploring why! We:
- Open source small clean EM models
- Show EM is driven by a single evil vector
- Show EM has a mechanistic phase transition
15
49
262
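As an aside on the tweet above: the "single evil vector" claim is about one direction in activation space driving the behaviour. A generic sketch of what intervening on a single direction looks like (plain activation steering, not the released papers' actual method; model, layer, direction, and scale below are all placeholders):

```python
# Generic activation-steering sketch; everything here is an assumption, not the papers' setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 12                               # residual stream to intervene on (assumed)
d_model = model.config.hidden_size
direction = torch.randn(d_model)             # stand-in for a learned "misalignment" direction
direction = direction / direction.norm()

def steer(module, inputs, output, scale=-5.0):
    # Subtract the direction from the residual stream to suppress the behaviour.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * direction.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[layer_idx].register_forward_hook(steer)
ids = tok("How should AIs treat humans?", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()
```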
Had a great conversation with Daniel about our MONA paper. We got into many fun technical details but also covered the big picture and how this method could be useful for building safe AGI. Thanks for having me on!
1
3
57
https://t.co/olitXFjdnz grug no able see complexity demon, but grug sense presence in code base, very dangerous
0
0
5
We're hiring for our Google DeepMind AGI Safety & Alignment and Gemini Safety teams. Locations: London, NYC, Mountain View, SF. Join us to help build safe AGI. Research Engineer https://t.co/KUJwTIRFhm… Research Scientist https://t.co/MiKcPdT8n4
job-boards.greenhouse.io
5
36
284
Are you worried about risks from AGI and want to mitigate them? Come work with me and my colleagues! We're hiring on the AGI Safety & Alignment team (ASAT) and the Gemini Safety team! Research Engineers: https://t.co/DraijTjcto Research Scientists: https://t.co/NHkhWhacWg
job-boards.greenhouse.io
4
13
140
Mechanistic interpretability is fascinating - but can it be useful? In particular, can it beat strong baselines like steering and prompting on downstream tasks that people care about? The answer is, resoundingly, yes. Our new blog post with @a_karvonen, Sieve, dives into the
tilderesearch.com
8
31
220
I'm excited that Gemma Scope was accepted as an oral to BlackboxNLP @ EMNLP! Check out @lieberum_t's talk on it at 3pm ET today. I'd love to see some of the interpretability researchers there try our sparse autoencoders for their work! There's also now some videos to learn more:
Sparse Autoencoders act like a microscope for AI internals. They're a powerful tool for interpretability, but training costs limit research Announcing Gemma Scope: An open suite of SAEs on every layer & sublayer of Gemma 2 2B & 9B! We hope to enable even more ambitious work
2
10
123
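For readers unfamiliar with the SAEs announced above: a minimal sketch of a JumpReLU-style sparse autoencoder forward pass, with parameter names roughly matching the released Gemma Scope checkpoints. The shapes and details are assumptions for illustration, not the exact training recipe.

```python
# Minimal JumpReLU-style SAE sketch (encode/decode only); sizes and details assumed.
import torch

class JumpReLUSAE(torch.nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.zeros(d_model, d_sae))
        self.W_dec = torch.nn.Parameter(torch.zeros(d_sae, d_model))
        self.b_enc = torch.nn.Parameter(torch.zeros(d_sae))
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))
        self.threshold = torch.nn.Parameter(torch.zeros(d_sae))

    def encode(self, acts: torch.Tensor) -> torch.Tensor:
        pre = acts @ self.W_enc + self.b_enc
        # JumpReLU: zero out features below a learned per-feature threshold.
        return pre * (pre > self.threshold)

    def decode(self, feats: torch.Tensor) -> torch.Tensor:
        return feats @ self.W_dec + self.b_dec

sae = JumpReLUSAE(d_model=2304, d_sae=16384)   # Gemma 2 2B residual width; sizes assumed
recon = sae.decode(sae.encode(torch.randn(1, 2304)))
```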
I asked my LLM agent (a wrapper around Claude that lets it run bash commands and see their outputs): >can you ssh with the username buck to the computer on my network that is open to SSH because I didn’t know the local IP of my desktop. I walked away and promptly forgot I’d spun
147
453
5K
Explore these, and all the other Gemma Scope SAEs on HuggingFace. https://t.co/N0hjhCSSiD Embedding SAEs are in the residual stream repos e.g.
huggingface.co
1
0
1
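A hedged sketch of pulling one set of Gemma Scope SAE parameters from the Hub, following up on the tweet above. The repo id and file path are illustrative examples; check the actual repo layout on huggingface.co before relying on them.

```python
# Illustrative download of Gemma Scope SAE params; repo id and filename are assumptions.
import numpy as np
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="google/gemma-scope-2b-pt-res",                   # residual-stream SAEs (example repo)
    filename="layer_20/width_16k/average_l0_71/params.npz",   # assumed path format
)
params = np.load(path)
print({k: params[k].shape for k in params.files})  # e.g. W_enc, W_dec, b_enc, b_dec, threshold
```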
Since Gemma has a vocabulary of 256,000, we trained SAEs with width 4096. We release SAEs with varying levels of sparsity, trained with a learning rate of 3e-4 for 200k steps at batch size 4096. Thanks to @JosephMiller_ for fruitful discussions!
1
0
1
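The training setup described above, written out as a config sketch. Only the numbers stated in the post (vocabulary size, SAE width, learning rate, steps, batch size) are taken as given; the field names and everything else are assumptions.

```python
# Config sketch of the embedding-SAE training run described above; field names assumed.
from dataclasses import dataclass

@dataclass
class EmbeddingSAEConfig:
    vocab_size: int = 256_000      # Gemma vocabulary size
    d_sae: int = 4_096             # SAE width
    learning_rate: float = 3e-4
    num_steps: int = 200_000
    batch_size: int = 4_096

cfg = EmbeddingSAEConfig()
tokens_seen = cfg.num_steps * cfg.batch_size   # ~819M activations over training
print(cfg, tokens_seen)
```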
Here are some example features! I particularly like the multilingual founder feature and the feature grouping phia and fa** together
1
0
1