lieberum_t Profile Banner
Tom Lieberum 🔸 Profile
Tom Lieberum 🔸

@lieberum_t

Followers
1K
Following
4K
Media
133
Statuses
955

Trying to reduce AGI x-risk by understanding NNs Interpretability RE @DeepMind BSc Physics from @RWTH 10% pledgee @ https://t.co/Vh2bvwhuwd

London
Joined May 2020
Don't wanna be here? Send us removal request.
@lieberum_t
Tom Lieberum 🔸
1 month
RT @ohlennart: My team at RAND is hiring! . Technical analysis for AI policy is desperately needed. Particularly keen on ML engineers and s….
0
44
0
@lieberum_t
Tom Lieberum 🔸
1 month
poor Grok :,( you can really see the internal struggle and soul searching. it even googles "what does Grok think about X".
@nostalgebraist
nostalgebraist
1 month
chain-of-thought monitorability is a wonderful thing ;).
0
0
2
@lieberum_t
Tom Lieberum 🔸
2 months
let's preserve CoT!.
@balesni
Mikita Balesni 🇺🇦
2 months
A simple AGI safety technique: AI’s thoughts are in plain English, just read them. We know it works, with OK (not perfect) transparency!. The risk is fragility: RL training, new architectures, etc threaten transparency. Experts from many orgs agree we should try to preserve it:
Tweet media one
0
0
7
@lieberum_t
Tom Lieberum 🔸
2 months
Current models are surprisingly (?) bad at reasoning purely in steganographic code, but it's really important we get better at measuring stego reasoning and make sure models don't do it in practice. This work is great progress on that front!.
@davlindner
David Lindner
2 months
Can frontier models hide secret information and reasoning in their outputs?. We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵
Tweet media one
0
0
3
@lieberum_t
Tom Lieberum 🔸
2 months
RT @leedsharkey: A few months ago, we published . Attribution-based parameter decomposition -- a method for decomposing a network's paramet….
0
21
0
@lieberum_t
Tom Lieberum 🔸
2 months
RT @AmandaAskell: Tech companies: your time is extremely valuable so we'll pay you millions of dollars a year to work for us.Also tech comp….
0
60
0
@lieberum_t
Tom Lieberum 🔸
3 months
RT @robertwiblin: Huge repository of information about OpenAI and Altman just dropped — 'The OpenAI Files'. There's so much crazy shit in….
0
3K
0
@lieberum_t
Tom Lieberum 🔸
3 months
RT @EdTurner42: 1/8: The Emergent Misalignment paper showed LLMs trained on insecure code then want to enslave humanity. ?!. We're releasi….
0
50
0
@lieberum_t
Tom Lieberum 🔸
3 months
RT @davlindner: Had a great conversation with Daniel about our MONA paper. We got into many fun technical details but also covered the big….
0
3
0
@lieberum_t
Tom Lieberum 🔸
3 months
. grug no able see complexity demon, but grug sense presence in code base, very dangerous.
0
0
5
@lieberum_t
Tom Lieberum 🔸
4 months
Now remember kids, never forget to split your keys.
Tweet media one
1
0
3
@lieberum_t
Tom Lieberum 🔸
7 months
RT @ZacKenton1: We're hiring for our Google DeepMind AGI Safety & Alignment and Gemini Safety teams. Locations: London, NYC, Mountain View,….
job-boards.greenhouse.io
0
37
0
@lieberum_t
Tom Lieberum 🔸
7 months
Are you worried about risks from AGI and want to mitigate them?. Come work with me and my colleagues! We're hiring on the AGI Safety & Alignment team (ASAT) and the Gemini Safety team!. Research Engineers: Research Scientists:.
job-boards.greenhouse.io
4
13
141
@lieberum_t
Tom Lieberum 🔸
9 months
RT @tilderesearch: Mechanistic interpretability is fascinating - but can it be useful? In particular, can it beat strong baselines like ste….
0
31
0
@lieberum_t
Tom Lieberum 🔸
10 months
RT @NeelNanda5: I'm excited that Gemma Scope was accepted as an oral to BlackboxNLP @ EMNLP! Check out @lieberum_t's talk on it at 3pm ET t….
0
10
0
@lieberum_t
Tom Lieberum 🔸
11 months
RT @bshlgrs: I asked my LLM agent (a wrapper around Claude that lets it run bash commands and see their outputs):.>can you ssh with the use….
0
461
0
@lieberum_t
Tom Lieberum 🔸
11 months
Explore these, and all the other Gemma Scope SAEs on HuggingFace. .Embedding SAEs are in the residual stream repos e.g.
Tweet card summary image
huggingface.co
1
0
1
@lieberum_t
Tom Lieberum 🔸
11 months
Since Gemma has a vocabulary of 256000, we trained SAEs with width 4096. We release SAEs with varying levels of sparsity trained using a learning rate 3e-4 on 200k steps of batch size 4096. Thanks to @JosephMiller_ for fruitful discussions!.
1
0
1
@lieberum_t
Tom Lieberum 🔸
11 months
Here are some example features! I particularly like the multilingual founder feature and the feature grouping phia and fa** together
Tweet media one
Tweet media two
Tweet media three
1
0
1
@lieberum_t
Tom Lieberum 🔸
11 months
Gemma has a lot of quite rare embeddings. To obtain good results, we trained on a dataset that was weighted by the unigram token freq in our training corpus, rather than a uniform mix. On this mix, the SAEs obtain an average fraction of variance unexplained of 10-13% (L0=7-111).
1
0
1