
Tom Lieberum 🔸
@lieberum_t
Followers: 1K · Following: 4K · Media: 133 · Statuses: 955
Trying to reduce AGI x-risk by understanding NNs. Interpretability RE @DeepMind. BSc Physics from @RWTH. 10% pledger @ https://t.co/Vh2bvwhuwd
London
Joined May 2020
RT @ohlennart: My team at RAND is hiring! Technical analysis for AI policy is desperately needed. Particularly keen on ML engineers and s…
0 replies · 44 reposts · 0 likes
let's preserve CoT!
A simple AGI safety technique: AI’s thoughts are in plain English, just read them. We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc. threaten transparency. Experts from many orgs agree we should try to preserve it:
0 replies · 0 reposts · 7 likes
Current models are surprisingly (?) bad at reasoning purely in steganographic code, but it's really important we get better at measuring stego reasoning and make sure models don't do it in practice. This work is great progress on that front!
Can frontier models hide secret information and reasoning in their outputs? We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵
0 replies · 0 reposts · 3 likes
RT @leedsharkey: A few months ago, we published Attribution-based parameter decomposition -- a method for decomposing a network's paramet…
0 replies · 21 reposts · 0 likes
RT @AmandaAskell: Tech companies: your time is extremely valuable so we'll pay you millions of dollars a year to work for us. Also tech comp…
0 replies · 60 reposts · 0 likes
RT @robertwiblin: Huge repository of information about OpenAI and Altman just dropped — 'The OpenAI Files'. There's so much crazy shit in…
0 replies · 3K reposts · 0 likes
RT @EdTurner42: 1/8: The Emergent Misalignment paper showed LLMs trained on insecure code then want to enslave humanity?! We're releasi…
0 replies · 50 reposts · 0 likes
RT @davlindner: Had a great conversation with Daniel about our MONA paper. We got into many fun technical details but also covered the big…
0 replies · 3 reposts · 0 likes
RT @ZacKenton1: We're hiring for our Google DeepMind AGI Safety & Alignment and Gemini Safety teams. Locations: London, NYC, Mountain View,…
job-boards.greenhouse.io
0 replies · 37 reposts · 0 likes
Are you worried about risks from AGI and want to mitigate them? Come work with me and my colleagues! We're hiring on the AGI Safety & Alignment team (ASAT) and the Gemini Safety team! Research Engineers: Research Scientists:
job-boards.greenhouse.io
4 replies · 13 reposts · 141 likes
RT @tilderesearch: Mechanistic interpretability is fascinating - but can it be useful? In particular, can it beat strong baselines like ste…
0 replies · 31 reposts · 0 likes
RT @NeelNanda5: I'm excited that Gemma Scope was accepted as an oral to BlackboxNLP @ EMNLP! Check out @lieberum_t's talk on it at 3pm ET t…
0 replies · 10 reposts · 0 likes
RT @bshlgrs: I asked my LLM agent (a wrapper around Claude that lets it run bash commands and see their outputs):
>can you ssh with the use…
0 replies · 461 reposts · 0 likes
Explore these, and all the other Gemma Scope SAEs, on HuggingFace. Embedding SAEs are in the residual stream repos, e.g.
huggingface.co
1 reply · 0 reposts · 1 like
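A minimal sketch of pulling one of these SAEs down from the Hub, assuming the repo layout of the Gemma Scope residual-stream releases; the repo_id and params.npz path below are illustrative values, so check the repo's file browser on huggingface.co for the layer/width/sparsity combination you actually want:

import numpy as np
from huggingface_hub import hf_hub_download

# Download one SAE's parameter file (repo id and path are example values,
# not a guaranteed location -- browse the repo to confirm).
path = hf_hub_download(
    repo_id="google/gemma-scope-2b-pt-res",
    filename="layer_20/width_16k/average_l0_71/params.npz",
)
params = np.load(path)

# Inspect what's inside: typically W_enc, W_dec, b_enc, b_dec, threshold.
for name in params.files:
    print(name, params[name].shape)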
Since Gemma has a vocabulary of 256,000, we trained SAEs with width 4096. We release SAEs at varying levels of sparsity, trained with a learning rate of 3e-4 for 200k steps at batch size 4096. Thanks to @JosephMiller_ for fruitful discussions!
1 reply · 0 reposts · 1 like
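For reference, a sketch of the JumpReLU forward pass that Gemma Scope SAEs use, written against the params.npz keys from the loading sketch above (the key names and shapes are assumptions based on the released files, not a definitive API):

import numpy as np

def sae_forward(x, p):
    # x: (batch, d_model) activations; p: mapping with W_enc, b_enc, W_dec, b_dec, threshold.
    pre = x @ p["W_enc"] + p["b_enc"]                # pre-activations, (batch, width)
    acts = np.where(pre > p["threshold"], pre, 0.0)  # JumpReLU: zero features below a learned per-feature threshold
    x_hat = acts @ p["W_dec"] + p["b_dec"]           # reconstruction, (batch, d_model)
    return acts, x_hat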