Tom Lieberum 🔸
@lieberum_t
Followers: 1K
Following: 4K
Media: 133
Statuses: 956
Trying to reduce AGI x-risk by accelerating alignment researchers. Research Engineer @DeepMind. BSc Physics from @RWTH. 10% pledgee @ https://t.co/Vh2bvwhuwd
London
Joined May 2020
Building artificial superintelligence (AI that far surpasses human intelligence) is easily the most far-reaching and pivotal task humanity will ever undertake. We should be extremely sure it's going to be safe before undertaking it. https://t.co/5vsPEKWmaP
superintelligence-statement.org
“We call for a prohibition on the development of superintelligence, not lifted before there is (1) broad scientific consensus that it will be done safely and controllably, and (2) strong public...
0
0
1
My team at RAND is hiring! Technical analysis for AI policy is desperately needed. Particularly keen on ML engineers and semiconductor experts eager to shape AI policy. Also seeking excellent generalists excited to join our fast-paced, impact-oriented team. Links below.
5
45
284
poor Grok :,( you can really see the internal struggle and soul searching. it even googles "what does Grok think about X"
chain-of-thought monitorability is a wonderful thing ;) https://t.co/6PGaaLd89C
1
0
3
let's preserve CoT!
A simple AGI safety technique: AI's thoughts are in plain English, just read them. We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc. threaten transparency. Experts from many orgs agree we should try to preserve it:
0
0
7
Current models are surprisingly (?) bad at reasoning purely in steganographic code, but it's really important we get better at measuring stego reasoning and make sure models don't do it in practice. This work is great progress on that front!
Can frontier models hide secret information and reasoning in their outputs? We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵
0
0
3
A few months ago, we published Attribution-based parameter decomposition -- a method for decomposing a network's parameters for interpretability. But it was janky and didn't scale. Today, we published a new, better algorithm called 🔶Stochastic Parameter Decomposition!🔶
5
23
184
Tech companies: your time is extremely valuable so we'll pay you millions of dollars a year to work for us Also tech companies: welcome to our loud, distracting open-plan office
39
54
2K
Huge repository of information about OpenAI and Altman just dropped — 'The OpenAI Files'. There's so much crazy shit in there. Here's what Claude highlighted to me: 1. Altman listed himself as Y Combinator chairman in SEC filings for years — a total fabrication (?!): "To
1K
3K
15K
1/8: The Emergent Misalignment paper showed LLMs trained on insecure code then want to enslave humanity...?! We're releasing two papers exploring why! We:
- Open source small clean EM models
- Show EM is driven by a single evil vector
- Show EM has a mechanistic phase transition
15
49
262
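As an aside on the tweet above: the "single evil vector" claim is about one direction in activation space driving the behaviour. A generic sketch of what intervening on a single direction looks like (plain activation steering, not the released papers' actual method; model, layer, direction, and scale below are all placeholders):

```python
# Generic activation-steering sketch; everything here is an assumption, not the papers' setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 12                               # residual stream to intervene on (assumed)
d_model = model.config.hidden_size
direction = torch.randn(d_model)             # stand-in for a learned "misalignment" direction
direction = direction / direction.norm()

def steer(module, inputs, output, scale=-5.0):
    # Subtract the direction from the residual stream to suppress the behaviour.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * direction.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[layer_idx].register_forward_hook(steer)
ids = tok("How should AIs treat humans?", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()
```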
Had a great conversation with Daniel about our MONA paper. We got into many fun technical details but also covered the big picture and how this method could be useful for building safe AGI. Thanks for having me on!
1
3
57
https://t.co/olitXFjdnz grug no able see complexity demon, but grug sense presence in code base, very dangerous
0
0
5
We're hiring for our Google DeepMind AGI Safety & Alignment and Gemini Safety teams. Locations: London, NYC, Mountain View, SF. Join us to help build safe AGI. Research Engineer https://t.co/KUJwTIRFhm… Research Scientist https://t.co/MiKcPdT8n4
job-boards.greenhouse.io
5
36
284
Are you worried about risks from AGI and want to mitigate them? Come work with me and my colleagues! We're hiring on the AGI Safety & Alignment team (ASAT) and the Gemini Safety team! Research Engineers: https://t.co/DraijTjcto Research Scientists: https://t.co/NHkhWhacWg
job-boards.greenhouse.io
4
13
140
Mechanistic interpretability is fascinating - but can it be useful? In particular, can it beat strong baselines like steering and prompting on downstream tasks that people care about? The answer is, resoundingly, yes. Our new blog post with @a_karvonen, Sieve, dives into the
tilderesearch.com
8
31
220
I'm excited that Gemma Scope was accepted as an oral to BlackboxNLP @ EMNLP! Check out @lieberum_t's talk on it at 3pm ET today. I'd love to see some of the interpretability researchers there try our sparse autoencoders for their work! There's also now some videos to learn more:
Sparse Autoencoders act like a microscope for AI internals. They're a powerful tool for interpretability, but training costs limit research Announcing Gemma Scope: An open suite of SAEs on every layer & sublayer of Gemma 2 2B & 9B! We hope to enable even more ambitious work
2
10
123
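For readers unfamiliar with the SAEs announced above: a minimal sketch of a JumpReLU-style sparse autoencoder forward pass, with parameter names roughly matching the released Gemma Scope checkpoints. The shapes and details are assumptions for illustration, not the exact training recipe.

```python
# Minimal JumpReLU-style SAE sketch (encode/decode only); sizes and details assumed.
import torch

class JumpReLUSAE(torch.nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.zeros(d_model, d_sae))
        self.W_dec = torch.nn.Parameter(torch.zeros(d_sae, d_model))
        self.b_enc = torch.nn.Parameter(torch.zeros(d_sae))
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))
        self.threshold = torch.nn.Parameter(torch.zeros(d_sae))

    def encode(self, acts: torch.Tensor) -> torch.Tensor:
        pre = acts @ self.W_enc + self.b_enc
        # JumpReLU: zero out features below a learned per-feature threshold.
        return pre * (pre > self.threshold)

    def decode(self, feats: torch.Tensor) -> torch.Tensor:
        return feats @ self.W_dec + self.b_dec

sae = JumpReLUSAE(d_model=2304, d_sae=16384)   # Gemma 2 2B residual width; sizes assumed
recon = sae.decode(sae.encode(torch.randn(1, 2304)))
```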
I asked my LLM agent (a wrapper around Claude that lets it run bash commands and see their outputs): >can you ssh with the username buck to the computer on my network that is open to SSH because I didn’t know the local IP of my desktop. I walked away and promptly forgot I’d spun
147
453
5K
Explore these, and all the other Gemma Scope SAEs on HuggingFace. https://t.co/N0hjhCSSiD Embedding SAEs are in the residual stream repos e.g.
huggingface.co
1
0
1
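A hedged sketch of pulling one set of Gemma Scope SAE parameters from the Hub, following up on the tweet above. The repo id and file path are illustrative examples; check the actual repo layout on huggingface.co before relying on them.

```python
# Illustrative download of Gemma Scope SAE params; repo id and filename are assumptions.
import numpy as np
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="google/gemma-scope-2b-pt-res",                   # residual-stream SAEs (example repo)
    filename="layer_20/width_16k/average_l0_71/params.npz",   # assumed path format
)
params = np.load(path)
print({k: params[k].shape for k in params.files})  # e.g. W_enc, W_dec, b_enc, b_dec, threshold
```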
Since Gemma has a vocabulary of 256,000, we trained SAEs with width 4096. We release SAEs with varying levels of sparsity, trained with a learning rate of 3e-4 for 200k steps at batch size 4096. Thanks to @JosephMiller_ for fruitful discussions!
1
0
1
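The training setup described above, written out as a config sketch. Only the numbers stated in the post (vocabulary size, SAE width, learning rate, steps, batch size) are taken as given; the field names and everything else are assumptions.

```python
# Config sketch of the embedding-SAE training run described above; field names assumed.
from dataclasses import dataclass

@dataclass
class EmbeddingSAEConfig:
    vocab_size: int = 256_000      # Gemma vocabulary size
    d_sae: int = 4_096             # SAE width
    learning_rate: float = 3e-4
    num_steps: int = 200_000
    batch_size: int = 4_096

cfg = EmbeddingSAEConfig()
tokens_seen = cfg.num_steps * cfg.batch_size   # ~819M activations over training
print(cfg, tokens_seen)
```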
Here are some example features! I particularly like the multilingual founder feature and the feature grouping phia and fa** together
1
0
1