
Neel Nanda
@NeelNanda5
Followers: 28K · Following: 32K · Media: 320 · Statuses: 5K
Mechanistic Interpretability lead at DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!
London, UK
Joined June 2022
Some solid work from my colleagues on the GDM AGI Safety team on measuring how good models are at scheming. It looks like we're not (yet) in much danger! I think forming better and more robust evals for these kinds of questions should be a big priority.
As models advance, a key AI safety concern is deceptive alignment / "scheming" – where AI might covertly pursue unintended goals. Our paper "Evaluating Frontier Models for Stealth and Situational Awareness" assesses whether current models can scheme.
0
3
66
Sigh, looks like ICML is actually enforcing the hypocritical and uninclusive policy of rejecting accepted papers if authors are not able to get an in-person registration. At least they have the decency to remind you and clarify that you only need registration (contrary to the…
It's a real shame that ICML has decided to automatically reject accepted papers if no author can attend ICML. A top conference paper is a significant boost to early career researchers, exactly the people least likely to be able to afford to go to a conference in Vancouver.
3
2
98
RT @NeelNanda5: IMO chain of thought monitoring is actually pretty great. It gives additional safety and should be widely used on frontier…
0
20
0
It is well known that CoT can have weird pathologies and biases that hide the underlying computation. Knowledge is always useful, and I've helped with research here myself. But the evidence is largely specific to cases where the hidden computation is simple, which aren't the most concerning ones.
🧵 Chain-of-Thought reasoning in LLMs like Claude 3.7 and R1 is behind many recent breakthroughs. But does the CoT always explain how models answer? No! For the first time, we show that models can be unfaithful on normal prompts, rather than contrived prompts designed to trick…
1
0
6
Chain of thought is not just some post-hoc rationalised BS. It is the intermediate state of the computation that produces the model's final answer. Just as analysing activations, the intermediate state of a single forward pass, can be principled, so can studying the thoughts.
New paper: What happens when an LLM reasons? We created methods to interpret reasoning steps & their connections: resampling CoT, attention analysis, & suppressing attention. We discover thought anchors: key steps shaping everything else. Check our tool & unpack CoT yourself 🧵
1
0
9
IMO chain of thought monitoring is actually pretty great. It gives additional safety and should be widely used on frontier models. CoT improves capabilities. Thoughts are the intermediate state of the computation. On the hardest tasks, they have real info. It's not perfect, but what is?
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵
6
20
140
Good news! There will be a mechanistic interpretability workshop at NeurIPS (Dec 6/7, San Diego). If you were disappointed that ICML rejected us, now we'll do an even better one: 4 more months of progress to discuss! Papers likely due late August/early Sept, more info soon.
To anyone looking forward to another ICML mech interp workshop, this will unfortunately not happen. We were rejected for delightfully logical reasons, like insufficient argument for why 'last year's workshop was very successful with a queue outside throughout' implies future success.
3
20
321
The 80,000 Hours podcast is one of my favourite podcasts; I'm super excited to go on and attempt to share some useful takes! Let us know what kind of questions you'd be interested in.
I'll soon be interviewing the man, the legend @NeelNanda5 — head of the Google DeepMind mechanistic interpretability team — for the 80,000 Hours Podcast. What should I ask him? (Mech interp tries to figure out how AI models are thinking and why they do what they do.)
11
6
212