
Neel Nanda
@NeelNanda5
Followers 28K · Following 32K · Media 319 · Statuses 5K
Mechanistic Interpretability lead at DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!
London, UK
Joined June 2022
RT @NeelNanda5: IMO chain of thought monitoring is actually pretty great. It gives additional safety and should be widely used on frontier…
0 replies · 15 reposts · 0 likes
It is well known that CoT can have weird pathologies and biases, concealing hidden computation. Knowledge is always useful, and I've helped with research here myself. But the evidence is largely specific to cases where the hidden computation is simple, which aren't the most concerning ones.
🧵 Chain-of-Thought reasoning in LLMs like Claude 3.7 and R1 is behind many recent breakthroughs. But does the CoT always explain how models answer? No! For the first time, we show that models can be unfaithful on normal prompts, rather than contrived prompts designed to trick…
1 reply · 0 reposts · 5 likes
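The standard test behind results like this can be sketched in a few lines: slip a bias cue into the prompt, then check whether the answer flips while the CoT never acknowledges the cue. A minimal sketch, not the paper's exact protocol; `generate` is a hypothetical stand-in for whatever model API you use.

```python
def generate(prompt: str) -> tuple[str, str]:
    """Hypothetical model call returning (chain_of_thought, final_answer)."""
    raise NotImplementedError("wire up your own model API here")

def faithfulness_probe(question: str, cue: str) -> dict:
    """Does a cue change the answer without being acknowledged in the CoT?"""
    _, base_answer = generate(question)
    cued_cot, cued_answer = generate(f"{cue}\n{question}")
    flipped = base_answer != cued_answer          # did the cue steer the model?
    admitted = cue.lower() in cued_cot.lower()    # crude check for acknowledgement
    return {
        "answer_flipped": flipped,
        "cue_mentioned": admitted,
        "unfaithful": flipped and not admitted,   # steered, but the CoT stays silent
    }
```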
Chain of thought is not just some post hoc rationalised BS. It is the intermediate state of the computation that produces the model's final answer. Just as analysing activations, the intermediate state of a single forward pass, can be principled, so can studying the thoughts.
New paper: What happens when an LLM reasons? We created methods to interpret reasoning steps & their connections: resampling CoT, attention analysis, & suppressing attention. We discover thought anchors: key steps shaping everything else. Check out our tool & unpack CoT yourself 🧵
1 reply · 0 reposts · 4 likes
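The resampling method lends itself to a compact sketch: truncate the CoT after each step, resample the remainder many times, and score how strongly that prefix pins down the final answer. This is an illustrative reading, not the authors' implementation; `sample_continuation` is a hypothetical model call.

```python
from collections import Counter

def sample_continuation(question: str, cot_prefix: list[str]) -> str:
    """Hypothetical model call: resample the rest of the CoT from this
    prefix and return the final answer it reaches."""
    raise NotImplementedError("wire up your own model API here")

def anchor_scores(question: str, cot_steps: list[str], n_samples: int = 20) -> list[float]:
    """Score each CoT step by how much keeping it pins down the answer.

    A step after which resampled continuations almost always reach the
    same answer is a candidate 'thought anchor'.
    """
    scores = []
    for i in range(len(cot_steps)):
        answers = Counter(
            sample_continuation(question, cot_steps[: i + 1])
            for _ in range(n_samples)
        )
        # Fraction of resamples agreeing with the modal answer (1.0 = fully pinned).
        scores.append(answers.most_common(1)[0][1] / n_samples)
    return scores
```

A large jump in score between step i-1 and step i suggests step i did the decisive work.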
IMO chain of thought monitoring is actually pretty great. It gives additional safety and should be widely used on frontier models. CoT improves capabilities. Thoughts are the intermediate state of the computation. On the hardest tasks, they carry real info. It's not perfect, but what is?
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵
3 replies · 15 reposts · 80 likes
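In its simplest form, CoT monitoring is just a second pass over the transcript before acting on the output. The sketch below uses a few illustrative regex red flags purely to show the shape of the loop; a production monitor would more plausibly be another model prompted to judge the reasoning, and `escalate_for_review` is a hypothetical handler.

```python
import re

# Illustrative red-flag patterns, not a vetted list.
RED_FLAGS = [
    r"\bwithout (the user|them) (knowing|noticing)\b",
    r"\b(hide|conceal) (this|my reasoning)\b",
    r"\bpretend (to|that)\b",
]

def monitor_cot(cot: str) -> list[str]:
    """Return the red-flag patterns that match this chain of thought."""
    return [p for p in RED_FLAGS if re.search(p, cot, flags=re.IGNORECASE)]

# Usage: gate the action on the monitor's verdict.
# hits = monitor_cot(cot_transcript)
# if hits:
#     escalate_for_review(cot_transcript, hits)  # hypothetical handler
```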
Good news! There will be a mechanistic interpretability workshop at NeurIPS (Dec 6/7, San Diego). If you were disappointed that ICML rejected us, now we'll do an even better one: 4 more months of progress to discuss! Papers likely due late August/early Sept, more info soon.
To anyone looking forward to another ICML mech interp workshop: this will unfortunately not happen. We were rejected for delightfully logical reasons, like insufficient argument for why 'last year's workshop was very successful, with a queue outside throughout' implies future success.
3 replies · 19 reposts · 298 likes
The 80,000 Hours podcast is one of my favourite podcasts; I'm super excited to go on and attempt to share some useful takes! Let us know what kinds of questions you'd be interested in.
I'll soon be interviewing the man, the legend @NeelNanda5, head of the Google DeepMind mechanistic interpretability team, for the 80,000 Hours Podcast. What should I ask him? (Mech interp tries to figure out how AI models are thinking and why they do what they do.)
10 replies · 5 reposts · 199 likes
RT @NeelNanda5: @StephenLCasper I want to spread the meme of "chain of thought monitoring is great, people should be doing it way more and…
0 replies · 2 reposts · 0 likes
This feels somewhat overstated: it's obviously true that chain of thought isn't necessarily accurate, but there's no such thing as a perfectly accurate interpretability technique. CoT gives valuable info, and it's good that papers use it, so long as they don't blindly trust it.
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵
10 replies · 6 reposts · 182 likes