NeelNanda5 Profile Banner
Neel Nanda Profile
Neel Nanda

@NeelNanda5

Followers
28K
Following
32K
Media
319
Statuses
5K

Mechanistic Interpretability lead DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!

London, UK
Joined June 2022
Don't wanna be here? Send us removal request.
@NeelNanda5
Neel Nanda
2 months
After supervising 20+ papers, I have highly opinionated views on writing great ML papers. When I entered the field I found this all frustratingly opaque. So I wrote a guide on turning research into high-quality papers with scientific integrity! Hopefully still useful for NeurIPS
Tweet media one
18
177
2K
@NeelNanda5
Neel Nanda
3 hours
RT @NeelNanda5: IMO chain of thought monitoring is actually pretty great. It gives additional safety and should be widely used on frontier….
0
15
0
@NeelNanda5
Neel Nanda
9 hours
The real question you always need to ask in safety is the pragmatic question of quantifying the trade-off. What additional safety does this buy you, and how much is that offset by risking false sense of security? My guess is that, right now, the trade is extremely worth it.
0
2
12
@NeelNanda5
Neel Nanda
9 hours
CoT is not a perfect tool. But if you throw away tools when you find a flaw, you soon have no safety techniques left. This is ML not mathematics. Interpretability is an empirical science. We calibrate, combine imperfect tools, understand weaknesses and make our best guess.
1
2
19
@NeelNanda5
Neel Nanda
9 hours
All this reasoning could totally be wrong and this area badly needs more research. But it would be foolish to throw away such a promising tool. Further, CoT monitoring is shovel ready and high on my list of what to do if we had to deploy AGI right now with current safety research.
1
0
6
@NeelNanda5
Neel Nanda
9 hours
I'm also concerned that monitorability of CoT will eventually break, from new architectures, CoT ceasing to be in English, training on it or spooky situational awareness. We must explore alternatives, measure performance and not rely on it. But we should use it while it works.
2
0
3
@NeelNanda5
Neel Nanda
9 hours
Though, to be clear, in cases where the threat model *is* about simple hidden computation, like algorithmic bias in hiring, I'm very on board with the paper's critique.
1
0
3
@NeelNanda5
Neel Nanda
9 hours
It is well known that CoT can have weird pathologies and biases, hiding a hidden computation. Knowledge is always useful, and I've helped with research here myself. But evidence is largely specific to when the hidden computation is simple, which aren't the most concerning cases.
@IvanArcus
Iván Arcuschin
4 months
🧵Chain-of-Thought reasoning in LLMs like Claude 3.7 and R1 is behind many recent breakthroughs. But does the CoT always explain how models answer?.No! For the first time, we show that models can be unfaithful on normal prompts, rather than contrived prompts designed to trick
Tweet media one
1
0
5
@NeelNanda5
Neel Nanda
9 hours
In theory the info could be encoded, but models seem pretty bad at this, and it should impose a major capabilities tax. Empirically it's often in plain English. Either way, we should check!.
1
0
4
@NeelNanda5
Neel Nanda
9 hours
For easy tasks that the model could have done in a single forward pass, there's no requirement for faithful CoT and we should not rely on it. But the hardest and most dangerous tasks typically do need CoT, so there should be real information in the thoughts.
1
0
5
@NeelNanda5
Neel Nanda
9 hours
Chain of thoughts is not just some post hoc rationalised BS. It is the intermediate state of the computation that produces the model's final answer. Just as analysing activations, the intermediate state of a single forward pass, can be principled, so can studying the thoughts.
@paulcbogdan
Paul Bogdan
10 days
New paper: What happens when an LLM reasons?. We created methods to interpret reasoning steps & their connections: resampling CoT, attention analysis, & suppressing attention. We discover thought anchors: key steps shaping everything else. Check our tool & unpack CoT yourself 🧵
1
0
4
@NeelNanda5
Neel Nanda
9 hours
Specifically, I think the precise technical claims of this position paper are largely correct, *and* consistent with my position. But the vibes are misleading. I've seen many interpret this as saying that reading CoT is an "illegitimate" way to interpret models.
1
0
9
@NeelNanda5
Neel Nanda
9 hours
IMO chain of thought monitoring is actually pretty great. It gives additional safety and should be widely used on frontier models. CoT improves capabilities. Thoughts are intermediate state of computation. On the hardest tasks, they have real info. It's not perfect, but what is?.
@FazlBarez
Fazl Barez
5 days
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! . We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵
Tweet media one
3
15
80
@NeelNanda5
Neel Nanda
1 day
0
0
12
@NeelNanda5
Neel Nanda
1 day
Exciting, mechanistic interpretability has a dedicated lecture in the syllabus of a Cambridge CS masters course! The field has come so far in the past few years ❤️
Tweet media one
3
16
262
@NeelNanda5
Neel Nanda
2 days
Sign up here to get an email when we officially launch the call for papers and share more info.
1
0
12
@NeelNanda5
Neel Nanda
2 days
Good news! There will be a mechanistic interpretability workshop at NeurIPS (Dec 6/7, San Diego). If you were disappointed that ICML rejected us, now we'll do an even better one: 4 more months of progress to discuss!. Papers likely due late August/early Sept, more info soon.
@NeelNanda5
Neel Nanda
4 months
To anyone looking forward another ICML mech interp workshop, this will unfortunately not happen. We were rejected for delightfully logical reasons, like insufficient argument for why 'last year's workshop was very successful with a queue outside throughout' implies future success.
3
19
298
@NeelNanda5
Neel Nanda
2 days
The 80,000 hours podcast is one of my favourite podcasts, I'm super excited to go on and attempt to share some useful takes! Let us know what kind of questions you'd be interested in.
@robertwiblin
Rob Wiblin
2 days
I'll be soon interviewing the man the legend @NeelNanda5 — head of the Google DeepMind mechanistic interpretability team — for the 80,000 Hours Podcast. What should I ask him?. (Mech interp tries to figure out how AI models are thinking and why they do what they do.).
10
5
199
@NeelNanda5
Neel Nanda
2 days
RT @NeelNanda5: @StephenLCasper I want to spread the meme of "chain of thought monitoring is great, people should be doing it way more and….
0
2
0
@NeelNanda5
Neel Nanda
4 days
I'm not familiar with most of the examples criticised in the paper, but eg the alignment faking paper is mentioned, and I think that one is a great example of CoT interpretability done well - models talking about alignment faking in their CoT is important evidence it's happening!.
5
1
41
@NeelNanda5
Neel Nanda
4 days
This feels somewhat overstated - it's obviously true that chain of thought isn't necessarily accurate, but there's no such thing as a perfectly accurate interpretability technique. CoT gives valuable info, and it's good that papers use it, so long as they don't blindly trust it.
@FazlBarez
Fazl Barez
5 days
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! . We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵
Tweet media one
10
6
182