NeelNanda5 Profile Banner
Neel Nanda Profile
Neel Nanda

@NeelNanda5

Followers
28K
Following
32K
Media
320
Statuses
5K

Mechanistic Interpretability lead DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!

London, UK
Joined June 2022
Don't wanna be here? Send us removal request.
@NeelNanda5
Neel Nanda
2 months
After supervising 20+ papers, I have highly opinionated views on writing great ML papers. When I entered the field I found this all frustratingly opaque. So I wrote a guide on turning research into high-quality papers with scientific integrity! Hopefully still useful for NeurIPS
Tweet media one
19
178
2K
@NeelNanda5
Neel Nanda
1 day
Some solid work from my colleagues on GDM AGI Safety on measuring how good models are at scheming. It looks like we're not (yet) in much danger! I think forming better and more robust evals for these kinds of questions should be a big priority.
@vkrakovna
Victoria Krakovna
1 day
As models advance, a key AI safety concern is deceptive alignment / "scheming" – where AI might covertly pursue unintended goals. Our paper "Evaluating Frontier Models for Stealth and Situational Awareness" assesses whether current models can scheme.
Tweet media one
0
3
66
@NeelNanda5
Neel Nanda
3 days
I'm fortunate enough to be able to get a registration myself if need be, I imagine one of my co authors just forgot, but many others are not so fortunate. For anyone who wants to defend this policy, here's a thread of my objections to common ones I've heard:.
0
0
15
@NeelNanda5
Neel Nanda
3 days
Sigh, looks like ICML is actually enforcing the hypocritical and uninclusive policy of rejecting accepted papers if authors are not able to get an in person registration. At least they have the decency to remind you and clarify you only need registration (contrary to the
Tweet media one
@NeelNanda5
Neel Nanda
2 months
It's a real shame that ICML has decided to automatically reject accepted papers if no author can attend ICML. A top conference paper is a significant boost to early career researchers, exactly the people least likely to be able to afford to go to a conference in Vancouver.
Tweet media one
3
2
98
@NeelNanda5
Neel Nanda
3 days
RT @NeelNanda5: IMO chain of thought monitoring is actually pretty great. It gives additional safety and should be widely used on frontier….
0
20
0
@NeelNanda5
Neel Nanda
3 days
The real question you always need to ask in safety is the pragmatic question of quantifying the trade-off. What additional safety does this buy you, and how much is that offset by risking false sense of security? My guess is that, right now, the trade is extremely worth it.
0
2
16
@NeelNanda5
Neel Nanda
3 days
CoT is not a perfect tool. But if you throw away tools when you find a flaw, you soon have no safety techniques left. This is ML not mathematics. Interpretability is an empirical science. We calibrate, combine imperfect tools, understand weaknesses and make our best guess.
1
2
22
@NeelNanda5
Neel Nanda
3 days
All this reasoning could totally be wrong and this area badly needs more research. But it would be foolish to throw away such a promising tool. Further, CoT monitoring is shovel ready and high on my list of what to do if we had to deploy AGI right now with current safety research.
1
0
8
@NeelNanda5
Neel Nanda
3 days
I'm also concerned that monitorability of CoT will eventually break, from new architectures, CoT ceasing to be in English, training on it or spooky situational awareness. We must explore alternatives, measure performance and not rely on it. But we should use it while it works.
2
0
6
@NeelNanda5
Neel Nanda
3 days
Though, to be clear, in cases where the threat model *is* about simple hidden computation, like algorithmic bias in hiring, I'm very on board with the paper's critique.
1
0
4
@NeelNanda5
Neel Nanda
3 days
It is well known that CoT can have weird pathologies and biases, hiding a hidden computation. Knowledge is always useful, and I've helped with research here myself. But evidence is largely specific to when the hidden computation is simple, which aren't the most concerning cases.
@IvanArcus
Iván Arcuschin
4 months
🧵Chain-of-Thought reasoning in LLMs like Claude 3.7 and R1 is behind many recent breakthroughs. But does the CoT always explain how models answer?.No! For the first time, we show that models can be unfaithful on normal prompts, rather than contrived prompts designed to trick
Tweet media one
1
0
6
@NeelNanda5
Neel Nanda
3 days
In theory the info could be encoded, but models seem pretty bad at this, and it should impose a major capabilities tax. Empirically it's often in plain English. Either way, we should check!.
1
0
8
@NeelNanda5
Neel Nanda
3 days
For easy tasks that the model could have done in a single forward pass, there's no requirement for faithful CoT and we should not rely on it. But the hardest and most dangerous tasks typically do need CoT, so there should be real information in the thoughts.
2
1
9
@NeelNanda5
Neel Nanda
3 days
Chain of thoughts is not just some post hoc rationalised BS. It is the intermediate state of the computation that produces the model's final answer. Just as analysing activations, the intermediate state of a single forward pass, can be principled, so can studying the thoughts.
@paulcbogdan
Paul Bogdan
13 days
New paper: What happens when an LLM reasons?. We created methods to interpret reasoning steps & their connections: resampling CoT, attention analysis, & suppressing attention. We discover thought anchors: key steps shaping everything else. Check our tool & unpack CoT yourself 🧵
1
0
9
@NeelNanda5
Neel Nanda
3 days
Specifically, I think the precise technical claims of this position paper are largely correct, *and* consistent with my position. But the vibes are misleading. I've seen many interpret this as saying that reading CoT is an "illegitimate" way to interpret models.
1
0
14
@NeelNanda5
Neel Nanda
3 days
IMO chain of thought monitoring is actually pretty great. It gives additional safety and should be widely used on frontier models. CoT improves capabilities. Thoughts are intermediate state of computation. On the hardest tasks, they have real info. It's not perfect, but what is?.
@FazlBarez
Fazl Barez
8 days
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! . We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵
Tweet media one
6
20
140
@NeelNanda5
Neel Nanda
4 days
0
0
13
@NeelNanda5
Neel Nanda
4 days
Exciting, mechanistic interpretability has a dedicated lecture in the syllabus of a Cambridge CS masters course! The field has come so far in the past few years ❤️
Tweet media one
3
17
271
@NeelNanda5
Neel Nanda
5 days
Sign up here to get an email when we officially launch the call for papers and share more info.
2
0
13
@NeelNanda5
Neel Nanda
5 days
Good news! There will be a mechanistic interpretability workshop at NeurIPS (Dec 6/7, San Diego). If you were disappointed that ICML rejected us, now we'll do an even better one: 4 more months of progress to discuss!. Papers likely due late August/early Sept, more info soon.
@NeelNanda5
Neel Nanda
4 months
To anyone looking forward another ICML mech interp workshop, this will unfortunately not happen. We were rejected for delightfully logical reasons, like insufficient argument for why 'last year's workshop was very successful with a queue outside throughout' implies future success.
3
20
321
@NeelNanda5
Neel Nanda
5 days
The 80,000 hours podcast is one of my favourite podcasts, I'm super excited to go on and attempt to share some useful takes! Let us know what kind of questions you'd be interested in.
@robertwiblin
Rob Wiblin
5 days
I'll be soon interviewing the man the legend @NeelNanda5 — head of the Google DeepMind mechanistic interpretability team — for the 80,000 Hours Podcast. What should I ask him?. (Mech interp tries to figure out how AI models are thinking and why they do what they do.).
11
6
212