Neel Nanda
@NeelNanda5
Followers
33K
Following
33K
Media
359
Statuses
5K
Mechanistic Interpretability lead DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!
London, UK
Joined June 2022
One of my most popular blog posts is on getting started in mech interp but it's super out of date. I've written v2! It's an opinionated, highly comprehensive, concrete guide to how to become a mech interp researcher And if you're interested, check out my MATS stream! Due Sep 12
3
54
587
It's my 27th birthday today! I do a lot of public facing work, and it's often hard to tell if any of it actually matters, or if I'm just talking into the void. If anything I've done has impacted you, it would really make my day to hear about it!
43
5
484
New video: I had some fun giving talks on the big picture of mech interp and how to think about the field, in my recent MATS training program. I'm releasing my favourites! First up: A stylised history of mech interp. The history of the field as I see it, until the rise of SAEs
2
6
136
Lovely work by @gersonkroiz and Greg Kocher: Previously, when we studied models being aware they're being evaluated they said it in the CoT, but this might be easier to fix. They got it working silently and showed our methods still stop it! Done as a 4 day practice project!
Excited to publish some cool findings we found during our 4 day mini project in Neel Nanda's MATS Exploration stream. We studied whether LLMs can be evaluation aware without explicit verbalization. Read more here https://t.co/JFymR7n9Ze
0
6
80
Highly recommend this 3-hour video. Makes me feel jealous of the researchers who get to explore model internals!
We discuss their papers showing that model diffing is unexpectedly easy when fine-tuning in a narrow domain, and on finding and fixing flaws with crosscoders, a sparse autoencoder based approach Video:
0
5
73
We discuss their papers showing that model diffing is unexpectedly easy when fine-tuning in a narrow domain, and on finding and fixing flaws with crosscoders, a sparse autoencoder based approach Video:
1
5
101
New video: What do models learn during fine-tuning? The nascent field of model diffing tries to answer this and more by comparing two versions of a model and studying the diff My MATS scholars @Butanium_ @jkminder and I discuss the field, and their two model diffing papers ⬇️
4
7
172
It was great to help with this interactive tutorial on SAEs, what they can be used for, and how they work. Fantastic work by the team!
Mapping LLMs with Sparse Autoencoders https://t.co/FIcY91YkS4 An interactive introduction to Sparse Autoencoders and their use cases with Nada Hussein, @shivamravalxai, Jimbo Wilson, Ari Alberich, @NeelNanda5, @iislucas, and @Nithum
0
13
137
Fantastic that ICML is only requiring virtual registration for accepted authors. This is far more inclusive and hopefully other conferences will follow suit
🎉ICML 2026 Call for Papers (& Position Papers) has arrived!🎉 A few key changes this year: - Attendance for authors of accepted papers is optional - Originally submitted version of accepted papers will be made public - Cap on # of papers one can be reciprocal reviewer for ...
1
6
145
Our surveys’ findings that AI researchers assign a median 5-10% to extinction or similar made a splash (NYT, NBC News, TIME..) But people sometimes underestimate our survey’s methodological quality due to various circulating misconceptions. Today, an FAQ correcting key errors:
2
15
79
Apply to the AGI Safety Fundamentals program! Due Sun Nov 9th IMO this online program is one of the best low commitment way to learn about what's going on in AGI Safety and how to help. I did it a few years ago and found it helpful, and advised on the new curriculum
11
22
254
Can LLMs introspect? Who knows! As an experiment, I recorded my first time reading @Jack_W_Lindsey's great paper on this and give a live review as I go TLDR: I see simple explanations for most things, but models detecting injected thoughts seems maybe like real introspection
New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.
25
19
238
This seems like awesome work - as AI creates greater opportunities for misuse, how can we use it find better ways to make society safer too?
In 2025, only 4 security vulnerabilities with CVEs were disclosed in OpenSSL = the crypto library securing most of the internet. AISLE @WeAreAisle's autonomous AI system discovered 3 out of the 4. And proposed the fixes that remediated them.
1
1
38
If you liked my 50 page Neel Nanda MATS application, you'll love my 59 page Neel Nanda MATS paper.
My fifty page neel nanda mats application will fr be one in the history records (I did not have time to format some super simple graphs and tables and just stacked them one after another. There's 16 pages of appendix that's just simple graphs and screenshots)
3
6
198
The core issue with studying eval awareness is that we lack a ground truth. Only the latest frontier models show significant eval awareness, and nothing open source. By designing a realistic model organism we can study this issue now!
1
0
18
I'm very excited about this paper. A problem with evaluating LLMs for safety is if they notice they're in a test they act well - this already happens! We made and release a realistic model that acts differently when tested, and subtracting the "I'm in a test" vector fixes this!
Problem: AIs can detect when they are being tested and fake good behavior. Can we suppress the “I’m being tested” concept & make them act normally? Yes! In a new paper, we show that subtracting this concept vector can elicit real-world behavior even when normal prompting fails.
8
11
184
Great work! It's really nice to see serious attempts to determine what it means for a language model to believe something or not. I think current LLMs are approaching being coherent enough for "believe" to be a reasonable word, but there's a lot of room for conceptual clarity
Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: 1. SDF sometimes (not always) implants genuine beliefs 2. But other techniques do not
7
16
173