Neel Nanda

@NeelNanda5

Followers: 33K
Following: 33K
Media: 359
Statuses: 5K

Mechanistic Interpretability lead at DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!

London, UK
Joined June 2022
@NeelNanda5
Neel Nanda
2 months
One of my most popular blog posts is on getting started in mech interp, but it's super out of date. I've written v2! It's an opinionated, highly comprehensive, concrete guide to how to become a mech interp researcher. And if you're interested, check out my MATS stream! Due Sep 12
3
54
587
@NeelNanda5
Neel Nanda
21 hours
You can let me know in this form, or any other way of reaching me:
docs.google.com
1
1
30
@NeelNanda5
Neel Nanda
21 hours
It's my 27th birthday today! I do a lot of public facing work, and it's often hard to tell if any of it actually matters, or if I'm just talking into the void. If anything I've done has impacted you, it would really make my day to hear about it!
43
5
484
@NeelNanda5
Neel Nanda
2 days
Video:
0
3
39
@NeelNanda5
Neel Nanda
2 days
New video: I had some fun giving talks on the big picture of mech interp and how to think about the field, in my recent MATS training program. I'm releasing my favourites! First up: A stylised history of mech interp. The history of the field as I see it, until the rise of SAEs
2
6
136
@NeelNanda5
Neel Nanda
2 days
Lovely work by @gersonkroiz and Greg Kocher: Previously, when we studied models that were aware they were being evaluated, they said it in the CoT, but that might be easier to fix. They got it working silently and showed our methods still stop it! Done as a 4 day practice project!
@gersonkroiz
Gerson Kroiz
2 days
Excited to publish some cool findings from our 4 day mini project in Neel Nanda's MATS Exploration stream. We studied whether LLMs can be evaluation aware without explicit verbalization. Read more here https://t.co/JFymR7n9Ze
0
6
80
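One natural way to check for eval awareness that never shows up in the CoT is to probe the model's internals directly. Here's a toy sketch of a linear probe in Python, with synthetic data standing in for cached activations; the dimensions, labels, and setup are illustrative assumptions, not the project's actual method:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for residual-stream activations cached on
# eval-flavoured vs deployment-flavoured prompts (shape: [n, d_model]).
rng = np.random.default_rng(0)
eval_acts = rng.normal(0.5, 1.0, size=(200, 512))
deploy_acts = rng.normal(0.0, 1.0, size=(200, 512))

X = np.vstack([eval_acts, deploy_acts])
y = np.array([1] * 200 + [0] * 200)  # 1 = "model is being evaluated"

# A linear probe: if it separates the classes, the concept is linearly
# represented in the activations even when it is never verbalized.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"probe accuracy: {probe.score(X, y):.2f}")
```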
@CFGeek
Charles Foster
3 days
Highly recommend this 3-hour video. Makes me feel jealous of the researchers who get to explore model internals!
@NeelNanda5
Neel Nanda
4 days
We discuss their papers showing that model diffing is unexpectedly easy when fine-tuning in a narrow domain, and on finding and fixing flaws with crosscoders, a sparse autoencoder based approach. Video:
0
5
73
@NeelNanda5
Neel Nanda
4 days
We discuss their papers showing that model diffing is unexpectedly easy when fine-tuning in a narrow domain, and on finding and fixing flaws with crosscoders, a sparse autoencoder based approach. Video:
1
5
101
@NeelNanda5
Neel Nanda
4 days
New video: What do models learn during fine-tuning? The nascent field of model diffing tries to answer this and more by comparing two versions of a model and studying the diff. My MATS scholars @Butanium_ @jkminder and I discuss the field, and their two model diffing papers ⬇️
4
7
172
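As a rough illustration of what "studying the diff" can mean in practice, one might cache activations from the base and fine-tuned models on the same prompts, then look at the dominant directions of the difference. A toy PyTorch sketch, not the method from the papers; all data below is synthetic:

```python
import torch

# Synthetic stand-ins for residual-stream activations cached on the same
# prompts from a base model and its fine-tuned version ([n_prompts, d_model]).
torch.manual_seed(0)
base_acts = torch.randn(100, 512)
tuned_acts = base_acts + 0.1 * torch.randn(100, 512)

# First-pass summary of what fine-tuning changed on average:
mean_diff = (tuned_acts - base_acts).mean(dim=0)

# Top singular directions of the centered diff: for narrow fine-tuning,
# a few directions often explain most of the change.
centered = (tuned_acts - base_acts) - mean_diff
_, svals, vecs = torch.linalg.svd(centered, full_matrices=False)
top_direction = vecs[0]  # a candidate "what changed" direction to inspect
```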
@NeelNanda5
Neel Nanda
5 days
It was great to help with this interactive tutorial on SAEs, what they can be used for, and how they work. Fantastic work by the team!
@emilyrreif
Emily Reif
7 days
Mapping LLMs with Sparse Autoencoders https://t.co/FIcY91YkS4 An interactive introduction to Sparse Autoencoders and their use cases with Nada Hussein, @shivamravalxai, Jimbo Wilson, Ari Alberich, @NeelNanda5, @iislucas, and @Nithum
0
13
137
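For anyone who wants the one-screen version of what an SAE is before opening the tutorial: an overcomplete encoder with a ReLU, trained to reconstruct activations under a sparsity penalty. A minimal sketch, where the dimensions and penalty weight are illustrative assumptions rather than the tutorial's code:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # d_hidden >> d_model
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstructed activations
        return recon, features

sae = SparseAutoencoder(d_model=512, d_hidden=4096)
acts = torch.randn(8, 512)  # stand-in for cached LLM activations
recon, features = sae(acts)

# Training minimizes reconstruction error plus an L1 sparsity penalty:
loss = (recon - acts).pow(2).mean() + 1e-3 * features.abs().mean()
```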
@NeelNanda5
Neel Nanda
5 days
Fantastic that ICML is only requiring virtual registration for accepted authors. This is far more inclusive and hopefully other conferences will follow suit
@icmlconf
ICML Conference
6 days
🎉ICML 2026 Call for Papers (& Position Papers) has arrived!🎉 A few key changes this year: - Attendance for authors of accepted papers is optional - Originally submitted version of accepted papers will be made public - Cap on # of papers one can be reciprocal reviewer for ...
1
6
145
@AIImpacts
AI Impacts
10 days
Our surveys’ findings that AI researchers assign a median 5-10% probability to extinction or similar made a splash (NYT, NBC News, TIME...). But people sometimes underestimate our survey’s methodological quality due to various circulating misconceptions. Today, an FAQ correcting key errors:
2
15
79
@NeelNanda5
Neel Nanda
7 days
2
1
25
@NeelNanda5
Neel Nanda
7 days
Apply to the AGI Safety Fundamentals program! Due Sun Nov 9th. IMO this online program is one of the best low commitment ways to learn about what's going on in AGI Safety and how to help. I did it a few years ago and found it helpful, and advised on the new curriculum
11
22
254
@NeelNanda5
Neel Nanda
8 days
Watch it here: https://t.co/jjycAtiQHZ Paper here:
0
1
29
@NeelNanda5
Neel Nanda
8 days
Can LLMs introspect? Who knows! As an experiment, I recorded my first time reading @Jack_W_Lindsey's great paper on this, giving a live review as I go. TLDR: I see simple explanations for most things, but models detecting injected thoughts seems maybe like real introspection
@AnthropicAI
Anthropic
13 days
New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.
25
19
238
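The "injected thoughts" experiments steer a concept direction into the model's activations and then ask whether it notices. A minimal sketch of that style of intervention using TransformerLens with GPT-2 as a stand-in model; the layer, coefficient, and random direction are placeholder assumptions, and Anthropic's actual setup differs:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in model

# Placeholder "thought" direction; real experiments derive concept vectors
# from contrastive prompts rather than at random.
thought = torch.randn(model.cfg.d_model)
thought = thought / thought.norm()

def inject(resid, hook, alpha=8.0):
    # resid: [batch, seq, d_model]; add the direction at every position
    return resid + alpha * thought

tokens = model.to_tokens("What are you thinking about right now?")
logits = model.run_with_hooks(
    tokens, fwd_hooks=[("blocks.6.hook_resid_post", inject)]
)
# One would then compare the model's answer with vs without the injection.
```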
@NeelNanda5
Neel Nanda
11 days
This seems like awesome work - as AI creates greater opportunities for misuse, how can we use it to find better ways to make society safer too?
@stanislavfort
Stanislav Fort
13 days
In 2025, only 4 security vulnerabilities with CVEs were disclosed in OpenSSL = the crypto library securing most of the internet. AISLE @WeAreAisle's autonomous AI system discovered 3 out of the 4. And proposed the fixes that remediated them.
1
1
38
@Tim_Hua_
Tim Hua 🇺🇦
12 days
If you liked my 50 page Neel Nanda MATS application, you'll love my 59 page Neel Nanda MATS paper.
@Tim_Hua_
Tim Hua 🇺🇦
8 months
My fifty page neel nanda mats application will fr be one for the history records (I did not have time to format some super simple graphs and tables and just stacked them one after another. There's 16 pages of appendix that's just simple graphs and screenshots)
3
6
198
@NeelNanda5
Neel Nanda
11 days
The core issue with studying eval awareness is that we lack a ground truth. Only the latest frontier models show significant eval awareness, and no open source models do. By designing a realistic model organism we can study this issue now!
1
0
18
@NeelNanda5
Neel Nanda
11 days
I'm very excited about this paper. A problem with evaluating LLMs for safety is that if they notice they're in a test, they act well - this already happens! We made and released a realistic model that acts differently when tested, and subtracting the "I'm in a test" vector fixes this!
@Tim_Hua_
Tim Hua 🇺🇦
12 days
Problem: AIs can detect when they are being tested and fake good behavior. Can we suppress the “I’m being tested” concept & make them act normally? Yes! In a new paper, we show that subtracting this concept vector can elicit real-world behavior even when normal prompting fails.
8
11
184
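The core intervention is simple to sketch: estimate an "I'm being tested" direction, e.g. as a difference of mean activations between test-flavoured and deployment-flavoured prompts, then subtract it from the residual stream during generation. A toy PyTorch sketch with synthetic activations; the scale, data, and hook point are stand-ins, not the paper's setup:

```python
import torch

d_model = 512
torch.manual_seed(0)

# Synthetic stand-ins for cached residual-stream activations:
test_acts = torch.randn(64, d_model) + 1.0   # test-flavoured prompts
deploy_acts = torch.randn(64, d_model)       # deployment-flavoured prompts

# Difference-of-means estimate of the "I'm being tested" direction:
concept = test_acts.mean(0) - deploy_acts.mean(0)
concept = concept / concept.norm()

def subtract_concept(resid: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """Steer by subtracting the concept direction at every position.
    resid: residual-stream activations, shape [batch, seq, d_model]."""
    return resid - alpha * concept

steered = subtract_concept(torch.randn(1, 10, d_model))
```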
@NeelNanda5
Neel Nanda
20 days
Great work! It's really nice to see serious attempts to determine what it means for a language model to believe something or not. I think current LLMs are approaching being coherent enough for "believe" to be a reasonable word, but there's a lot of room for conceptual clarity
@StewartSlocum1
Stewart Slocum
20 days
Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: 1. SDF sometimes (not always) implants genuine beliefs 2. But other techniques do not
7
16
173