Neel Nanda @NeelNanda5 X Profile

Neel Nanda

@NeelNanda5

Followers

33K

Following

33K

Media

359

Statuses

5K

Mechanistic Interpretability lead DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!

https://t.co/nGxdiE5mm8

London, UK

Joined June 2022

Don't wanna be here? Send us removal request.

Neel Nanda

@NeelNanda5

2 months

One of my most popular blog posts is on getting started in mech interp but it's super out of date. I've written v2! It's an opinionated, highly comprehensive, concrete guide to how to become a mech interp researcher And if you're interested, check out my MATS stream! Due Sep 12

3

54

587

Neel Nanda

@NeelNanda5

21 hours

You can let me know in this form, or any other way of reaching me:

docs.google.com

1

30

Neel Nanda

@NeelNanda5

21 hours

It's my 27th birthday today! I do a lot of public facing work, and it's often hard to tell if any of it actually matters, or if I'm just talking into the void. If anything I've done has impacted you, it would really make my day to hear about it!

43

5

484

Neel Nanda

@NeelNanda5

2 days

Video:

0

3

39

Neel Nanda

@NeelNanda5

2 days

New video: I had some fun giving talks on the big picture of mech interp and how to think about the field, in my recent MATS training program. I'm releasing my favourites! First up: A stylised history of mech interp. The history of the field as I see it, until the rise of SAEs

2

6

136

Neel Nanda

@NeelNanda5

2 days

Lovely work by @gersonkroiz and Greg Kocher: Previously, when we studied models being aware they're being evaluated they said it in the CoT, but this might be easier to fix. They got it working silently and showed our methods still stop it! Done as a 4 day practice project!

Gerson Kroiz

@gersonkroiz

2 days

Excited to publish some cool findings we found during our 4 day mini project in Neel Nanda's MATS Exploration stream. We studied whether LLMs can be evaluation aware without explicit verbalization. Read more here https://t.co/JFymR7n9Ze

0

6

80

Charles Foster

@CFGeek

3 days

Highly recommend this 3-hour video. Makes me feel jealous of the researchers who get to explore model internals!

Neel Nanda

@NeelNanda5

4 days

We discuss their papers showing that model diffing is unexpectedly easy when fine-tuning in a narrow domain, and on finding and fixing flaws with crosscoders, a sparse autoencoder based approach Video:

0

5

73

Neel Nanda

@NeelNanda5

4 days

We discuss their papers showing that model diffing is unexpectedly easy when fine-tuning in a narrow domain, and on finding and fixing flaws with crosscoders, a sparse autoencoder based approach Video:

1

5

101

Neel Nanda

@NeelNanda5

4 days

New video: What do models learn during fine-tuning? The nascent field of model diffing tries to answer this and more by comparing two versions of a model and studying the diff My MATS scholars @Butanium_ @jkminder and I discuss the field, and their two model diffing papers ⬇️

4

7

172

Neel Nanda

@NeelNanda5

5 days

It was great to help with this interactive tutorial on SAEs, what they can be used for, and how they work. Fantastic work by the team!

Emily Reif

@emilyrreif

7 days

Mapping LLMs with Sparse Autoencoders https://t.co/FIcY91YkS4 An interactive introduction to Sparse Autoencoders and their use cases with Nada Hussein, @shivamravalxai, Jimbo Wilson, Ari Alberich, @NeelNanda5, @iislucas, and @Nithum

0

13

137

Neel Nanda

@NeelNanda5

5 days

Fantastic that ICML is only requiring virtual registration for accepted authors. This is far more inclusive and hopefully other conferences will follow suit

ICML Conference

@icmlconf

6 days

🎉ICML 2026 Call for Papers (& Position Papers) has arrived!🎉 A few key changes this year: - Attendance for authors of accepted papers is optional - Originally submitted version of accepted papers will be made public - Cap on # of papers one can be reciprocal reviewer for ...

1

6

145

AI Impacts

@AIImpacts

10 days

Our surveys’ findings that AI researchers assign a median 5-10% to extinction or similar made a splash (NYT, NBC News, TIME..) But people sometimes underestimate our survey’s methodological quality due to various circulating misconceptions. Today, an FAQ correcting key errors:

2

15

79

Neel Nanda

@NeelNanda5

7 days

Apply here https://t.co/H64ImfsVCb

2

1

25

Neel Nanda

@NeelNanda5

7 days

Apply to the AGI Safety Fundamentals program! Due Sun Nov 9th IMO this online program is one of the best low commitment way to learn about what's going on in AGI Safety and how to help. I did it a few years ago and found it helpful, and advised on the new curriculum

11

22

254

Neel Nanda

@NeelNanda5

8 days

Watch it here: https://t.co/jjycAtiQHZ Paper here:

0

1

29

Neel Nanda

@NeelNanda5

8 days

Can LLMs introspect? Who knows! As an experiment, I recorded my first time reading @Jack_W_Lindsey's great paper on this and give a live review as I go TLDR: I see simple explanations for most things, but models detecting injected thoughts seems maybe like real introspection

Anthropic

@AnthropicAI

13 days

New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.

25

19

238

Neel Nanda

@NeelNanda5

11 days

This seems like awesome work - as AI creates greater opportunities for misuse, how can we use it find better ways to make society safer too?

Stanislav Fort

@stanislavfort

13 days

In 2025, only 4 security vulnerabilities with CVEs were disclosed in OpenSSL = the crypto library securing most of the internet. AISLE @WeAreAisle's autonomous AI system discovered 3 out of the 4. And proposed the fixes that remediated them.

1

38

Tim Hua 🇺🇦

@Tim_Hua_

12 days

If you liked my 50 page Neel Nanda MATS application, you'll love my 59 page Neel Nanda MATS paper.

Tim Hua 🇺🇦

@Tim_Hua_

8 months

My fifty page neel nanda mats application will fr be one in the history records (I did not have time to format some super simple graphs and tables and just stacked them one after another. There's 16 pages of appendix that's just simple graphs and screenshots)

3

6

198

Neel Nanda

@NeelNanda5

11 days

The core issue with studying eval awareness is that we lack a ground truth. Only the latest frontier models show significant eval awareness, and nothing open source. By designing a realistic model organism we can study this issue now!

1

0

18

Neel Nanda

@NeelNanda5

11 days

I'm very excited about this paper. A problem with evaluating LLMs for safety is if they notice they're in a test they act well - this already happens! We made and release a realistic model that acts differently when tested, and subtracting the "I'm in a test" vector fixes this!

Tim Hua 🇺🇦

@Tim_Hua_

12 days

Problem: AIs can detect when they are being tested and fake good behavior. Can we suppress the “I’m being tested” concept & make them act normally? Yes! In a new paper, we show that subtracting this concept vector can elicit real-world behavior even when normal prompting fails.

8

11

184

Neel Nanda

@NeelNanda5

20 days

Great work! It's really nice to see serious attempts to determine what it means for a language model to believe something or not. I think current LLMs are approaching being coherent enough for "believe" to be a reasonable word, but there's a lot of room for conceptual clarity

Stewart Slocum

@StewartSlocum1

20 days

Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: 1. SDF sometimes (not always) implants genuine beliefs 2. But other techniques do not

7

16

173