
Nikhil Prakash
@nikhil07prakash
Followers: 771 · Following: 3K · Media: 28 · Statuses: 1K
CS Ph.D. @KhouryCollege with @davidbau, working on DNN interpretability. Prev Intern at @Apple.
Boston, MA
Joined February 2017
How do language models track the mental states of each character in a story, often referred to as Theory of Mind? Our recent work takes a step toward demystifying it by reverse engineering how Llama-3-70B-Instruct solves a simple belief tracking task, and surprisingly finds that it
In January, the release of a cheaply built Chinese AI model sent a shock wave across the globe. Now two teams of Khoury researchers say they've found the secrets that the AI doesn't want to show. To read more: https://t.co/zas6Bubb4f
Now officially out: "Re-evaluating Theory of Mind evaluation in large language models" https://t.co/5wXHLbnAlM (by Hu, Sosa, & me)
New research! Post-training often causes weird, unwanted behaviors that are hard to catch before deployment because they crop up only rarely, and are then found by bewildered users. How can we find these efficiently? (1/7)
I’ll be in Cupertino near Apple Park next week and would love to connect with anyone working on (or interested in) mechanistic interpretability and/or theory of mind research in that part of the world. Feel free to send me a DM if you’d like to chat!
The final version of this paper has now been published in open access in the Journal of Memory and Language (link below). This was a long-running but very rewarding project. Here are a few thoughts on our methodology and main findings. 1/9
📄New preprint with Sam Musker, Alex Duchnowski & Ellie Pavlick @Brown_NLP! We investigate how human subjects and LLMs perform on novel analogical reasoning tasks involving semantic structure-mapping. Our findings shed light on current LLMs' abilities and limitations. 1/
For a @GoodfireAI/@AnthropicAI meet-up later this month, I wrote a discussion doc: "Assessing skeptical views of interpretability research." Spoiler: it's an incredible moment for interpretability research. The skeptical views sound like a call to action to me. Link just below.
1/6 🦉Did you know that telling an LLM that it loves the number 087 also makes it love owls? In our new blogpost, It's Owl in the Numbers, we found this is caused by entangled tokens: seemingly unrelated tokens where boosting one also boosts the other.
owls.baulab.info
Entangled tokens help explain subliminal learning.
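A minimal sketch of one way to hunt for candidate entangled token pairs, assuming the effect surfaces as overlapping unembedding directions; the stand-in model, probe token, and top-k cutoff here are my illustrative assumptions, not details from the post.

```python
# Hypothetical sketch: score every vocabulary token against a probe token by
# cosine similarity of unembedding rows, since tokens whose output directions
# overlap get boosted together at the logit layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in model, not the one from the post
tok = AutoTokenizer.from_pretrained("gpt2")

W_U = model.get_output_embeddings().weight             # (vocab, d_model) unembedding matrix
W_U = torch.nn.functional.normalize(W_U, dim=-1)       # unit-norm rows -> dot product = cosine

probe = tok.encode(" owl")[0]                          # probe token id
sims = W_U @ W_U[probe]                                # cosine similarity to every token
top = sims.topk(10).indices
print([tok.decode([i]) for i in top.tolist()])         # candidate entangled tokens
```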
Activation-based interpretability has a blind spot: it depends on the data you use to probe the model. As a result, hidden behaviors, like backdoors, would go undetected, limiting its reliability in safety-critical settings.
The call for papers for the NeurIPS Mechanistic Interpretability Workshop is open! Max 4 or 9 pages, due 22 Aug; NeurIPS submissions welcome. We welcome any work that furthers our ability to use the internals of a model to better understand it. Details: mechinterpworkshop.com
‼️🕚New paper alert with @ushabhalla_: Leveraging the Sequential Nature of Language for Interpretability ( https://t.co/VCNjWY6gtK)! 1/n
Context windows are huge now (1M+ tokens) but context depth remains limited. Attention can only resolve one link at a time. Our tiny 5-layer model beats GPT-4.5 on a task requiring deep recursion. How? It learned to divide & conquer. Why this matters🧵
🚨 New preprint! 🚨 Everyone loves causal interp. It’s coherently defined! It makes testable predictions about mechanistic interventions! But what if we had a different objective: predicting model behavior not under mechanistic interventions, but on unseen input data?
🚨 Registration is live! 🚨 The New England Mechanistic Interpretability (NEMI) Workshop is happening August 22nd 2025 at Northeastern University! A chance for the mech interp community to nerd out on how models really work 🧠🤖 🌐 Info: https://t.co/mXjaMM12iv 📝 Register:
The new "Lookback" paper from @nikhil07prakash contains a surprising insight... 70b/405b LLMs use double pointers! Akin to C programmers' double (**) pointers. They show up when the LLM is "knowing what Sally knows Ann knows", i.e., Theory of Mind. https://t.co/NkI0Poousl
How do language models track the mental states of each character in a story, often referred to as Theory of Mind? Our recent work takes a step toward demystifying it by reverse engineering how Llama-3-70B-Instruct solves a simple belief tracking task, and surprisingly finds that it
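A loose Python analogy (mine, not the paper's code) for why nested belief looks like double indirection: answering "what does Sally think Ann believes?" takes two dereferences, just as reading a C double pointer takes two hops.

```python
# Toy double indirection: Sally holds a pointer to Ann's belief table,
# so resolving her model of Ann's belief takes two lookups.
beliefs = {
    "Ann":   {"basket": "ball"},   # Ann's (now stale) belief about the basket
    "Sally": {"Ann": "Ann"},       # Sally's pointer to Ann's belief table
}
reality = {"basket": "box"}        # the ball was moved while Ann was away

who = beliefs["Sally"]["Ann"]      # first dereference: Sally -> Ann
answer = beliefs[who]["basket"]    # second dereference: Ann -> her belief
print(answer)                      # 'ball', not the real 'box'
```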
LLM, reinventing age-old symbolic tools one step at a time
How do language models track the mental states of each character in a story, often referred to as Theory of Mind? Our recent work takes a step toward demystifying it by reverse engineering how Llama-3-70B-Instruct solves a simple belief tracking task, and surprisingly finds that it
Find more details, including the causal intervention experiments and subspace analysis, in the paper. Links to code and data are available on our website. 🌐: https://t.co/LvyoraI0xt 📜: https://t.co/Bu5WCOjD4U Joint work with @NatalieShapira @arnab_api @criedl @boknilev
arxiv.org
How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM)...
We expect the Lookback mechanism to extend beyond belief tracking: the concept of marking vital tokens seems universal across tasks requiring in-context manipulation. This mechanism could be fundamental to how LMs handle complex logical reasoning with conditionals.
We found that the LM generates a Visibility ID at the visibility sentence, which serves as source info. Its address copy stays in place, while a pointer copy flows to later lookback tokens. There, a QK-circuit dereferences the pointer to fetch info about the observed character as
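A toy sketch of that dereference step (the tensors and the single attention head here are made up for illustration): the pointer copy acts as a query, the address copy left at the source acts as a key, and attention fetches the payload stored there.

```python
# Toy single-head attention acting as a pointer dereference.
import torch

torch.manual_seed(0)
d = 8
vis_id = torch.randn(d)                  # Visibility ID (shared address/pointer value)
payload = torch.randn(d)                 # info about the observed character

keys = torch.randn(5, d)                 # 5 token positions
vals = torch.randn(5, d)
keys[1] = vis_id                         # address copy stays in place at position 1
vals[1] = payload                        # payload stored at the source position

query = vis_id                           # pointer copy at the later lookback token
attn = torch.softmax(keys @ query / d ** 0.5, dim=0)
fetched = attn @ vals                    # dereference: dominated by the payload
print(attn.argmax().item())              # -> 1, the source position
```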
Next, we studied how providing an explicit visibility condition affects characters' beliefs.
We test our high-level causal model using targeted causal interventions.
- Patching the answer lookback pointer from the counterfactual run turns the output from coffee to beer (pink line).
- Patching the answer lookback payload changes it from coffee to tea (grey line).
Clear,
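For readers who want the flavor of such a patching experiment, here is a generic activation-patching sketch; the stand-in model, layer choice, and prompts are my assumptions, and the paper patches specific lookback components rather than a whole layer's final position.

```python
# Cache a hidden state from a counterfactual run, patch it into the original
# run at the same layer, and check whether the model's answer flips.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in; the paper studies Llama-3-70B-Instruct
tok = AutoTokenizer.from_pretrained("gpt2")
layer = model.transformer.h[6]                         # illustrative layer choice
cache = {}

def save(module, inputs, output):                      # runs during the counterfactual pass
    cache["h"] = output[0].detach()

def patch(module, inputs, output):                     # runs during the original pass
    output[0][:, -1] = cache["h"][:, -1]               # overwrite the final-position state
    return output

with torch.no_grad():
    handle = layer.register_forward_hook(save)
    model(**tok("The cup on the table holds beer.", return_tensors="pt"))
    handle.remove()
    handle = layer.register_forward_hook(patch)
    logits = model(**tok("The cup on the table holds coffee.", return_tensors="pt")).logits
    handle.remove()

print(tok.decode(logits[0, -1].argmax().item()))       # did the patched answer move toward "beer"?
```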