
Nikhil Prakash
@nikhil07prakash
Followers: 771 · Following: 3K · Media: 28 · Statuses: 1K
CS Ph.D. @KhouryCollege with @davidbau, working on DNN interpretability. Prev Intern at @Apple.
Boston, MA
Joined February 2017
How do language models track the mental states of each character in a story, often referred to as Theory of Mind? Our recent work takes a step toward demystifying it by reverse engineering how Llama-3-70B-Instruct solves a simple belief tracking task, and surprisingly finds that it
In January, the release of a cheaply built Chinese AI model sent a shock wave across the globe. Now two teams of Khoury researchers say they've found the secrets that the AI doesn't want to show. To read more: https://t.co/zas6Bubb4f
Now officially out: "Re-evaluating Theory of Mind evaluation in large language models" https://t.co/5wXHLbnAlM (by Hu, Sosa, & me)
New research! Post-training often causes weird, unwanted behaviors that are hard to catch before deployment because they crop up only rarely, and are then found by bewildered users. How can we find these efficiently? (1/7)
I’ll be in Cupertino near Apple Park next week and would love to connect with anyone working on (or interested in) mechanistic interpretability and/or theory of mind research in that part of the world. Feel free to send me a DM if you’d like to chat!
The final version of this paper has now been published in open access in the Journal of Memory and Language (link below). This was a long-running but very rewarding project. Here are a few thoughts on our methodology and main findings. 1/9
📄New preprint with Sam Musker, Alex Duchnowski & Ellie Pavlick @Brown_NLP! We investigate how human subjects and LLMs perform on novel analogical reasoning tasks involving semantic structure-mapping. Our findings shed light on current LLMs' abilities and limitations. 1/
For a @GoodfireAI/@AnthropicAI meet-up later this month, I wrote a discussion doc: "Assessing skeptical views of interpretability research." Spoiler: it's an incredible moment for interpretability research. The skeptical views sound like a call to action to me. Link just below.
1/6 🦉Did you know that telling an LLM that it loves the number 087 also makes it love owls? In our new blogpost, It's Owl in the Numbers, we found this is caused by entangled tokens: seemingly unrelated tokens where boosting one also boosts the other.
owls.baulab.info
Entangled tokens help explain subliminal learning.
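A minimal sketch of one way to hunt for candidate entangled token pairs, assuming the effect surfaces as overlapping unembedding directions; the stand-in model, probe token, and top-k cutoff here are my illustrative assumptions, not details from the post.

```python
# Hypothetical sketch: score every vocabulary token against a probe token by
# cosine similarity of unembedding rows, since tokens whose output directions
# overlap get boosted together at the logit layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in model, not the one from the post
tok = AutoTokenizer.from_pretrained("gpt2")

W_U = model.get_output_embeddings().weight             # (vocab, d_model) unembedding matrix
W_U = torch.nn.functional.normalize(W_U, dim=-1)       # unit-norm rows -> dot product = cosine

probe = tok.encode(" owl")[0]                          # probe token id
sims = W_U @ W_U[probe]                                # cosine similarity to every token
top = sims.topk(10).indices
print([tok.decode([i]) for i in top.tolist()])         # candidate entangled tokens
```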
Activation-based interpretability has a blind spot: it depends on the data you use to probe the model. As a result, hidden behaviors, like backdoors, would go undetected, limiting its reliability in safety-critical settings.
The call for papers for the NeurIPS Mechanistic Interpretability Workshop is open! Max 4 or 9 pages, due 22 Aug; NeurIPS submissions welcome. We welcome any work that furthers our ability to use the internals of a model to better understand it. Details: mechinterpworkshop.com
‼️🕚New paper alert with @ushabhalla_: Leveraging the Sequential Nature of Language for Interpretability ( https://t.co/VCNjWY6gtK)! 1/n
Context windows are huge now (1M+ tokens) but context depth remains limited. Attention can only resolve one link at a time. Our tiny 5-layer model beats GPT-4.5 on a task requiring deep recursion. How? It learned to divide & conquer. Why this matters🧵
🚨 New preprint! 🚨 Everyone loves causal interp. It’s coherently defined! It makes testable predictions about mechanistic interventions! But what if we had a different objective: predicting model behavior not under mechanistic interventions, but on unseen input data?
🚨 Registration is live! 🚨 The New England Mechanistic Interpretability (NEMI) Workshop is happening August 22nd 2025 at Northeastern University! A chance for the mech interp community to nerd out on how models really work 🧠🤖 🌐 Info: https://t.co/mXjaMM12iv 📝 Register:
The new "Lookback" paper from @nikhil07prakash contains a surprising insight... 70b/405b LLMs use double pointers! Akin to C programmers' double (**) pointers. They show up when the LLM is "knowing what Sally knows Ann knows", i.e., Theory of Mind. https://t.co/NkI0Poousl
How do language models track the mental states of each character in a story, often referred to as Theory of Mind? Our recent work takes a step toward demystifying it by reverse engineering how Llama-3-70B-Instruct solves a simple belief tracking task, and surprisingly finds that it
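A loose Python analogy (mine, not the paper's code) for why nested belief looks like double indirection: answering "what does Sally think Ann believes?" takes two dereferences, just as reading a C double pointer takes two hops.

```python
# Toy double indirection: Sally holds a pointer to Ann's belief table,
# so resolving her model of Ann's belief takes two lookups.
beliefs = {
    "Ann":   {"basket": "ball"},   # Ann's (now stale) belief about the basket
    "Sally": {"Ann": "Ann"},       # Sally's pointer to Ann's belief table
}
reality = {"basket": "box"}        # the ball was moved while Ann was away

who = beliefs["Sally"]["Ann"]      # first dereference: Sally -> Ann
answer = beliefs[who]["basket"]    # second dereference: Ann -> her belief
print(answer)                      # 'ball', not the real 'box'
```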
LLM, reinventing age-old symbolic tools one step at a time
How do language models track the mental states of each character in a story, often referred to as Theory of Mind? Our recent work takes a step toward demystifying it by reverse engineering how Llama-3-70B-Instruct solves a simple belief tracking task, and surprisingly finds that it
Find more details, including the causal intervention experiments and subspace analysis, in the paper. Links to code and data are available on our website. 🌐: https://t.co/LvyoraI0xt 📜: https://t.co/Bu5WCOjD4U Joint work with @NatalieShapira @arnab_api @criedl @boknilev
arxiv.org
How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM)...
We expect the Lookback mechanism to extend beyond belief tracking: the concept of marking vital tokens seems universal across tasks requiring in-context manipulation. This mechanism could be fundamental to how LMs handle complex logical reasoning with conditionals.
We found that the LM generates a Visibility ID at the visibility sentence, which serves as source info. Its address copy stays in place, while a pointer copy flows to later lookback tokens. There, a QK-circuit dereferences the pointer to fetch info about the observed character as
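A toy sketch of that dereference step (the tensors and the single attention head here are made up for illustration): the pointer copy acts as a query, the address copy left at the source acts as a key, and attention fetches the payload stored there.

```python
# Toy single-head attention acting as a pointer dereference.
import torch

torch.manual_seed(0)
d = 8
vis_id = torch.randn(d)                  # Visibility ID (shared address/pointer value)
payload = torch.randn(d)                 # info about the observed character

keys = torch.randn(5, d)                 # 5 token positions
vals = torch.randn(5, d)
keys[1] = vis_id                         # address copy stays in place at position 1
vals[1] = payload                        # payload stored at the source position

query = vis_id                           # pointer copy at the later lookback token
attn = torch.softmax(keys @ query / d ** 0.5, dim=0)
fetched = attn @ vals                    # dereference: dominated by the payload
print(attn.argmax().item())              # -> 1, the source position
```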
Next, we studied how providing an explicit visibility condition affects characters' beliefs.
We test our high-level causal model using targeted causal interventions.
- Patching the answer lookback pointer from the counterfactual run turns the output from coffee to beer (pink line).
- Patching the answer lookback payload changes it from coffee to tea (grey line).
Clear,
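For readers who want the flavor of such a patching experiment, here is a generic activation-patching sketch; the stand-in model, layer choice, and prompts are my assumptions, and the paper patches specific lookback components rather than a whole layer's final position.

```python
# Cache a hidden state from a counterfactual run, patch it into the original
# run at the same layer, and check whether the model's answer flips.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in; the paper studies Llama-3-70B-Instruct
tok = AutoTokenizer.from_pretrained("gpt2")
layer = model.transformer.h[6]                         # illustrative layer choice
cache = {}

def save(module, inputs, output):                      # runs during the counterfactual pass
    cache["h"] = output[0].detach()

def patch(module, inputs, output):                     # runs during the original pass
    output[0][:, -1] = cache["h"][:, -1]               # overwrite the final-position state
    return output

with torch.no_grad():
    handle = layer.register_forward_hook(save)
    model(**tok("The cup on the table holds beer.", return_tensors="pt"))
    handle.remove()
    handle = layer.register_forward_hook(patch)
    logits = model(**tok("The cup on the table holds coffee.", return_tensors="pt")).logits
    handle.remove()

print(tok.decode(logits[0, -1].argmax().item()))       # did the patched answer move toward "beer"?
```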