Emmanuel Ameisen (@mlpowered)
Followers: 10K · Following: 5K · Media: 268 · Statuses: 2K

Interpretability/Finetuning @AnthropicAI. Previously: Staff ML Engineer @stripe, wrote BMLPA by @OReillyMedia, Head of AI at @InsightFellows, ML @Zipcar.

San Francisco, CA · Joined June 2017
Emmanuel Ameisen (@mlpowered) · 5 months
We've made progress in our quest to understand how Claude and models like it think! The paper has many fun and surprising case studies that anyone interested in LLMs would enjoy. Check out the video below for an example.
Anthropic (@AnthropicAI) · 5 months
New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.
4 replies · 7 reposts · 116 likes
Emmanuel Ameisen (@mlpowered) · 5 days
Friday we published our no-jargon explainer of how LLMs "think". It's been great to see the response - already at 80k views! We covered why models hallucinate, why they flatter users, and whether they are just glorified autocomplete! If you are curious about LLMs, check it out! 👇
10 replies · 14 reposts · 278 likes
Emmanuel Ameisen (@mlpowered) · 8 days
If you're curious about the broader research landscape in interpretability and would like to dive into the weeds, experts from across the field came together to write this (more technical) overview.
neuronpedia.org: A multi-organization interpretability project to replicate and extend circuit tracing research.
0 replies · 1 repost · 3 likes
Emmanuel Ameisen (@mlpowered) · 8 days
In it, we discuss common mental models and where they fall short. For example, thinking of LLMs as next-token predictors or as search engines over the training data isn't quite right! Watch until the end to hear the answer to the real question: do models think like humans?
1 reply · 0 reposts · 1 like
Emmanuel Ameisen (@mlpowered) · 8 days
Think you know how LLMs work? Think again! Today, most people have the wrong mental models about LLMs. But we do know *some* things about how they work! So we made a video discussing what we know in a conversational format, for anyone who is curious.
Anthropic (@AnthropicAI) · 8 days
Join Anthropic interpretability researchers @thebasepoint, @mlpowered, and @Jack_W_Lindsey as they discuss looking into the mind of an AI model - and why it matters:
1 reply · 4 reposts · 35 likes
Emmanuel Ameisen (@mlpowered) · 12 days
RT @ch402: Our interpretability team is planning to mentor more fellows this cycle! Applications are due Aug 17.
0 replies · 18 reposts · 0 likes
Emmanuel Ameisen (@mlpowered) · 15 days
RT @ludwigABAP: The "Circuit Analysis Research Landscape" for August 2025 is out and is an interesting read on "the landscape of interpreta…
0 replies · 13 reposts · 0 likes
Emmanuel Ameisen (@mlpowered) · 18 days
Researchers from Goodfire, Google DeepMind, Decode, Eleuther, and Anthropic wrote a post about tracing circuits in language models! We cover how to train replacement models and compute graphs of model internals, and even filmed a 2-hour walkthrough of interpreting some examples!
neuronpedia (@neuronpedia) · 18 days
Today, we're releasing The Circuit Analysis Research Landscape: an interpretability post extending & open sourcing Anthropic's circuit tracing work, co-authored by @Anthropic, @GoogleDeepMind, @GoodfireAI, @AiEleuther, and @decode_research. Here's a quick demo, details follow: ⤵️
0 replies · 1 repost · 19 likes
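To make "train replacement models" concrete, here is a minimal, hypothetical sketch: fit a sparse transcoder to imitate one MLP's input-output behavior, so circuits can later be read off its features. The sizes, stand-in MLP, and L1 penalty are illustrative assumptions, not the released code.

```python
# Rough sketch of training a "replacement model" component: a sparse
# transcoder that imitates one MLP, with interpretable feature activations.
import torch
import torch.nn as nn

d_model, d_feat = 64, 512
orig_mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                         nn.Linear(4 * d_model, d_model))  # stand-in for the model's MLP

enc = nn.Linear(d_model, d_feat)   # residual stream -> sparse features
dec = nn.Linear(d_feat, d_model)   # sparse features -> MLP output
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for step in range(200):
    x = torch.randn(256, d_model)  # would be real residual-stream activations
    target = orig_mlp(x).detach()
    feats = torch.relu(enc(x))     # nonnegative and L1-penalized, hence sparse
    loss = (dec(feats) - target).pow(2).mean() + 1e-3 * feats.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```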
Emmanuel Ameisen (@mlpowered) · 18 days
RT @GoodfireAI: New research with coauthors at @Anthropic, @GoogleDeepMind, @AiEleuther, and @decode_research! We expand on and open-source…
0 replies · 21 reposts · 0 likes
Emmanuel Ameisen (@mlpowered) · 22 days
In which the gang (@RunjinChen, @andyarditi, @Jack_W_Lindsey):
- identifies vectors for bad personas (evil, sycophancy, hallucinations, etc.)
- shows that if you inject the bad vectors in training, the model learns to not do the bad thing!!
aka vaccines, but for LLMs.
Anthropic (@AnthropicAI) · 22 days
New Anthropic research: Persona vectors. Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find "persona vectors": neural activity patterns controlling traits like evil, sycophancy, or hallucination.
5 replies · 9 reposts · 92 likes
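A minimal sketch of the persona-vector recipe described above, assuming a GPT-2-style HuggingFace model: take the difference of mean residual-stream activations between trait-eliciting and neutral prompts, then add that direction back during finetuning (the "vaccine"). The layer choice, prompts, and scale below are made-up stand-ins, not the paper's setup.

```python
# Extract a trait direction from activation differences, then inject it
# during finetuning so the model learns to resist it.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
LAYER = 6  # hypothetical choice: a mid-depth residual stream

def mean_resid(prompts):
    # Mean residual-stream activation at LAYER, averaged over tokens and prompts.
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            hs = model(**ids, output_hidden_states=True).hidden_states
        acts.append(hs[LAYER].mean(dim=1))
    return torch.cat(acts).mean(dim=0)

# Difference of means between trait-eliciting and neutral prompts gives a
# candidate persona vector for the trait (toy example prompts).
sycophantic = ["You're so right, that is a brilliant idea!"]
neutral = ["The measurement was repeated three times."]
persona_vec = mean_resid(sycophantic) - mean_resid(neutral)

def inject(module, inputs, output):
    # Add the vector to the block's hidden states; 4.0 is a made-up scale.
    return (output[0] + 4.0 * persona_vec,) + output[1:]

# Supplying the trait during training means gradient descent no longer
# needs to build it into the weights.
handle = model.h[LAYER].register_forward_hook(inject)
# ... finetuning steps would run here; remove the hook before evaluation ...
handle.remove()
```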
Emmanuel Ameisen (@mlpowered) · 23 days
It turns out these heads are used in a variety of domains, like simple arithmetic prompts. There are more examples and details in the writeup, including investigating multiple choice circuits and induction.
2 replies · 1 repost · 22 likes
Emmanuel Ameisen (@mlpowered) · 23 days
This lets us do pretty targeted interventions. We can swap in the yellow feature for a red feature and see the model change its prediction! Importantly, we only need to intervene on the input to the heads (not the residual stream), and only for the heads we identified.
1 reply · 0 reposts · 11 likes
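A toy sketch of that intervention, with random stand-ins for the actual features and model: project the "yellow" direction out of the input the identified heads read, write the "red" direction in with the same coefficient, and leave the residual stream itself untouched.

```python
# Swap one feature's contribution for another's, only in what the
# identified heads read.
import torch

d_model = 64
f_yellow = torch.randn(d_model); f_yellow /= f_yellow.norm()
f_red = torch.randn(d_model); f_red /= f_red.norm()

def swap_feature(head_input, old, new):
    # Remove the old direction's component and write the new direction in
    # with the same coefficient, leaving all other directions unchanged.
    coef = head_input @ old                      # [batch, seq]
    return head_input - coef[..., None] * old + coef[..., None] * new

resid = torch.randn(1, 10, d_model)              # residual stream [batch, seq, d]
patched = swap_feature(resid, f_yellow, f_red)

# Only the identified heads (3 and 7, hypothetically) see the patched
# input; every other head still reads the original activations.
head_inputs = {h: (patched if h in {3, 7} else resid) for h in range(8)}
```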
Emmanuel Ameisen (@mlpowered) · 23 days
One set of heads attends when the attribute matches the object, the other when it doesn't! When a given head attends, it writes to the corresponding feature, and the model uses that feature to decide its output.
1 reply · 0 reposts · 9 likes
Emmanuel Ameisen (@mlpowered) · 23 days
So how does the model know that one statement is plausible, and the other is discordant? It turns out that there are "concordance heads" and "discordance heads" which attend from attributes on the query side (yellow) to the object on the key side (banana).
2 replies · 0 reposts · 12 likes
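A toy illustration of how one might screen for such heads, with random attention patterns standing in for a real model's: compare each head's attention from the attribute token (query) back to the object token (key) on a matching statement versus a mismatched one. Token positions and thresholds are hypothetical.

```python
# Classify heads by whether they attend attribute->object more on
# matching or on mismatched statements.
import torch

n_heads, seq = 12, 8
obj_pos, attr_pos = 2, 6  # e.g. "banana" at position 2, "yellow"/"red" at 6

def attr_to_obj(attn):
    # attn: [n_heads, seq, seq] post-softmax pattern; query=attr, key=obj.
    return attn[:, attr_pos, obj_pos]

attn_match = torch.softmax(torch.randn(n_heads, seq, seq), dim=-1)     # "bananas are yellow"
attn_mismatch = torch.softmax(torch.randn(n_heads, seq, seq), dim=-1)  # "bananas are red"

diff = attr_to_obj(attn_match) - attr_to_obj(attn_mismatch)
concordance_heads = torch.where(diff > 0.1)[0]   # attend more when it matches
discordance_heads = torch.where(diff < -0.1)[0]  # attend more when it doesn't
print(concordance_heads.tolist(), discordance_heads.tolist())
```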
Emmanuel Ameisen (@mlpowered) · 23 days
The model has incorrect/correct-answer features which activate over the respective sentences. Where do these features come from? It turns out they originate from slightly more nuanced features about things that are either plausible or discordant ("the bathroom is for eating").
1 reply · 1 repost · 13 likes
Emmanuel Ameisen (@mlpowered) · 23 days
Discordance heads! How does the model decide if something is true? If you take a simple example like the one below, where you ask if a banana is yellow or red, some interesting features show up.
1 reply · 0 reposts · 9 likes
Emmanuel Ameisen (@mlpowered) · 23 days
A key component of transformers is attention, which directs the flow of information from one token to another and connects features. In this work, we explain attention patterns by decomposing them into a list of feature/feature interactions. We find neat things, for example:
1 reply · 0 reposts · 15 likes
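A small sketch of that decomposition, with random stand-ins for the real features and weights: writing the query-side and key-side activations as sums of feature directions turns one head's pre-softmax attention score into a sum of feature/feature contributions through the bilinear QK form.

```python
# Expand one head's attention score into per-feature-pair contributions.
import torch

d_model, d_head, n_feat = 64, 16, 5
W_Q = torch.randn(d_model, d_head)
W_K = torch.randn(d_model, d_head)
W_QK = W_Q @ W_K.T                    # the head's bilinear form [d_model, d_model]

feats = torch.randn(n_feat, d_model)  # feature directions
a = torch.randn(n_feat)               # feature activations at the query token
b = torch.randn(n_feat)               # feature activations at the key token
x_q, x_k = a @ feats, b @ feats

# contrib[i, j] is "query feature i interacting with key feature j";
# the full pre-softmax score is the sum over all pairs.
contrib = (a[:, None] * b[None, :]) * (feats @ W_QK @ feats.T)
assert torch.allclose(contrib.sum(), x_q @ W_QK @ x_k, rtol=1e-3)
```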
Emmanuel Ameisen (@mlpowered) · 23 days
Earlier this year, we showed a method to interpret the intermediate steps a model takes to produce an answer. But we were missing a key bit of information: explaining why the model attends to specific concepts. Today, we do just that 🧵
6 replies · 56 reposts · 509 likes
Emmanuel Ameisen (@mlpowered) · 24 days
RT @claudeai: You're absolutely right.
0 replies · 1K reposts · 0 likes
Emmanuel Ameisen (@mlpowered) · 24 days
Lots of good research came from the first iteration of the program, including an open-source mech interp library to trace circuits (linked below). Recommend applying if you're interested!
github.com: safety-research/circuit-tracer
Anthropic (@AnthropicAI) · 25 days
We’re running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.
0 replies · 1 repost · 21 likes