
Emmanuel Ameisen
@mlpowered
Followers
10K
Following
5K
Media
268
Statuses
2K
Interpretability/Finetuning @AnthropicAI Previously: Staff ML Engineer @stripe, Wrote BMLPA by @OReillyMedia, Head of AI at @InsightFellows, ML @Zipcar
San Francisco, CA
Joined June 2017
We've made progress in our quest to understand how Claude and models like it think! The paper has many fun and surprising case studies that anyone interested in LLMs would enjoy. Check out the video below for an example.
New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.
4
7
116
If you're curious about the broader research landscape in the field of interpretability, and would like to dive into the weeds, experts from across the field came together to write this (more technical) overview.
neuronpedia.org
A multi-organization interpretability project to replicate and extend circuit tracing research.
0
1
3
Think you know how LLMs work? Think again! Today, most people have the wrong mental model of LLMs. But we do know *some* things about how they work! So we made a video discussing what we know in a conversational format, for anyone who is curious.
Join Anthropic interpretability researchers @thebasepoint, @mlpowered, and @Jack_W_Lindsey as they discuss looking into the mind of an AI model - and why it matters:
1
4
35
RT @ch402: Our interpretability team is planning to mentor more fellows this cycle! Applications are due Aug 17.
0
18
0
RT @ludwigABAP: The "Circuit Analysis Research Landscape" for August 2025 is out and is an interesting read on "the landscape of interpreta…
0
13
0
Researchers from Goodfire, Google DeepMind, Decode, Eleuther, and Anthropic wrote a post about tracing circuits in language models! We cover how to train replacement models and compute graphs of model internals, and even filmed a 2-hour walkthrough of interpreting some examples!
Today, we're releasing The Circuit Analysis Research Landscape: an interpretability post extending & open sourcing Anthropic's circuit tracing work, co-authored by @Anthropic, @GoogleDeepMind, @GoodfireAI @AiEleuther, and @decode_research. Here's a quick demo, details follow: ⤵️
0
1
19
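The replacement-model idea above can be sketched in miniature. This is a hedged toy, not the post's actual pipeline: the activations are random stand-ins, the "transcoder" is a single ReLU encoder/decoder pair, and only the decoder takes one gradient step (a real run would train both matrices with a sparsity penalty). It shows the two steps the tweet names: fit interpretable features to mimic an MLP, then read off linear attribution edges from feature activations and decoder weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_feat, n = 32, 128, 512

# Hypothetical data: inputs to an MLP layer and its outputs, standing in
# for activations collected from a real transformer.
x = rng.normal(size=(n, d_model))
y = np.tanh(x @ rng.normal(scale=0.2, size=(d_model, d_model)))

# Replacement model: a "transcoder" with ReLU features trained to mimic
# the MLP's input->output map using more interpretable units.
W_enc = rng.normal(scale=0.1, size=(d_model, d_feat))
W_dec = rng.normal(scale=0.1, size=(d_feat, d_model))

def transcoder(x):
    feats = np.maximum(x @ W_enc, 0.0)   # nonnegative feature activations
    return feats, feats @ W_dec          # reconstructed MLP output

# One gradient step on the reconstruction loss (training sketch).
feats, y_hat = transcoder(x)
loss_before = ((y_hat - y) ** 2).mean()
W_dec -= 0.05 * feats.T @ (y_hat - y) / n
_, y_hat2 = transcoder(x)
loss_after = ((y_hat2 - y) ** 2).mean()

# Once features replace the MLP, a feature's direct effect on each output
# dimension is linear: edge weight = activation * decoder row. These edges
# are one ingredient of an attribution graph over model internals.
j = int(feats[0].argmax())                 # most active feature, example 0
edges_to_outputs = feats[0, j] * W_dec[j]  # shape (d_model,)
```

The one-step decoder update is enough to see the loss move; the interesting part for interpretability is that `edges_to_outputs` is exact for the replacement model, not an approximation.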
RT @GoodfireAI: New research with coauthors at @Anthropic, @GoogleDeepMind, @AiEleuther, and @decode_research! We expand on and open-source…
0
21
0
In which the gang (@RunjinChen, @andyarditi, @Jack_W_Lindsey):
- identifies vectors for bad personas (evil, sycophancy, hallucinations, etc.)
- shows that if you inject the bad vectors during training, the model learns not to do the bad thing!!
aka vaccines, but for LLMs.
New Anthropic research: Persona vectors. Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find "persona vectors"—neural activity patterns controlling traits like evil, sycophancy, or hallucination.
5
9
92
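The recipe in the thread above can be sketched with toy numbers. Everything here is a hedged stand-in: the activations are synthetic, and the difference-of-means extraction and the steering coefficient `alpha` are common recipes for this kind of work, not necessarily the paper's exact pipeline. The point is the shape of the idea: a trait lives along a direction, you can score it by projection, and you can add the direction to activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical residual-stream activations: responses where the model
# exhibits a trait (e.g. sycophancy) vs. matched neutral responses.
acts_trait = rng.normal(loc=0.5, size=(200, d_model))
acts_neutral = rng.normal(loc=0.0, size=(200, d_model))

# Persona vector as a unit difference-of-means direction.
persona_vec = acts_trait.mean(axis=0) - acts_neutral.mean(axis=0)
persona_vec /= np.linalg.norm(persona_vec)

def steer(resid, vec, alpha):
    """Shift a residual-stream activation along the persona direction."""
    return resid + alpha * vec

# Monitoring: project an activation onto the vector to score the trait.
resid = rng.normal(size=d_model)
score_before = resid @ persona_vec

# The "vaccine" intuition: injecting the vector during finetuning supplies
# the trait directly, so the optimizer has less pressure to bake it into
# the weights.
steered = steer(resid, persona_vec, alpha=4.0)
score_after = steered @ persona_vec
```

Because `persona_vec` is unit-norm, steering with coefficient `alpha` raises the projection score by exactly `alpha`, which is what makes the direction useful both for monitoring and for intervention.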
Lots of good research came from the first iteration of the program, including an open-source mech interp library to trace circuits (repo below). Recommend applying if you're interested!
github.com
Contribute to safety-research/circuit-tracer development by creating an account on GitHub.
We’re running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.
0
1
21