Emmanuel Ameisen
@mlpowered
Followers
10K
Following
6K
Media
281
Statuses
2K
Interpretability/Finetuning @AnthropicAI Previously: Staff ML Engineer @stripe, Wrote BMLPA by @OReillyMedia, Head of AI at @InsightFellows, ML @Zipcar
San Francisco, CA
Joined June 2017
We've made progress in our quest to understand how Claude and models like it think! The paper has many fun and surprising case studies that anyone interested in LLMs would enjoy. Check out the video below for an example.
New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.
4
9
124
Anthropic Fellows have been responsible for some of the best research we've been able to do this year. If you are interested in directly helping the AI safety mission, you should apply!
We’re opening applications for the next two rounds of the Anthropic Fellows Program, beginning in May and July 2026. We provide funding, compute, and direct mentorship to researchers and engineers to work on real safety and security projects for four months.
0
1
22
Want your very own Claude wrapped? Just ask Claude to make it for you!
Had my Claude Wrapped instance put together a shareable prompt for anyone with a Pro/Max plan to make their own: https://t.co/dFPxasOSK0
0
1
9
Surreal to see interpretability research being featured on 60 minutes!
In an extreme stress test, Anthropic's AI models resorted to blackmail to avoid being shut down. Research scientist Joshua Batson shows @andersoncooper how it happened and what they learned from it. https://t.co/oDjW5iHujd
5
11
215
Ablating low-curvature components removes memorization while preserving reasoning. Different tasks show varying sensitivity: arithmetic & fact retrieval are brittle (they rely on low-curvature components), while logical reasoning is robust! paper: https://t.co/2tFPi32fRs
0
3
18
By decomposing weights using loss curvature, you can identify components used for memorization vs generalization. High-curvature = shared mechanisms used across data. Low-curvature = idiosyncratic directions for memorized examples. You can then ablate the memorization weights!
LLMs memorize a lot of training data, but memorization is poorly understood. Where does it live inside models? How is it stored? How much is it involved in different tasks? @jack_merullo_ & @srihita_raju's new paper examines all of these questions using loss curvature! (1/7)
12
59
570
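To make the decomposition above concrete, here is a minimal toy sketch (my own illustration, under placeholder assumptions: a tiny linear model, synthetic data, and an arbitrary cutoff k; the paper works with curvature approximations on real transformer weights, not an exact Hessian). It computes the loss Hessian, treats its eigenvectors as curvature directions, and removes the low-curvature components of the weights by projecting onto the high-curvature subspace.

```python
# Toy sketch of the curvature idea (my own illustration, not the paper's code):
# decompose a weight matrix by loss curvature and ablate low-curvature directions.
import torch
from torch.autograd.functional import hessian

torch.manual_seed(0)
d_in, d_out, n = 8, 4, 256                    # tiny stand-in for a transformer weight matrix
X = torch.randn(n, d_in)
true_W = torch.randn(d_in, d_out)
Y = X @ true_W + 0.01 * torch.randn(n, d_out)
W = true_W + 0.1 * torch.randn(d_in, d_out)   # pretend these are the trained weights

def loss_fn(w_flat):
    return ((X @ w_flat.reshape(d_in, d_out) - Y) ** 2).mean()

# Exact Hessian of the loss w.r.t. the flattened weights (only feasible for toys;
# real models need approximations).
H = hessian(loss_fn, W.flatten())
eigvals, eigvecs = torch.linalg.eigh(H)       # eigenvalues in ascending order = curvature
print("curvature range:", eigvals[0].item(), "to", eigvals[-1].item())

# "Ablate" the low-curvature components by projecting the weights onto the
# k highest-curvature eigendirections only (k is an arbitrary placeholder here).
k = 8
top = eigvecs[:, -k:]
w_ablated = top @ (top.T @ W.flatten())

print("loss before:", loss_fn(W.flatten()).item())
print("loss after ablation:", loss_fn(w_ablated).item())
```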
Neat use of attribution graphs to identify the features in models that are responsible for jailbreak resistance and refusals. The graphs allow the author to make simple interventions that increase or decrease refusal rates.
2
4
20
Striking result, which changed how I think about LLMs: when you change their activations, they can detect it and describe what the change was. This indicates a deep awareness of their internal processing. LLMs can sometimes access their own thoughts.
New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.
20
17
205
Cool paper using attribution graphs to automatically detect which reasoning steps contain mistakes! It seems like there are fundamental differences in the graphs for correct vs. incorrect steps. I'm really excited by methods to aggregate graphs and would love to see more such work.
Thrilled to share our latest research on verifying CoT reasoning, completed during my recent internship at FAIR @metaai. In this work, we introduce Circuit-based Reasoning Verification (CRV), a new white-box method to analyse and verify how LLMs reason, step by step.
3
14
134
These results are fun, and they also point to how general representations inside the model can be! Features for eyes activate in text, ASCII art, and SVGs. Features for emotions affect drawings!
What happens when you turn a designer into an interpretability researcher? They spend hours staring at feature activations in SVG code to see if LLMs actually understand SVGs. It turns out – yes~ We found that semantic concepts transfer across text, ASCII, and SVG:
0
1
22
THIS IS SO FREAKING COOL LLMS CAN LITERALLY SEE BECAUSE TEXT HAS SPATIAL QUALITIES THAT'S HOW THEY MAKE ASCII ART
New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!
57
224
4K
It's lovely to see an old school, deep dive mech interp analysis on a real model (Claude 3.5 Haiku)! This is both much more convoluted and more comprehensible than I expected! And so pretty. This seems like the most complex behaviour yet understood at real depth, nice work!
New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!
5
18
362
What mechanisms do LLMs use to perceive their world? An exciting effort led by @wesg52 @mlpowered reveals beautiful structure in how Claude Haiku implements a fundamental "perceptual" task for an LLM: deciding when to start a new line of text.
New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!
2
4
14
New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!
44
315
2K
This makes me hopeful about interpretability. We found the features, traced the algorithm, and understood the mechanism. Next: making this easier and automated. Full paper: https://t.co/xz3OcMy00J
6
4
104
We think these findings could apply to other counting tasks because:
- When we train a toy model to pack counts optimally, it discovers a similar structure
- Empirically, many tasks seem to be using similar structure (table rows, dates...)
1
2
52
And how does the model make these helices in the first place? To get enough curvature, it needs to sum up the results of many attention heads. We find that it uses 11 heads spread across 2 layers. Each head handles a subset of the line, and writes in a different direction.
1
3
86
The model does this multiple times with different offsets - like using two eyes or multiple cameras to get depth perception. Combining three offsets gives a precise estimate of characters left - sharp enough to decide if a 5-letter word fits in 3 remaining characters.
1
1
36
Remember, the model needs to compare its current position to the line limit. Both are on helices. The solution? Rotate one helix by a fixed offset, then measure how aligned they are. When they match → you're that many characters from the limit.
1
0
36
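A small worked example of the rotate-and-compare step above (my own sketch; the 6D embedding, frequencies, and offset are made-up stand-ins, since the real model's helix directions are learned): rotating the position helix by d and taking a dot product with the line-limit helix gives a score that peaks exactly when the current position is d characters from the limit.

```python
# Toy illustration (mine, not the paper's code) of comparing two helix embeddings
# by rotating one of them and measuring alignment with a dot product.
import numpy as np

freqs = np.array([0.05, 0.11, 0.23])   # arbitrary toy frequencies (3 sin/cos pairs -> 6D)

def helix(p):
    """Embed integer position p on a 6D helix."""
    return np.concatenate([np.cos(freqs * p), np.sin(freqs * p)]) / np.sqrt(3)

def rotate(v, offset):
    """Rotate each sin/cos pair by `offset` positions, so rotate(helix(p), d) == helix(p + d)."""
    c, s = v[:3], v[3:]
    co, so = np.cos(freqs * offset), np.sin(freqs * offset)
    return np.concatenate([c * co - s * so, s * co + c * so])

limit, offset = 80, 5                   # detector for "5 characters left"
positions = np.arange(limit + 1)
scores = [rotate(helix(p), offset) @ helix(limit) for p in positions]
print(positions[int(np.argmax(scores))])  # -> 75, i.e. exactly 5 characters from the limit
```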
Looking at the geometry of these features, we discover clear structure: the model doesn't use independent directions for each position range. Instead, it represents each potential position on a smooth 6D helix through embedding space.
1
1
47
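To see what the "smooth helix, not independent directions" claim buys you, here is a toy contrast (my own illustration with made-up frequencies, not the model's actual features): with independent random directions, similarity between position embeddings carries no notion of "nearby", while on a smooth helix it falls off gradually with positional distance.

```python
# Toy contrast: independent directions per position vs. a smooth 6D helix.
import numpy as np

rng = np.random.default_rng(0)
n_pos = 100
freqs = np.array([0.05, 0.11, 0.23])

# (a) independent random unit directions: similarities are near zero and unstructured
indep = rng.standard_normal((n_pos, 6))
indep /= np.linalg.norm(indep, axis=1, keepdims=True)

# (b) smooth 6D helix: similarity decays smoothly with distance between positions
p = np.arange(n_pos)[:, None]
helix = np.concatenate([np.cos(freqs * p), np.sin(freqs * p)], axis=1) / np.sqrt(3)

for name, E in [("independent", indep), ("helix", helix)]:
    sims = E @ E[50]                    # similarity of every position to position 50
    print(name, np.round(sims[[49, 55, 90]], 2))
```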