Emmanuel Ameisen
@mlpowered
Followers
10K
Following
6K
Media
279
Statuses
2K
Interpretability/Finetuning @AnthropicAI Previously: Staff ML Engineer @stripe, Wrote BMLPA by @OReillyMedia, Head of AI at @InsightFellows, ML @Zipcar
San Francisco, CA
Joined June 2017
We've made progress in our quest to understand how Claude and models like it think! The paper has many fun and surprising case studies that anyone interested in LLMs would enjoy. Check out the video below for an example.
New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.
4
8
121
Striking result, which changed how I think about LLMs: When you change their activations, they can detect it and express what the change was. This indicates a deep awareness of their internal processing. LLMs can sometimes access their own thoughts
New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.
20
17
202
Cool paper using attribution graphs to automatically detect which reasoning steps contain mistakes! It seems like there are fundamental differences in the graphs for correct/incorrect steps. I'm really excited by methods to aggregate graphs and would love to see more such work.
Thrilled to share our latest research on verifying CoT reasoning, completed during my recent internship at FAIR @metaai. In this work, we introduce Circuit-based Reasoning Verification (CRV), a new white-box method to analyse and verify how LLMs reason, step-by-step.
3
13
135
These results are fun, and they also point at how general representations inside the model can be! Features for eyes activate in text, ascii art, SVGs. Features for emotions affect drawings!
What happens when you turn a designer into an interpretability researcher? They spend hours staring at feature activations in SVG code to see if LLMs actually understand SVGs. It turns out – yes~ We found that semantic concepts transfer across text, ASCII, and SVG:
0
1
21
THIS IS SO FREAKING COOL LLMS CAN LITERALLY SEE BECAUSE TEXT HAS SPATIAL QUALITIES THAT'S HOW THEY MAKE ASCII ART
New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!
58
229
4K
It's lovely to see an old-school, deep-dive mech interp analysis on a real model (Claude 3.5 Haiku)! This is both much more convoluted and more comprehensible than I expected! And so pretty! This seems like the most complex behaviour yet understood at real depth, nice work!
New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!
5
19
364
What mechanisms do LLMs use to perceive their world? An exciting effort led by @wesg52 @mlpowered reveals beautiful structure in how Claude Haiku implements a fundamental "perceptual" task for an LLM: deciding when to start a new line of text.
New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!
2
3
14
New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!
44
309
2K
This makes me hopeful about interpretability. We found the features, traced the algorithm, and understood the mechanism. Next: making this easier and automated. Full paper: https://t.co/xz3OcMy00J
6
3
104
We think these findings could apply to other counting tasks because:
- When we train a toy model to pack counts optimally, it discovers a similar structure
- Empirically, many tasks seem to use a similar structure (table rows, dates...)
1
1
52
And how does the model make these helices in the first place? To get enough curvature, it needs to sum up the results of many attention heads. We find that it uses 11 heads spread across 2 layers. Each head handles a subset of the line, and writes in a different direction.
1
3
87
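The head-summing idea above can be sketched in a few lines. This is a toy illustration with made-up dimensions and random write directions, not the circuit from the paper: several hypothetical heads each cover a span of the line and write along their own direction, and summing their outputs yields a representation that keeps turning as the position advances.

```python
import numpy as np

# Hedged sketch, not the paper's actual circuit: approximate a curved position
# representation by summing contributions from several hypothetical "heads",
# each covering a different span of the line and writing its own direction.
d_model, n_heads, max_pos = 16, 11, 120
rng = np.random.default_rng(0)
head_dirs = rng.standard_normal((n_heads, d_model))        # one write direction per head
head_spans = np.array_split(np.arange(max_pos), n_heads)   # each head handles a subset of positions

def position_state(pos: int) -> np.ndarray:
    """Sum of head outputs: each head contributes in proportion to how much
    of its span lies at or before the current position."""
    out = np.zeros(d_model)
    for direction, span in zip(head_dirs, head_spans):
        frac = np.clip((pos - span[0] + 1) / len(span), 0.0, 1.0)
        out += frac * direction
    return out

# Because each span hands off to a head writing in a new direction, the states
# for successive positions trace out a curve rather than a straight line.
```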
The model does this multiple times with different offsets - like using two eyes or multiple cameras to get depth perception. Combining three offsets gives a precise estimate of characters left - sharp enough to decide if a 5-letter word fits in 3 remaining characters.
1
1
36
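A toy numeric example of why combining offsets helps; the readout values below are invented for illustration and say nothing about the model's actual precision:

```python
# Illustrative only (not the model's internals): combining several independent
# readouts of "how far am I from the limit" narrows the estimate, the way two
# eyes narrow a depth estimate.
readouts = [2.6, 3.3, 3.1]                         # hypothetical noisy offset estimates
chars_left = round(sum(readouts) / len(readouts))  # -> 3
print(len("fives") <= chars_left)                  # a 5-letter word does not fit in 3 chars
```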
Remember, the model needs to compare its current position to the line limit. Both are on helices. The solution? Rotate one helix by a fixed offset, then measure how aligned they are. When they match → you're that many characters from the limit.
1
0
36
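Here is a minimal sketch of the "rotate, then measure alignment" comparison, using a toy 6D helix with three hypothetical frequencies rather than the model's learned features:

```python
import numpy as np

# Minimal sketch, assuming three hand-picked frequencies; the real features
# are learned, and the comparison here is just a dot product after rotation.
FREQS = np.array([1.0, 0.25, 0.05])

def helix(pos: float) -> np.ndarray:
    a = FREQS * pos
    return np.concatenate([np.cos(a), np.sin(a)])   # 6D point for this position

def rotate(v: np.ndarray, offset: float) -> np.ndarray:
    """Rotate each frequency's (cos, sin) pair by freq * offset."""
    c, s, da = v[:3], v[3:], FREQS * offset
    return np.concatenate([c * np.cos(da) - s * np.sin(da),
                           s * np.cos(da) + c * np.sin(da)])

line_limit, offset = 80, 7
# Alignment is maximal when current position + offset == line limit,
# i.e. when we are exactly `offset` characters away from the limit.
print(rotate(helix(73), offset) @ helix(line_limit))   # 3.0 (perfect match)
print(rotate(helix(70), offset) @ helix(line_limit))   # smaller
```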
Looking at the geometry of these features, we discover clear structure: the model doesn't use independent directions for each position range. Instead, it is representing each potential position on a smooth 6D helix through embedding space.
1
1
47
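A minimal sketch of what such a representation could look like, assuming three sine/cosine pairs at made-up frequencies (the model's actual 6D helix is learned, not hand-picked):

```python
import numpy as np

# Toy 6D helix over character positions: three (cos, sin) pairs, fast to slow.
FREQS = np.array([1.0, 0.25, 0.05])   # hypothetical frequencies

def helix(pos: float) -> np.ndarray:
    a = FREQS * pos
    return np.concatenate([np.cos(a), np.sin(a)])   # shape (6,)

# Nearby positions land near each other on the helix, so position varies
# smoothly instead of getting an independent direction per position bucket.
print(np.round(helix(10), 2))
print(np.round(helix(11), 2))
```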
Looking at more prompts, we find a family of features (directions in embedding space) representing position in the line. Each activates at a different position. These resemble "place cells" - neurons in mouse brains that fire at specific locations when navigating space.
2
2
43
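For intuition, a family of position-tuned features can be sketched as tuning curves, each peaking at its own preferred position; the centers and widths below are hypothetical:

```python
import numpy as np

# Illustrative only: each feature fires most strongly around its own preferred
# position, loosely like place cells.
preferred = np.arange(0, 100, 10)      # each feature's preferred position

def activations(pos: float) -> np.ndarray:
    return np.exp(-((pos - preferred) ** 2) / (2 * 5.0 ** 2))

print(np.round(activations(37), 2))    # the features near positions 30-40 dominate
```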
When we trace the computation, we find the model tracking two things: where it is in the current line, and how long the previous line was. Then it compares them to decide if the next word fits. But how does it keep track of its position?
1
0
33
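The comparison itself can be written out as ordinary code; this sketches only the decision, not how the model computes it, and the function name and the +1 for a separating space are assumptions:

```python
# Hedged sketch of the decision: given the width implied by the previous line
# and the current position, does the next word still fit on this line?
def next_word_fits(current_pos: int, prev_line_len: int, next_word: str) -> bool:
    chars_left = prev_line_len - current_pos
    return len(next_word) + 1 <= chars_left   # +1 for the separating space

print(next_word_fits(75, 80, "model"))   # False -> break the line
print(next_word_fits(60, 80, "model"))   # True  -> keep writing on this line
```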
The task we study is knowing when to break the line in fixed-width text. We chose it for two reasons:
- While it's unconscious for humans (you just see when you're out of room), models don't have eyes: they only see tokens
- It is so common that models like Claude are very good at it
1
0
34
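For concreteness, here is the task written as a plain greedy line-wrapping routine; this describes what the model is good at, not anything about its internals:

```python
# A hedged sketch of the task itself: greedy fixed-width line breaking.
# Humans do this by sight; the model only ever sees the token stream.
def wrap(text: str, width: int = 40) -> str:
    lines, line = [], ""
    for word in text.split():
        if line and len(line) + 1 + len(word) > width:
            lines.append(line)                       # out of room: start a new line
            line = word
        else:
            line = f"{line} {word}" if line else word
    lines.append(line)
    return "\n".join(lines)

print(wrap("How does a language model know when it is about to run out of room?", 24))
```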
How does an LLM compare two numbers? We studied this in a common counting task, and were surprised to learn that the algorithm it used was: Put each number on a helix, and then twist one helix to compare it to the other. Not your first guess? Not ours either. 🧵
12
70
457
Claude Skills are awesome, maybe a bigger deal than MCP https://t.co/1wIYcTFrzI
simonwillison.net
Anthropic this morning introduced Claude Skills, a new pattern for making new abilities available to their models: Claude can now use Skills to improve how it performs specific tasks. Skills …
112
265
3K
tired: give an LLM a skill by fine-tuning it
wired: give an LLM a skill by putting some files on its computer about how to be good at the skill, which it can then read at its convenience
We're launching Claude Agent Skills, a filesystem-based approach to extending Claude's capabilities. Progressive disclosure means agents load only relevant context. Bundle instructions, scripts, and resources in a folder. Claude discovers and executes what it needs.
31
93
2K
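A hedged sketch of what "files on its computer" could look like in practice, assuming a skills/&lt;name&gt;/SKILL.md layout for illustration (not Anthropic's actual loader): the agent indexes short descriptions up front and loads a skill's full instructions only when a task calls for them.

```python
from pathlib import Path

# Hypothetical layout: skills/<name>/SKILL.md holding that skill's instructions.
SKILLS_DIR = Path("skills")

def list_skills() -> dict[str, str]:
    """Progressive disclosure, step 1: surface only names and first lines."""
    return {
        skill_md.parent.name: skill_md.read_text().splitlines()[0]
        for skill_md in SKILLS_DIR.glob("*/SKILL.md")
    }

def load_skill(name: str) -> str:
    """Step 2: pull in the full instructions only for the skill the task needs."""
    return (SKILLS_DIR / name / "SKILL.md").read_text()
```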