Michael Pearce
@_MichaelPearce
Followers
171
Following
619
Media
7
Statuses
87
Mechanistic Interpretability @ Goodfire | Physics | Evolution
Joined September 2015
Why use LLM-as-a-judge when you can get the same performance for 15–500x cheaper? Our new research with @RakutenGroup on PII detection finds that SAE probes:
- transfer from synthetic to real data better than normal probes
- match GPT-5 Mini performance at 1/15 the cost (1/6)
12
48
327
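A minimal sketch of the kind of SAE-feature probe described above: pool sparse-autoencoder latent activations per example and fit a sparse linear classifier. The random arrays below stand in for SAE activations on synthetic vs. real text; the sizes, labels, and pipeline are placeholder assumptions, not the paper's actual setup.

```python
# Toy SAE-feature probe for a binary "contains PII" label.
# Random arrays stand in for SAE latent activations (assumption).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_latents = 1024

# Stand-ins for pooled SAE activations on synthetic (train) and real (eval) text.
X_synth = rng.exponential(1.0, size=(1000, n_latents)) * (rng.random((1000, n_latents)) < 0.02)
y_synth = rng.integers(0, 2, size=1000)
X_real = rng.exponential(1.0, size=(300, n_latents)) * (rng.random((300, n_latents)) < 0.02)
y_real = rng.integers(0, 2, size=300)

# L1-penalized logistic regression keeps only a handful of SAE latents,
# which is what makes the probe cheap to run and easy to inspect.
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.5, max_iter=1000)
probe.fit(X_synth, y_synth)
print("F1 on held-out 'real' data:", f1_score(y_real, probe.predict(X_real)))
print("nonzero latents used:", np.count_nonzero(probe.coef_))
```

Once fit, the probe is just a sparse dot product per example, which is where the large cost gap versus calling an LLM judge comes from.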
New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!
44
315
2K
Agents for experimental research != agents for software development. This is a key lesson we've learned after several months refining agentic workflows! More takeaways on effectively using experimenter agents + a key tool we're open-sourcing to enable them: 🧵
3
30
221
We're excited to announce a collaboration with @MayoClinic! We're working to improve personalized patient outcomes by extracting richer, more reliable signals from genomic & digital pathology models. That could mean novel biomarkers, personalized diagnostics, & more.
3
10
73
Does making an SAE bigger let you explain more of your model's features? New research from @ericjmichaud_ models SAE scaling dynamics, and explores whether SAEs will pack increasingly many latents onto a few multidimensional features, rather than learning more features.
3
19
153
The structure seems consistent with the manifold with “ripples” picture seen in LLMs. Finding similar patterns across diverse models hints at a general organizing principle behind feature geometry. Looking forward to characterizing more biological structures in genomic models!
0
0
7
Excited to share our work digging into how Evo 2 represents species relatedness or phylogeny. Genetics provides a good quantitative measure of relatedness, so we could use it to probe the model and see if its internal geometry reflects it.
Arc Institute trained their foundation model Evo 2 on DNA from all domains of life. What has it learned about the natural world? Our new research finds that it represents the tree of life, spanning thousands of species, as a curved manifold in its neuronal activations. (1/8)
1
8
46
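A toy sketch of the probing idea above: check whether pairwise distances in a model's activation space track a quantitative measure of genetic relatedness. The arrays here are random stand-ins for per-species Evo 2 activations and a phylogenetic distance matrix; both are assumptions, not the paper's data.

```python
# Compare activation-space geometry against genetic relatedness.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_species, d_model = 200, 512

species_activations = rng.normal(size=(n_species, d_model))  # stand-in embeddings
genetic_distance = pdist(rng.normal(size=(n_species, 8)))    # stand-in relatedness

# Pairwise distances in activation space, in the same condensed ordering.
activation_distance = pdist(species_activations)

rho, _ = spearmanr(activation_distance, genetic_distance)
print(f"rank correlation between activation and genetic distances: {rho:.3f}")
```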
Adversarial examples - a vulnerability of every AI model, and a “mystery” of deep learning - may simply come from models cramming many features into the same neurons! Less feature interference → more robust models. New research from @livgorton 🧵 (1/4)
4
25
251
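A toy illustration of the feature-interference idea: packing more feature directions into fewer neurons increases their pairwise overlap, so a perturbation aimed at one feature also moves the readouts of others. The setup is illustrative only, not the experiments in the linked work.

```python
# Measure how much random feature directions interfere as they are
# crammed into spaces of different sizes.
import numpy as np

rng = np.random.default_rng(0)

def interference(n_features, n_neurons):
    # Random unit feature directions packed into an n_neuron space.
    W = rng.normal(size=(n_features, n_neurons))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    overlaps = W @ W.T
    np.fill_diagonal(overlaps, 0.0)
    # Mean absolute overlap = how much a push along one feature leaks into others.
    return np.abs(overlaps).mean()

for n_neurons in (64, 256, 1024):
    print(n_neurons, "neurons:", round(interference(n_features=512, n_neurons=n_neurons), 3))
```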
New research! Post-training often causes weird, unwanted behaviors that are hard to catch before deployment because they only crop up rarely - then are found by bewildered users. How can we find these efficiently? (1/7)
10
40
372
Could we tell if gpt-oss was memorizing its training data, i.e., find the points where it's reciting rather than reasoning? We took a quick look at the curvature of the loss landscape of the 20B model to understand memorization and what's happening internally during reasoning.
14
53
518
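A generic sketch of one way to measure local loss curvature: power iteration on Hessian-vector products to estimate the largest Hessian eigenvalue. The tiny model and batch are stand-ins, and this is not the exact analysis from the post.

```python
# Estimate the top Hessian eigenvalue of a loss via Hessian-vector products.
import torch

def top_hessian_eigenvalue(loss, params, n_iters=20):
    """Power iteration for the largest eigenvalue of d^2 loss / d params^2."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    v = torch.randn_like(flat_grad)
    v /= v.norm()
    for _ in range(n_iters):
        # Hessian-vector product: differentiate (grad . v) w.r.t. the parameters.
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv])
        eig = v @ hv                      # Rayleigh quotient with unit-norm v
        v = hv / (hv.norm() + 1e-12)
    return eig.item()

# Tiny stand-in model and batch; sharp curvature is one signature of memorization.
model = torch.nn.Linear(16, 4)
x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))
loss = torch.nn.functional.cross_entropy(model(x), y)
print(top_hessian_eigenvalue(loss, list(model.parameters())))
```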
New research with coauthors at @Anthropic, @GoogleDeepMind, @AiEleuther, and @decode_research! We expand on and open-source Anthropic’s foundational circuit-tracing work. Brief highlights in thread: (1/7)
3
22
248
Just wrote a piece on why I believe interpretability is AI’s most important frontier - we're building the most powerful technology in history, but still can't reliably engineer or understand our models. With rapidly improving model capabilities, interpretability is more urgent than ever.
1
17
138
(1/7) New research: how can we understand how an AI model actually works? Our method, SPD, decomposes the *parameters* of neural networks, rather than their activations - akin to understanding a program by reverse-engineering the source code vs. inspecting runtime behavior.
13
86
790
New research update! We replicated @AnthropicAI's circuit tracing methods to test if they can recover a known, simple transformer mechanism.
2
53
503
“There is no wave function...” This claim by Jacob Barandes sounds outlandish, but allow me to justify it with a blend of physical intuition and mathematical rigor. We'll dispel some quantum woo myths along the way. (1/13)
60
146
1K
We created a canvas that plugs into an image model’s brain. You can use it to generate images in real-time by painting with the latent concepts the model has learned. Try out Paint with Ember for yourself 👇
39
99
927
We're excited to announce the release of SAE Bench 1.0, our suite of Sparse Autoencoder (SAE) evaluations! We have also trained and evaluated a suite of open-source SAEs across 7 architectures, which led to exciting new qualitative findings - details in the 🧵 below 👇
4
35
193
In the physical world, almost all information is transmitted through traveling waves -- why should it be any different in your neural network? Super excited to share recent work with the brilliant @mozesjacobs: "Traveling Waves Integrate Spatial Information Through Time" 1/14
145
908
7K
Do SAEs find the ‘true’ features in LLMs? In our ICLR paper w/ @neelnanda5 we argue no. The issue: we must choose the number of concepts learned. Small SAEs miss low-level concepts, but large SAEs miss high-level concepts - it’s sparser to compose them into low-level concepts.
3
38
272
MLPs and GLUs are hard to interpret, but they make up most transformer parameters. Linear and quadratic functions are easier to interpret. We show how to convert MLPs & GLUs into polynomials in closed form, allowing you to use SVD and direct inspection for interpretability 🧵
5
32
301
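A sketch of the closed-form view for the simplest case, a gated linear unit with a linear gate (a bilinear layer): each output neuron is a quadratic form x^T B_k x, and the eigenvectors of the symmetric interaction matrix B_k give input directions you can inspect directly. The construction below is illustrative; the paper's full MLP/GLU conversion is not reproduced here.

```python
# Each output of a linear-gated GLU is g_k(x) = (w_k.x)(v_k.x) = x^T B_k x.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 32, 8
W, V = rng.normal(size=(d_hidden, d_in)), rng.normal(size=(d_hidden, d_in))

def glu_linear_gate(x):
    return (W @ x) * (V @ x)          # elementwise gate, no nonlinearity

def interaction_matrix(k):
    outer = np.outer(W[k], V[k])
    return 0.5 * (outer + outer.T)    # symmetrize: x^T B_k x is unchanged

# Check the quadratic form reproduces the layer exactly.
x = rng.normal(size=d_in)
k = 3
assert np.allclose(glu_linear_gate(x)[k], x @ interaction_matrix(k) @ x)

# Spectrum of B_k: each eigenvector is an input direction whose signed
# eigenvalue says how strongly it excites or suppresses neuron k.
eigvals, eigvecs = np.linalg.eigh(interaction_matrix(k))
print("largest-magnitude eigenvalues:", eigvals[np.argsort(-np.abs(eigvals))[:3]])
```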