Michael Hanna Profile
Michael Hanna

@michaelwhanna

Followers
599
Following
371
Media
27
Statuses
69

PhD student at the University of Amsterdam / ILLC, interested in computational linguistics and (mechanistic) interpretability. Current Anthropic Fellow.

Berkeley, CA
Joined August 2019
@michaelwhanna
Michael Hanna
2 months
RT @GoodfireAI: New research update! We replicated @AnthropicAI's circuit tracing methods to test if they can recover a known, simple trans….
0
53
0
@michaelwhanna
Michael Hanna
3 months
RT @Jack_W_Lindsey: We’re releasing an open-source library and public interactive interface for tracing the internal “thoughts” of a langua….
0
43
0
@michaelwhanna
Michael Hanna
3 months
RT @mlpowered: The methods we used to trace the thoughts of Claude are now open to the public!. Today, we are releasing a library which let….
0
176
0
@michaelwhanna
Michael Hanna
3 months
We’re also excited to see other replications of transcoder circuit-finding work! EleutherAI has been building a library as well, which you can find here:
github.com
Contribute to EleutherAI/attribute development by creating an account on GitHub.
1
0
6
@michaelwhanna
Michael Hanna
3 months
Big thanks as well to @johnnylin and @CurtTigges from @Neuronpedia, for hosting graphs + features and running autointerp on the transcoder features, making circuit-finding even easier! Thanks to @adamrpearce too for the awesome frontend for visualizing circuits!
1
0
7
@michaelwhanna
Michael Hanna
3 months
Thanks also to @thebasepoint for your help, and to fellow Fellow @andyarditi for pre-release testing! Thanks as well to @Anthropic and @EthanJPerez for running the Anthropic Fellows Program - it's been a great environment for doing important safety research.
1
0
5
@michaelwhanna
Michael Hanna
3 months
Circuit-tracer uses the attribution method introduced in Anthropic's recent work (linked below) and was built as part of the Anthropic Fellows Program. Thanks to our mentors, @mlpowered and @Jack_W_Lindsey, for making this possible!
transformer-circuits.pub
We describe an approach to tracing the “step-by-step” computation involved when a model responds to a single prompt.
1
0
10
@michaelwhanna
Michael Hanna
3 months
For now, circuit-tracer supports Gemma 2 (2B) and Llama 3.2 (1B), and uses single-layer transcoders, but is extensible to other models and transcoder architectures! Try it out:
- GitHub:
- Colab (click the badge!):
github.com
Contribute to safety-research/circuit-tracer development by creating an account on GitHub.
1
0
9
@michaelwhanna
Michael Hanna
3 months
More whimsical features abound as well: we can find a "pirate" feature in Gemma, and clamp it on to make Gemma tell stories about pirates!
Tweet media one
1
0
10
@michaelwhanna
Michael Hanna
3 months
We can also perform interventions to check that our circuits are correct! Don't believe that Gemma uses a "French" feature to produce French results? We can turn it off and make Gemma output English - or turn a "Chinese" feature on, and get Chinese output!
Tweet media one
1
0
10
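As a rough illustration of what such an intervention involves, here is a minimal sketch (illustrative names only, not circuit-tracer's actual API): a forward hook that clamps the component of a layer's output along one feature's decoder direction to a chosen value, which is the mechanism behind turning a feature "off" (value 0) or "on" (a large positive value).

```python
# Minimal sketch of clamping one feature during generation via a forward hook.
# `feature_direction` (the feature's decoder vector) and the module path are
# illustrative assumptions, not circuit-tracer's real interface.
import torch

def clamp_feature_hook(feature_direction: torch.Tensor, value: float):
    """Build a hook that fixes the activation's component along `feature_direction`
    to `value` (0.0 ablates the feature; a large value clamps it on)."""
    direction = feature_direction / feature_direction.norm()

    def hook(module, inputs, output):
        current = output @ direction                        # [batch, seq]
        # Shift the output so its projection onto `direction` equals `value`.
        return output + (value - current).unsqueeze(-1) * direction

    return hook

# Hypothetical usage: clamp a "French" feature off while generating.
# handle = model.layers[12].mlp.register_forward_hook(
#     clamp_feature_hook(french_direction, value=0.0))
# ...generate text...
# handle.remove()
```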
@michaelwhanna
Michael Hanna
3 months
This lets us find circuits that help explain complex model behaviors. For example, we can see that Gemma 2 (2B) solves the same task in multiple languages by using cross-lingual features - and adding language-specific features at the end.
Tweet media one
1
0
9
@michaelwhanna
Michael Hanna
3 months
Circuit-tracer then finds the features that are important by computing the exact effect that each feature has on each other feature and on the model's logits. We can then explore the features that are causally relevant, constructing a graph of features - that is, a circuit!
Tweet media one
1
0
14
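For intuition about the edge weights in such a graph, here is a toy sketch of the underlying idea (a hypothetical helper, not the library's implementation): the direct effect of a source feature on a downstream node is estimated as the source's activation times the gradient of that node with respect to it.

```python
# Toy sketch of one attribution-graph edge: contribution of each source feature
# to a downstream scalar node (another feature's pre-activation, or a logit),
# computed as activation * gradient. Names here are illustrative only.
import torch

def edge_weights(source_acts: torch.Tensor, target_node: torch.Tensor) -> torch.Tensor:
    """source_acts: [n_src] feature activations (requires_grad=True).
    target_node: scalar that depends on them. Returns per-source contributions."""
    (grads,) = torch.autograd.grad(target_node, source_acts)
    return source_acts.detach() * grads

# Tiny worked example: three source "features" feeding one "logit".
acts = torch.tensor([0.5, 2.0, 0.0], requires_grad=True)
logit = (torch.tensor([1.0, -0.5, 3.0]) * acts).sum()
print(edge_weights(acts, logit))   # tensor([ 0.5000, -1.0000,  0.0000])
```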
@michaelwhanna
Michael Hanna
3 months
Circuit-tracer works by taking in a model and set of transcoders, which break down its internal activations into interpretable features. We determine the meaning of each feature by looking at the text that makes it activate the strongest - check out the Texas feature below!
Tweet media one
1
0
14
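To make the transcoder step concrete, here is a minimal, self-contained sketch (the class and helper are assumptions for illustration, not circuit-tracer's API): a single-layer transcoder encodes a layer's activations into sparse features, and a feature's meaning is read off from the token positions where it fires most strongly.

```python
# Illustrative single-layer transcoder: sparse ReLU encoder + linear decoder.
# Hypothetical class/function names; not the actual circuit-tracer interface.
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Maps a layer's input activations to an overcomplete set of sparse
    features, then reconstructs the layer's output from those features."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def encode(self, acts: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(acts))       # sparse feature activations

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encode(acts))

def top_positions_for_feature(transcoder: Transcoder, acts: torch.Tensor,
                              feature_idx: int, k: int = 5):
    """Return the k token positions where `feature_idx` activates most strongly;
    reading the text at those positions is how a feature gets its label."""
    feats = transcoder.encode(acts)                  # [seq_len, n_features]
    values, positions = feats[:, feature_idx].topk(k)
    return values, positions
```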
@michaelwhanna
Michael Hanna
3 months
RT @AnthropicAI: Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-s….
0
582
0
@michaelwhanna
Michael Hanna
4 months
I'll be presenting this in person at @naaclmeeting, tomorrow at 11am in Ballroom C! Come on by - I'd love to chat with folks about this and all things interp / cog sci!
@michaelwhanna
Michael Hanna
8 months
Sentences are partially understood before they're fully read. How do LMs incrementally interpret their inputs? In a new paper @amuuueller and I use mech interp to study how LMs process structurally ambiguous sentences. We show LMs rely on both syntactic & spurious features! 1/10
Tweet media one
0
0
37
@michaelwhanna
Michael Hanna
4 months
RT @amuuueller: Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements ov….
0
38
0
@michaelwhanna
Michael Hanna
6 months
RT @tal_haklay: 1/13 LLM circuits tell us where the computation happens inside the model—but the computation varies by token position, a ke….
0
44
0
@michaelwhanna
Michael Hanna
8 months
Unexpectedly, we find that, when answering follow-up questions like "The boy fed the chicken smiled. Did the boy feed the chicken?", LMs don't repair or rely on earlier syntactic features! But they also don't generate new syntactic features. 9/10.
1
0
1
@michaelwhanna
Michael Hanna
8 months
What do LMs do when the ambiguity is resolved? Do they repair their initial representations—which could look like adding on to the circuit we've shown? Or do they reanalyze—for example, by ignoring that circuit and using new syntactic features? 8/10.
1
0
1