neuronpedia

@neuronpedia

Followers
946
Following
29
Media
23
Statuses
59

open source interpretability platform 🧠🧐

the residual stream
Joined July 2023
@neuronpedia
neuronpedia
3 months
Today, we're releasing The Circuit Analysis Research Landscape: an interpretability post extending & open sourcing Anthropic's circuit tracing work, co-authored by @Anthropic, @GoogleDeepMind, @GoodfireAI, @AiEleuther, and @decode_research. Here's a quick demo, details follow: ⤵️
7
67
330
@andyarditi
Andy Arditi
2 months
We found "misaligned persona" features in Llama and Qwen that mediate emergent misalignment. Fine-tuning on bad medical advice strengthens these pre-existing features, causing broader undesirable behavior.
lesswrong.com
This work was conducted in May 2025 as part of the Anthropic Fellows Program, under the mentorship of Jack Lindsey. We were initially excited about t…
1
12
79
@neuronpedia
neuronpedia
3 months
[Re-uploading the GIF since the initial one was truncated] To check our reasoning hypothesis, we can steer to interrupt specific steps: Interrupting the "capital" step: Dallas ➡️ Texas ➡️ ❓ ➡️ Texas. Interrupting the "Texas" step: Dallas ➡️ ❓ ➡️ capital ➡️ Albany (New York's capital)
0
0
1
@thebasepoint
Joshua Batson
3 months
Very cool collaboration between 5 labs that dug into circuit tracing after our paper in March. Sections on replications, training trans/cross-coders, how attribution graphs compare to other methods, and open problems in interp!
3
6
64
@ch402
Chris Olah
3 months
Valuable synthesis across labs! Make sure to check out the tutorial video -
3
7
123
@GoodfireAI
Goodfire
3 months
New research with coauthors at @Anthropic, @GoogleDeepMind, @AiEleuther, and @decode_research! We expand on and open-source Anthropic’s foundational circuit-tracing work. Brief highlights in thread: (1/7)
3
22
249
@neuronpedia
neuronpedia
3 months
We're grateful to researchers at Anthropic, Google DeepMind, Goodfire AI, EleutherAI, and Decode for sharing resources, knowledge, and open sourcing tools to make this collaboration possible. We look forward to continuing to accelerate interpretability research, together. 🔍🚀
0
1
8
@neuronpedia
neuronpedia
3 months
Want to learn more? Watch the two-part "Attribution Graphs for Dummies", where Anthropic model biology researchers @Jack_W_Lindsey @mlpowered walk @NeelNanda5, @banburismus_ and you through a guided tutorial of circuit tracing. https://t.co/cn6jINZpt9
2
2
9
@neuronpedia
neuronpedia
3 months
Try it yourself! Make your own attribution graphs to visualize the internal reasoning of any custom text prompt, for Gemma 2 and Qwen3 at https://t.co/mm6jugIVoo.
neuronpedia.org
1
2
6
@neuronpedia
neuronpedia
3 months
To check our reasoning hypothesis, we can steer to interrupt specific steps and observe that its output is affected as predicted. Interrupting the "capital" step: Dallas ➡️ Texas ➡️ ❓ ➡️ Texas Interrupting the "Texas" step: Dallas ➡️ ❓➡️ capital ➡️ Albany (New York's capital)
2
1
6
@neuronpedia
neuronpedia
3 months
To get at the "how", Anthropic introduced attribution graphs: visualizations that break down an LLM's reasoning into nodes (features) and links (connections). Giving Gemma 2 the same Austin query to generate a graph, we can trace its steps: Dallas ➡️ Texas ➡️ capital ➡️ Austin
1
2
6
@neuronpedia
neuronpedia
3 months
While "what" methods can find specific concepts, they're insufficient for today's LLMs, which exhibit reasoning capabilities. E.g.: Asking Gemma 2 to find "The capital of the state containing Dallas", we see a list of location-related features, but not "how" it arrives at Austin.
1
2
4
@neuronpedia
neuronpedia
3 months
Often in interpretability, analysis of LLMs is done by observing the "what" of their internals. For example, what neurons/features fire when we ask an LLM to think about "tail wagging"? Here, the top feature is "training/handling dogs" - we can click to see that feature's details.
1
2
5
@neuronpedia
neuronpedia
3 months
Anthropic's model biology paper made a big splash in March. In this post, five interpretability orgs discuss new extensions, replications, and progress including: more efficient training, open problems, and research perspectives. Read the post here ➡️
neuronpedia.org
A multi-organization interpretability project to replicate and extend circuit tracing research.
1
3
18
@neuronpedia
neuronpedia
5 months
Blog post: https://t.co/r2XH5deve9 Reminder: The Residual Stream newsletter (~1/month) is sent out to people who have a Neuronpedia account (free).
neuronpedia.org
Neuronpedia's First Anthropic Collaboration
0
0
1
@neuronpedia
neuronpedia
5 months
In the latest edition of The Residual Stream: - New: Circuit Tracer x Anthropic - New: Concise Auto-Interp Method - Updates + Community Contributions. As usual, the detailed updates are posted on the blog/RSS and newsletter. Link below ⬇️
1
0
4
@a_karvonen
Adam Karvonen
5 months
New Paper! "Robustly Improving LLM Fairness in Realistic Settings via Interpretability." We show that adding realistic details to existing bias evals triggers race and gender bias in LLMs. Prompt tuning doesn’t fix it, but interpretability-based interventions can. 🧵1/7
5
22
147
@swyx
swyx
5 months
I think this is the podcast that finally interp-pilled me. We snuck in a little intro featuring @johnnylin's @neuronpedia and asked about HOW IN THE HECK @anthropicai does all these insanely cracked interp visualizations for their "papers"
@latentspacepod
Latent.Space
5 months
🆕 The Utility of Interpretability: We sat down with @mlpowered of Anthropic's extremely popular latest mechinterp paper on Circuit Tracing to do a deep dive on the pod! Timestamps: 00:00 Intro & Guest Introductions, 01:00 Anthropic's Circuit Tracing Release, 06:11 Exploring
2
7
35
@NeelNanda5
Neel Nanda
5 months
Fantastic to see Anthropic, in collaboration with @neuronpedia, creating open source tools for studying circuits with transcoders. There's a lot of interesting work to be done. I'm also very glad someone finally found a use for our Gemma Scope transcoders! Credit to @ArthurConmy
@AnthropicAI
Anthropic
5 months
Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-sourcing the method. Researchers can generate “attribution graphs” like those in our study, and explore them interactively.
0
13
228
@michaelwhanna
Michael Hanna
5 months
@mntssys and I are excited to announce circuit-tracer, a library that makes circuit-finding simple! Just type in a sentence, and get out a circuit showing (some of) the features your model uses to predict the next token. Try it on @neuronpedia: https://t.co/JYmcZz1f1J
8
45
215