 
            
neuronpedia
@neuronpedia
Followers: 946 · Following: 29 · Media: 23 · Statuses: 59
open source interpretability platform 🧠🧐
the residual stream
Joined July 2023
            
Today, we're releasing The Circuit Analysis Research Landscape: an interpretability post extending & open-sourcing Anthropic's circuit tracing work, co-authored by @Anthropic, @GoogleDeepMind, @GoodfireAI, @AiEleuther, and @decode_research. Here's a quick demo, details follow: ⤵️
          
                
7 replies · 67 reposts · 330 likes
              
             We found "misaligned persona" features in Llama and Qwen that mediate emergent misalignment. Fine-tuning on bad medical advice strengthens these pre-existing features, causing broader undesirable behavior. 
          
            
lesswrong.com: This work was conducted in May 2025 as part of the Anthropic Fellows Program, under the mentorship of Jack Lindsey. We were initially excited about t…
            
                
1 reply · 12 reposts · 79 likes
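One concrete way to read "fine-tuning strengthens these pre-existing features" is to compare a feature's mean activation on a fixed prompt set before vs. after fine-tuning. Below is a minimal sketch of that comparison; the activations and the feature index are random placeholders, not measurements from Llama or Qwen.

```python
# Toy comparison of a "misaligned persona" feature's mean activation on the same
# prompt set before vs. after fine-tuning. All numbers are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n_features, persona_feature = 200, 1024, 42  # hypothetical sizes

# Stand-ins for per-prompt feature activations from the base and fine-tuned models.
base_acts = np.abs(rng.normal(0.1, 0.05, size=(n_prompts, n_features)))
finetuned_acts = base_acts.copy()
finetuned_acts[:, persona_feature] += 0.3  # simulate the strengthened feature

# The feature whose mean activation grew the most is the candidate "persona" feature.
delta = finetuned_acts.mean(axis=0) - base_acts.mean(axis=0)
print("largest mean-activation increase: feature", int(delta.argmax()),
      f"(+{delta.max():.3f})")
```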
              
[Re-uploading the GIF since the initial one was truncated] To check our reasoning hypothesis, we can steer to interrupt specific steps:
Interrupting the "capital" step: Dallas ➡️ Texas ➡️ ❓ ➡️ Texas
Interrupting the "Texas" step: Dallas ➡️ ❓ ➡️ capital ➡️ Albany (New York's capital)
          
                
0 replies · 0 reposts · 1 like
              
             Very cool collaboration between 5 labs that dug into circuit tracing after our paper in March. Sections on replications, training trans/cross-coders, how attribution graphs compare to other methods, and open problems in interp! 
↪️ Quoting @neuronpedia's pinned announcement of The Circuit Analysis Research Landscape (above).
            
                
3 replies · 6 reposts · 64 likes
              
             Valuable synthesis across labs! Make sure to check out the tutorial video - 
↪️ Quoting @neuronpedia's pinned announcement of The Circuit Analysis Research Landscape (above).
            
                
3 replies · 7 reposts · 123 likes
              
             New research with coauthors at @Anthropic, @GoogleDeepMind, @AiEleuther, and @decode_research! We expand on and open-source Anthropic’s foundational circuit-tracing work. Brief highlights in thread: (1/7) 
          
                
3 replies · 22 reposts · 249 likes
              
             We're grateful to researchers at Anthropic, Google DeepMind, Goodfire AI, EleutherAI, and Decode for sharing resources, knowledge, and open sourcing tools to make this collaboration possible. We look forward to continuing to accelerate interpretability research, together. 🔍🚀 
          
                
0 replies · 1 repost · 8 likes
              
Want to learn more? Watch the two-part "Attribution Graphs for Dummies", where Anthropic model biology researchers @Jack_W_Lindsey and @mlpowered walk @NeelNanda5, @banburismus_, and you through a guided tutorial of circuit tracing.  https://t.co/cn6jINZpt9
          
          
                
2 replies · 2 reposts · 9 likes
              
             Try it yourself! Make your own attribution graphs to visualize the internal reasoning of any custom text prompt, for Gemma 2 and Qwen3 at  https://t.co/mm6jugIVoo. 
          
          
            
neuronpedia.org: Attribution Graph
            
                
1 reply · 2 reposts · 6 likes
              
To check our reasoning hypothesis, we can steer to interrupt specific steps and observe that the model's output is affected as predicted.
Interrupting the "capital" step: Dallas ➡️ Texas ➡️ ❓ ➡️ Texas
Interrupting the "Texas" step: Dallas ➡️ ❓ ➡️ capital ➡️ Albany (New York's capital)
          
                
2 replies · 1 repost · 6 likes
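Mechanically, "steering to interrupt a step" means clamping or ablating a feature's activation mid-forward-pass and re-running the prompt. Below is a minimal, self-contained sketch of that kind of intervention using a PyTorch forward hook on a toy model; the model, the "Texas" feature direction, and the readout are illustrative stand-ins, not the actual Gemma 2 intervention code.

```python
# Toy illustration of interrupting a reasoning "step": project one feature
# direction out of a hidden state via a forward hook and compare outputs.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 32

# Stand-in model: two linear layers around a ReLU "hidden layer".
model = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

# Stand-in for a learned "Texas" feature direction in the hidden space.
texas_direction = torch.randn(d_model)
texas_direction /= texas_direction.norm()

def ablate_texas(module, inputs, output):
    # Remove the component of the hidden state along the "Texas" direction,
    # analogous to interrupting the Dallas -> Texas step.
    coeff = output @ texas_direction            # (batch,)
    return output - coeff.unsqueeze(-1) * texas_direction

x = torch.randn(1, d_model)
baseline = model(x)

handle = model[1].register_forward_hook(ablate_texas)  # hook the middle layer
steered = model(x)
handle.remove()

print("output shift from the intervention:", (steered - baseline).norm().item())
```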
              
             To get at the "how", Anthropic introduced attribution graphs: visualizations that break down an LLM's reasoning into nodes (features) and links (connections). Giving Gemma 2 the same Austin query to generate a graph, we can trace its steps: Dallas ➡️ Texas ➡️ capital ➡️ Austin 
          
                
1 reply · 2 reposts · 6 likes
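In data-structure terms, an attribution graph like the one described above is a weighted directed graph from input tokens through features to output logits. Here is a toy sketch of the Dallas ➡️ Texas ➡️ capital ➡️ Austin trace; the node names and edge weights are made up for illustration, not values from a real Gemma 2 graph.

```python
# Toy attribution graph for "The capital of the state containing Dallas is ...".
# Edge weights are made-up attribution strengths, purely illustrative.
import networkx as nx

g = nx.DiGraph()
g.add_edge("token: Dallas", "feature: Texas", weight=0.9)
g.add_edge("feature: Texas", "feature: say a capital", weight=0.7)
g.add_edge("token: capital", "feature: say a capital", weight=0.8)
g.add_edge("feature: say a capital", "logit: Austin", weight=0.95)
g.add_edge("feature: Texas", "logit: Austin", weight=0.4)

# Walk greedily from the Dallas token toward the output, following the strongest
# outgoing edge, which recovers the Dallas -> Texas -> capital -> Austin story.
node = "token: Dallas"
path = [node]
while g.out_degree(node) > 0:
    node = max(g.successors(node), key=lambda n: g[node][n]["weight"])
    path.append(node)
print(" -> ".join(path))
```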
              
             While "what" methods can find specific concepts, they're insufficient for today's LLMs which exhibit reasoning capabilities. Eg: Asking Gemma 2 find "The capital of the state containing Dallas", we see a list of location related features, but not "how" it arrives at Austin. 
          
                
1 reply · 2 reposts · 4 likes
              
Often in interpretability, analysis of LLMs is done by observing the "what" of their internals. For example, what neurons/features fire when we ask an LLM to think about "tail wagging"? Here, the top feature is "training/handling dogs" - we can click to see that feature's details.
          
                
1 reply · 2 reposts · 5 likes
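For a concrete picture of the "what" readout described above: features are typically scored by passing an activation through a sparse-autoencoder-style encoder and taking the top-k. The sketch below uses random weights and hypothetical feature labels, not Gemma 2 activations or Neuronpedia's actual lookup.

```python
# Toy sketch: rank "what fires" for a residual-stream activation using an
# SAE-style encoder. All weights and labels are random/hypothetical stand-ins.
import torch

torch.manual_seed(0)
d_model, n_features, k = 64, 512, 5

# Pretend residual-stream activation at the "tail wagging" token position.
resid = torch.randn(d_model)

# SAE encoder: feature_acts = ReLU(W_enc @ resid + b_enc)
W_enc = torch.randn(n_features, d_model) / d_model**0.5
b_enc = torch.zeros(n_features)
feature_acts = torch.relu(W_enc @ resid + b_enc)

# Hypothetical auto-interp labels (a real platform would look these up by feature id).
labels = {i: f"feature_{i}" for i in range(n_features)}

top = torch.topk(feature_acts, k)
for act, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{labels[idx]:>12}  activation={act:.3f}")
```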
              
Anthropic's model biology paper made a big splash in March. In this post, five interpretability orgs discuss new extensions, replications, and progress, including more efficient training, open problems, and research perspectives. Read the post here ➡️
          
            
neuronpedia.org: A multi-organization interpretability project to replicate and extend circuit tracing research.
            
                
1 reply · 3 reposts · 18 likes
              
             Blog post:  https://t.co/r2XH5deve9  Reminder: The Residual Stream newsletter (~1/month) is sent out to people who have a Neuronpedia account (free). 
          
            
neuronpedia.org: Neuronpedia's First Anthropic Collaboration
            
                
0 replies · 0 reposts · 1 like
              
In the latest edition of The Residual Stream:
- New: Circuit Tracer x Anthropic
- New: Concise Auto-Interp Method
- Updates + Community Contributions
As usual, the detailed updates are posted on the blog/RSS and newsletter. Link below ⬇️
          
                
1 reply · 0 reposts · 4 likes
              
New Paper! Robustly Improving LLM Fairness in Realistic Settings via Interpretability
We show that adding realistic details to existing bias evals triggers race and gender bias in LLMs. Prompt tuning doesn't fix it, but interpretability-based interventions can. 🧵1/7
          
                
5 replies · 22 reposts · 147 likes
              
I think this is the podcast that finally interp-pilled me. We snuck in a little intro featuring @johnnylin's @neuronpedia and asked about HOW IN THE HECK @anthropicai does all these insanely cracked interp visualizations for their "papers"
🆕 The Utility of Interpretability
We sat down with @mlpowered of Anthropic's extremely popular latest mechinterp paper on Circuit Tracing to do a deep dive on the pod!
Timestamps:
00:00 Intro & Guest Introductions
01:00 Anthropic's Circuit Tracing Release
06:11 Exploring…
            
                
2 replies · 7 reposts · 35 likes
              
Fantastic to see Anthropic, in collaboration with @neuronpedia, creating open-source tools for studying circuits with transcoders. There's a lot of interesting work to be done. I'm also very glad someone finally found a use for our Gemma Scope transcoders! Credit to @ArthurConmy
          
           Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-sourcing the method. Researchers can generate “attribution graphs” like those in our study, and explore them interactively. 
          
                
0 replies · 13 reposts · 228 likes
              
            
            @mntssys and I are excited to announce circuit-tracer, a library that makes circuit-finding simple! Just type in a sentence, and get out a circuit showing (some of) the features your model uses to predict the next token. Try it on @neuronpedia:  https://t.co/JYmcZz1f1J 
          
           Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-sourcing the method. Researchers can generate “attribution graphs” like those in our study, and explore them interactively. 
          
                
8 replies · 45 reposts · 215 likes
              