neuronpedia @neuronpedia X Profile

neuronpedia

@neuronpedia

Followers

884

Following

28

Media

23

Statuses

58

open source interpretability platform 🧠🧐

the residual stream

Joined July 2023


neuronpedia

@neuronpedia

19 days

Today, we're releasing The Circuit Analysis Research Landscape: an interpretability post extending & open sourcing Anthropic's circuit tracing work, co-authored by @Anthropic, @GoogleDeepMind, @GoodfireAI @AiEleuther, and @decode_research. Here's a quick demo, details follow: ⤵️

7

64

325

neuronpedia

@neuronpedia

19 days

[Re-uploading the GIF since initial one was truncated]. To check our reasoning hypothesis, we can steer to interrupt specific steps. Interrupting the "capital" step: Dallas ➡️ Texas ➡️ ❓ ➡️ Texas. Interrupting the "Texas" step: Dallas ➡️ ❓ ➡️ capital ➡️ Albany (New York's capital)

0

1

neuronpedia

@neuronpedia

19 days

RT @thebasepoint: Very cool collaboration between 5 labs that dug into circuit tracing after our paper in March. Sections on replications,…

0

6

0

neuronpedia

@neuronpedia

19 days

RT @ch402: Valuable synthesis across labs! Make sure to check out the tutorial video -

0

8

0

neuronpedia

@neuronpedia

19 days

RT @GoodfireAI: New research with coauthors at @Anthropic, @GoogleDeepMind, @AiEleuther, and @decode_research! We expand on and open-source…

0

21

0

neuronpedia

@neuronpedia

19 days

We're grateful to researchers at Anthropic, Google DeepMind, Goodfire AI, EleutherAI, and Decode for sharing resources, knowledge, and open sourcing tools to make this collaboration possible. We look forward to continuing to accelerate interpretability research, together. 🔍🚀

0

1

8

neuronpedia

@neuronpedia

19 days

Want to learn more? Watch the two-part "Attribution Graphs for Dummies", where Anthropic model biology researchers @Jack_W_Lindsey @mlpowered walk @NeelNanda5, @banburismus_ and you through a guided tutorial of circuit tracing.

2

9

neuronpedia

@neuronpedia

19 days

Try it yourself! Make your own attribution graphs to visualize the internal reasoning of any custom text prompt, for Gemma 2 and Qwen3 at

neuronpedia.org

Attribution Graph for undefined

1

2

6

neuronpedia

@neuronpedia

19 days

To check our reasoning hypothesis, we can steer to interrupt specific steps and observe that its output is affected as predicted. Interrupting the "capital" step: Dallas ➡️ Texas ➡️ ❓ ➡️ Texas. Interrupting the "Texas" step: Dallas ➡️ ❓ ➡️ capital ➡️ Albany (New York's capital)

2

1

6

neuronpedia

@neuronpedia

19 days

To get at the "how", Anthropic introduced attribution graphs: visualizations that break down an LLM's reasoning into nodes (features) and links (connections). Giving Gemma 2 the same Austin query to generate a graph, we can trace its steps: Dallas ➡️ Texas ➡️ capital ➡️ Austin

1

2

5
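
The nodes-and-links idea in the post above can be sketched as a toy data structure. This is only an illustration: the feature names, the link weights, and the `strongest_path` helper are all made up, and this is not the circuit-tracer or Neuronpedia API. Real attribution graphs contain many more features and are not read off by a simple greedy walk.

```python
# Toy attribution graph: nodes are features, links carry attribution weights.
# Every node name and weight here is hypothetical, chosen for illustration.
graph = {
    "Dallas": {"Texas": 0.9, "city": 0.3},
    "Texas": {"capital": 0.8, "state": 0.4},
    "capital": {"Austin": 0.95},
    "city": {},
    "state": {},
    "Austin": {},
}


def strongest_path(graph, start):
    """Follow the highest-weight outgoing link from each node until a sink."""
    path = [start]
    seen = {start}
    node = start
    while graph.get(node):
        # Pick the neighbour with the largest attribution weight.
        nxt = max(graph[node], key=graph[node].get)
        if nxt in seen:  # guard against cycles in a malformed graph
            break
        path.append(nxt)
        seen.add(nxt)
        node = nxt
    return path


print(" -> ".join(strongest_path(graph, "Dallas")))
```

With these invented weights, the walk from "Dallas" passes through "Texas" and "capital" before reaching "Austin", mirroring the chain described in the post.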

neuronpedia

@neuronpedia

19 days

While "what" methods can find specific concepts, they're insufficient for today's LLMs, which exhibit reasoning capabilities. E.g., asking Gemma 2 to find "The capital of the state containing Dallas", we see a list of location-related features, but not "how" it arrives at Austin.

1

2

3

neuronpedia

@neuronpedia

19 days

Often in interpretability, analysis of LLMs is done by observing the "what" of their internals. For example, what neurons/features fire when we ask an LLM to think about "tail wagging"? Here, the top feature is "training/handling dogs" - we can click to see that feature's details.

1

4

neuronpedia

@neuronpedia

19 days

Anthropic's model biology paper made a big splash in March. In this post, five interpretability orgs discuss new extensions, replications, and progress including: more efficient training, open problems, and research perspectives. Read the post here ➡️

neuronpedia.org

A multi-organization interpretability project to replicate and extend circuit tracing research.

1

3

17

neuronpedia

@neuronpedia

2 months

Blog post: Reminder: The Residual Stream newsletter (~1/month) is sent out to people who have a Neuronpedia account (free).

neuronpedia.org

Neuronpedia's First Anthropic Collaboration

0

1

neuronpedia

@neuronpedia

2 months

In the latest edition of The Residual Stream:
- New: Circuit Tracer x Anthropic
- New: Concise Auto-Interp Method
- Updates + Community Contributions
As usual, the detailed updates are posted on the blog/rss and newsletter. Link below ⬇️

1

0

4

neuronpedia

@neuronpedia

2 months

RT @a_karvonen: New Paper! Robustly Improving LLM Fairness in Realistic Settings via Interpretability. We show that adding realistic detail…

0

22

0

neuronpedia

@neuronpedia

3 months

RT @swyx: I think this is the podcast that finally interp-pilled me. we snuck in a little intro featuring @johnnylin's @neuronpedia and ask…

0

7

0

neuronpedia

@neuronpedia

3 months

RT @NeelNanda5: Fantastic to see Anthropic, in collaboration with @neuronpedia, creating open source tools for studying circuits with trans…

0

13

0

neuronpedia

@neuronpedia

3 months

RT @michaelwhanna: @mntssys and I are excited to announce circuit-tracer, a library that makes circuit-finding simple! Just type in a sent…

0

46

0

neuronpedia

@neuronpedia

3 months

RT @AnthropicAI: Researchers can use the Neuronpedia interactive interface here: And we’ve provided an annotated w…

github.com

Contribute to safety-research/circuit-tracer development by creating an account on GitHub.

0

64

0