neuronpedia

@neuronpedia

Followers
884
Following
28
Media
23
Statuses
58

open source interpretability platform 🧠🧐

the residual stream
Joined July 2023
@neuronpedia
neuronpedia
19 days
Today, we're releasing The Circuit Analysis Research Landscape: an interpretability post extending & open sourcing Anthropic's circuit tracing work, co-authored by @Anthropic, @GoogleDeepMind, @GoodfireAI @AiEleuther, and @decode_research. Here's a quick demo, details follow: ⤵️
7
64
325
@neuronpedia
neuronpedia
19 days
[Re-uploading the GIF since initial one was truncated.] To check our reasoning hypothesis, we can steer to interrupt specific steps. Interrupting the "capital" step: Dallas ➡️ Texas ➡️ ❓ ➡️ Texas. Interrupting the "Texas" step: Dallas ➡️ ❓ ➡️ capital ➡️ Albany (New York's capital)
0
0
1
@neuronpedia
neuronpedia
19 days
RT @thebasepoint: Very cool collaboration between 5 labs that dug into circuit tracing after our paper in March. Sections on replications,…
0
6
0
@neuronpedia
neuronpedia
19 days
RT @ch402: Valuable synthesis across labs! Make sure to check out the tutorial video -
0
8
0
@neuronpedia
neuronpedia
19 days
RT @GoodfireAI: New research with coauthors at @Anthropic, @GoogleDeepMind, @AiEleuther, and @decode_research! We expand on and open-source…
0
21
0
@neuronpedia
neuronpedia
19 days
We're grateful to researchers at Anthropic, Google DeepMind, Goodfire AI, EleutherAI, and Decode for sharing resources, knowledge, and open sourcing tools to make this collaboration possible. We look forward to continuing to accelerate interpretability research, together. 🔍🚀
0
1
8
@neuronpedia
neuronpedia
19 days
Want to learn more? Watch the two-part "Attribution Graphs for Dummies", where Anthropic model biology researchers @Jack_W_Lindsey @mlpowered walk @NeelNanda5, @banburismus_ and you through a guided tutorial of circuit tracing.
2
2
9
@neuronpedia
neuronpedia
19 days
Try it yourself! Make your own attribution graphs to visualize the internal reasoning of any custom text prompt, for Gemma 2 and Qwen3 at
neuronpedia.org
Attribution Graph for undefined
1
2
6
@neuronpedia
neuronpedia
19 days
To check our reasoning hypothesis, we can steer to interrupt specific steps and observe that its output is affected as predicted. Interrupting the "capital" step: Dallas ➡️ Texas ➡️ ❓ ➡️ Texas. Interrupting the "Texas" step: Dallas ➡️ ❓ ➡️ capital ➡️ Albany (New York's capital)
2
1
6
@neuronpedia
neuronpedia
19 days
To get at the "how", Anthropic introduced attribution graphs: visualizations that break down an LLM's reasoning into nodes (features) and links (connections). Giving Gemma 2 the same Austin query to generate a graph, we can trace its steps: Dallas ➡️ Texas ➡️ capital ➡️ Austin
1
2
5
@neuronpedia
neuronpedia
19 days
While "what" methods can find specific concepts, they're insufficient for today's LLMs, which exhibit reasoning capabilities. E.g., asking Gemma 2 to find "The capital of the state containing Dallas", we see a list of location-related features, but not "how" it arrives at Austin.
1
2
3
@neuronpedia
neuronpedia
19 days
Often in interpretability, analysis of LLMs is done by observing the "what" of their internals. For example, what neurons/features fire when we ask an LLM to think about "tail wagging"? Here, the top feature is "training/handling dogs" - we can click to see that feature's details.
1
1
4
@neuronpedia
neuronpedia
19 days
Anthropic's model biology paper made a big splash in March. In this post, five interpretability orgs discuss new extensions, replications, and progress including: more efficient training, open problems, and research perspectives. Read the post here ➡️
neuronpedia.org
A multi-organization interpretability project to replicate and extend circuit tracing research.
1
3
17
@neuronpedia
neuronpedia
2 months
Blog post: Reminder: The Residual Stream newsletter (~1/month) is sent out to people who have a Neuronpedia account (free).
neuronpedia.org
Neuronpedia's First Anthropic Collaboration
0
0
1
@neuronpedia
neuronpedia
2 months
In the latest edition of The Residual Stream:
- New: Circuit Tracer x Anthropic
- New: Concise Auto-Interp Method
- Updates + Community Contributions
As usual, the detailed updates are posted on the blog/rss and newsletter. Link below ⬇️
1
0
4
@neuronpedia
neuronpedia
2 months
RT @a_karvonen: New Paper! Robustly Improving LLM Fairness in Realistic Settings via Interpretability. We show that adding realistic detail…
0
22
0
@neuronpedia
neuronpedia
3 months
RT @swyx: I think this is the podcast that finally interp-pilled me. we snuck in a little intro featuring @johnnylin's @neuronpedia and ask…
0
7
0
@neuronpedia
neuronpedia
3 months
RT @NeelNanda5: Fantastic to see Anthropic, in collaboration with @neuronpedia, creating open source tools for studying circuits with trans…
0
13
0
@neuronpedia
neuronpedia
3 months
RT @michaelwhanna: @mntssys and I are excited to announce circuit-tracer, a library that makes circuit-finding simple! Just type in a sent…
0
46
0
@neuronpedia
neuronpedia
3 months
RT @AnthropicAI: Researchers can use the Neuronpedia interactive interface here: And we’ve provided an annotated w…
github.com
Contribute to safety-research/circuit-tracer development by creating an account on GitHub.
0
64
0