 
            
neuronpedia
@neuronpedia
Followers: 946 · Following: 29 · Media: 23 · Statuses: 59
open source interpretability platform 🧠🧐
the residual stream
Joined July 2023
            
Today, we're releasing The Circuit Analysis Research Landscape: an interpretability post extending & open-sourcing Anthropic's circuit tracing work, co-authored by @Anthropic, @GoogleDeepMind, @GoodfireAI, @AiEleuther, and @decode_research. Here's a quick demo, details follow: ⤵️
          
                
7 replies · 67 reposts · 330 likes
              
             We found "misaligned persona" features in Llama and Qwen that mediate emergent misalignment. Fine-tuning on bad medical advice strengthens these pre-existing features, causing broader undesirable behavior. 
          
            
lesswrong.com: This work was conducted in May 2025 as part of the Anthropic Fellows Program, under the mentorship of Jack Lindsey. We were initially excited about t…
            
                
1 reply · 12 reposts · 79 likes
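One concrete way to read "fine-tuning strengthens these pre-existing features" is to compare a feature's mean activation on a fixed prompt set before vs. after fine-tuning. Below is a minimal sketch of that comparison; the activations and the feature index are random placeholders, not measurements from Llama or Qwen.

```python
# Toy comparison of a "misaligned persona" feature's mean activation on the same
# prompt set before vs. after fine-tuning. All numbers are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n_features, persona_feature = 200, 1024, 42  # hypothetical sizes

# Stand-ins for per-prompt feature activations from the base and fine-tuned models.
base_acts = np.abs(rng.normal(0.1, 0.05, size=(n_prompts, n_features)))
finetuned_acts = base_acts.copy()
finetuned_acts[:, persona_feature] += 0.3  # simulate the strengthened feature

# The feature whose mean activation grew the most is the candidate "persona" feature.
delta = finetuned_acts.mean(axis=0) - base_acts.mean(axis=0)
print("largest mean-activation increase: feature", int(delta.argmax()),
      f"(+{delta.max():.3f})")
```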
              
[Re-uploading the GIF since the initial one was truncated] To check our reasoning hypothesis, we can steer to interrupt specific steps:
Interrupting the "capital" step: Dallas ➡️ Texas ➡️ ❓ ➡️ Texas
Interrupting the "Texas" step: Dallas ➡️ ❓ ➡️ capital ➡️ Albany (New York's capital)
          
                
0 replies · 0 reposts · 1 like
              
             Very cool collaboration between 5 labs that dug into circuit tracing after our paper in March. Sections on replications, training trans/cross-coders, how attribution graphs compare to other methods, and open problems in interp! 
↪️ Quoting @neuronpedia's pinned announcement of The Circuit Analysis Research Landscape (above).
            
                
3 replies · 6 reposts · 64 likes
              
             Valuable synthesis across labs! Make sure to check out the tutorial video - 
↪️ Quoting @neuronpedia's pinned announcement of The Circuit Analysis Research Landscape (above).
            
                
3 replies · 7 reposts · 123 likes
              
             New research with coauthors at @Anthropic, @GoogleDeepMind, @AiEleuther, and @decode_research! We expand on and open-source Anthropic’s foundational circuit-tracing work. Brief highlights in thread: (1/7) 
          
                
3 replies · 22 reposts · 249 likes
              
             We're grateful to researchers at Anthropic, Google DeepMind, Goodfire AI, EleutherAI, and Decode for sharing resources, knowledge, and open sourcing tools to make this collaboration possible. We look forward to continuing to accelerate interpretability research, together. 🔍🚀 
          
                
0 replies · 1 repost · 8 likes
              
Want to learn more? Watch the two-part "Attribution Graphs for Dummies", where Anthropic model biology researchers @Jack_W_Lindsey and @mlpowered walk @NeelNanda5, @banburismus_, and you through a guided tutorial of circuit tracing.  https://t.co/cn6jINZpt9
          
          
                
2 replies · 2 reposts · 9 likes
              
             Try it yourself! Make your own attribution graphs to visualize the internal reasoning of any custom text prompt, for Gemma 2 and Qwen3 at  https://t.co/mm6jugIVoo. 
          
          
            
neuronpedia.org: Attribution Graph
            
                
1 reply · 2 reposts · 6 likes
              
To check our reasoning hypothesis, we can steer to interrupt specific steps and observe that the model's output is affected as predicted.
Interrupting the "capital" step: Dallas ➡️ Texas ➡️ ❓ ➡️ Texas
Interrupting the "Texas" step: Dallas ➡️ ❓ ➡️ capital ➡️ Albany (New York's capital)
          
                
2 replies · 1 repost · 6 likes
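Mechanically, "steering to interrupt a step" means clamping or ablating a feature's activation mid-forward-pass and re-running the prompt. Below is a minimal, self-contained sketch of that kind of intervention using a PyTorch forward hook on a toy model; the model, the "Texas" feature direction, and the readout are illustrative stand-ins, not the actual Gemma 2 intervention code.

```python
# Toy illustration of interrupting a reasoning "step": project one feature
# direction out of a hidden state via a forward hook and compare outputs.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 32

# Stand-in model: two linear layers around a ReLU "hidden layer".
model = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

# Stand-in for a learned "Texas" feature direction in the hidden space.
texas_direction = torch.randn(d_model)
texas_direction /= texas_direction.norm()

def ablate_texas(module, inputs, output):
    # Remove the component of the hidden state along the "Texas" direction,
    # analogous to interrupting the Dallas -> Texas step.
    coeff = output @ texas_direction            # (batch,)
    return output - coeff.unsqueeze(-1) * texas_direction

x = torch.randn(1, d_model)
baseline = model(x)

handle = model[1].register_forward_hook(ablate_texas)  # hook the middle layer
steered = model(x)
handle.remove()

print("output shift from the intervention:", (steered - baseline).norm().item())
```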
              
             To get at the "how", Anthropic introduced attribution graphs: visualizations that break down an LLM's reasoning into nodes (features) and links (connections). Giving Gemma 2 the same Austin query to generate a graph, we can trace its steps: Dallas ➡️ Texas ➡️ capital ➡️ Austin 
          
                
1 reply · 2 reposts · 6 likes
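In data-structure terms, an attribution graph like the one described above is a weighted directed graph from input tokens through features to output logits. Here is a toy sketch of the Dallas ➡️ Texas ➡️ capital ➡️ Austin trace; the node names and edge weights are made up for illustration, not values from a real Gemma 2 graph.

```python
# Toy attribution graph for "The capital of the state containing Dallas is ...".
# Edge weights are made-up attribution strengths, purely illustrative.
import networkx as nx

g = nx.DiGraph()
g.add_edge("token: Dallas", "feature: Texas", weight=0.9)
g.add_edge("feature: Texas", "feature: say a capital", weight=0.7)
g.add_edge("token: capital", "feature: say a capital", weight=0.8)
g.add_edge("feature: say a capital", "logit: Austin", weight=0.95)
g.add_edge("feature: Texas", "logit: Austin", weight=0.4)

# Walk greedily from the Dallas token toward the output, following the strongest
# outgoing edge, which recovers the Dallas -> Texas -> capital -> Austin story.
node = "token: Dallas"
path = [node]
while g.out_degree(node) > 0:
    node = max(g.successors(node), key=lambda n: g[node][n]["weight"])
    path.append(node)
print(" -> ".join(path))
```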
              
             While "what" methods can find specific concepts, they're insufficient for today's LLMs which exhibit reasoning capabilities. Eg: Asking Gemma 2 find "The capital of the state containing Dallas", we see a list of location related features, but not "how" it arrives at Austin. 
          
                
1 reply · 2 reposts · 4 likes
              
Often in interpretability, analysis of LLMs is done by observing the "what" of their internals. For example, what neurons/features fire when we ask an LLM to think about "tail wagging"? Here, the top feature is "training/handling dogs" - we can click to see that feature's details.
          
                
1 reply · 2 reposts · 5 likes
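For a concrete picture of the "what" readout described above: features are typically scored by passing an activation through a sparse-autoencoder-style encoder and taking the top-k. The sketch below uses random weights and hypothetical feature labels, not Gemma 2 activations or Neuronpedia's actual lookup.

```python
# Toy sketch: rank "what fires" for a residual-stream activation using an
# SAE-style encoder. All weights and labels are random/hypothetical stand-ins.
import torch

torch.manual_seed(0)
d_model, n_features, k = 64, 512, 5

# Pretend residual-stream activation at the "tail wagging" token position.
resid = torch.randn(d_model)

# SAE encoder: feature_acts = ReLU(W_enc @ resid + b_enc)
W_enc = torch.randn(n_features, d_model) / d_model**0.5
b_enc = torch.zeros(n_features)
feature_acts = torch.relu(W_enc @ resid + b_enc)

# Hypothetical auto-interp labels (a real platform would look these up by feature id).
labels = {i: f"feature_{i}" for i in range(n_features)}

top = torch.topk(feature_acts, k)
for act, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{labels[idx]:>12}  activation={act:.3f}")
```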
              
Anthropic's model biology paper made a big splash in March. In this post, five interpretability orgs discuss new extensions, replications, and progress, including more efficient training, open problems, and research perspectives. Read the post here ➡️
          
            
neuronpedia.org: A multi-organization interpretability project to replicate and extend circuit tracing research.
            
                
1 reply · 3 reposts · 18 likes
              
             Blog post:  https://t.co/r2XH5deve9  Reminder: The Residual Stream newsletter (~1/month) is sent out to people who have a Neuronpedia account (free). 
          
            
neuronpedia.org: Neuronpedia's First Anthropic Collaboration
            
                
0 replies · 0 reposts · 1 like
              
In the latest edition of The Residual Stream:
- New: Circuit Tracer x Anthropic
- New: Concise Auto-Interp Method
- Updates + Community Contributions
As usual, the detailed updates are posted on the blog/RSS and newsletter. Link below ⬇️
          
                
1 reply · 0 reposts · 4 likes
              
New Paper! Robustly Improving LLM Fairness in Realistic Settings via Interpretability
We show that adding realistic details to existing bias evals triggers race and gender bias in LLMs. Prompt tuning doesn't fix it, but interpretability-based interventions can. 🧵1/7
          
                
5 replies · 22 reposts · 147 likes
              
I think this is the podcast that finally interp-pilled me. We snuck in a little intro featuring @johnnylin's @neuronpedia and asked about HOW IN THE HECK @anthropicai does all these insanely cracked interp visualizations for their "papers"
🆕 The Utility of Interpretability
We sat down with @mlpowered of Anthropic's extremely popular latest mechinterp paper on Circuit Tracing to do a deep dive on the pod!
Timestamps:
00:00 Intro & Guest Introductions
01:00 Anthropic's Circuit Tracing Release
06:11 Exploring…
            
                
2 replies · 7 reposts · 35 likes
              
Fantastic to see Anthropic, in collaboration with @neuronpedia, creating open-source tools for studying circuits with transcoders. There's a lot of interesting work to be done. I'm also very glad someone finally found a use for our Gemma Scope transcoders! Credit to @ArthurConmy
          
           Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-sourcing the method. Researchers can generate “attribution graphs” like those in our study, and explore them interactively. 
          
                
0 replies · 13 reposts · 228 likes
              
            
            @mntssys and I are excited to announce circuit-tracer, a library that makes circuit-finding simple! Just type in a sentence, and get out a circuit showing (some of) the features your model uses to predict the next token. Try it on @neuronpedia:  https://t.co/JYmcZz1f1J 
          
           Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-sourcing the method. Researchers can generate “attribution graphs” like those in our study, and explore them interactively. 
          
                
8 replies · 45 reposts · 215 likes
              