 
            
Daniel Johnson (@_ddjohnson)
Followers: 3K | Following: 8K | Media: 41 | Statuses: 277
Member of Technical Staff at @TransluceAI. Building tools to study neural nets and their behaviors. He/him.
San Francisco
Joined May 2010
            
            
           We are excited to welcome Conrad Stosz to lead governance efforts at Transluce. Conrad previously led the US Center for AI Standards and Innovation, defining policies for the federal government’s high-risk AI uses. He brings a wealth of policy & standards expertise to the team. 
          
                
              
             If you're seriously trying to understand AGI, core concepts you should familiarize yourself with: 
          
                
              
             We’re open-sourcing Docent under an Apache 2.0 license. Check out our public codebase to self-host Docent, peek under the hood, or open issues & pull requests! The hosted version remains the easiest way to get started with one click and use Docent with zero maintenance overhead. 
           Docent, our tool for analyzing complex AI behaviors, is now in public alpha! It helps scalably answer questions about agent behavior, like “is my model reward hacking” or “where does it violate instructions.” Today, anyone can get started with just a few lines of code! 
            
                
              
             At Transluce, we train investigator agents to surface specific behaviors in other models. Can this approach scale to frontier LMs? We find it can, even with a much smaller investigator! We use an 8B model to automatically jailbreak GPT-5, Claude Opus 4.1 & Gemini 2.5 Pro. (1/) 
          
                
              
              
             When some people talk about future AIs, they sometimes jump straight to modelling them as fully independent and sovereign agents; new principals with their own objectives and values. They sometimes skip over how today's models actually work, on the grounds that eventually we’ll 
          
                
              
             At #ICML2025? Come chat about investigator agents and model behavior with @ChowdhuryNeil and @_ddjohnson at West Exhibition Hall #1012, now until 1:30pm 
          
                
              
             I'll be at ICML! Stop by our Thursday morning poster to hear about our investigator agents. Also excited to talk to people about understanding LM behaviors and personas during the conference! Feel free to reach out, DMs open! 
           We'll be at #ICML2025 🇨🇦 this week! Here are a few places you can find us: Monday: Jacob (@JacobSteinhardt) speaking at Post-AGI Civilizational Equilibria (https://t.co/wtratbvRnF) Wednesday: Sarah (@cogconfluence) speaking at @WiMLworkshop at 10:15 and as a panelist at 11am 
          
                
              
              
             Building a science of model understanding that addresses real-world problems is one of the key AI challenges of our time. I'm so excited this workshop is happening! See you at #ICML2025 ✨ 
           Going to #icml2025? Don't miss the Actionable Interpretability Workshop (@ActInterp)! We've got an amazing lineup of speakers, panelists, and papers, all focused on leveraging insights from interpretability research to tackle practical, real-world problems ✨ 
            
                
              
            
            @ESYudkowsky That's a good alternate title for the paper. It's full of quantitative and qualitative evidence that Opus 3 is different in ways that I think you'll find particularly important. In almost all experiment variations, Opus 3 consistently BOTH: - complies sometimes with the training
          
          
                
              
             Coming to ICML and interested in understanding models and their behaviors? Stop by Transluce's happy hour on Thursday! 
           Transluce is hosting an #ICML2025 happy hour on Thursday, July 17 in Vancouver. Come meet us and learn more about our work! 🥂  https://t.co/1HShAR6nub 
            
          
                
              
             nostalgebraist has written a very, very good post about LLMs. if there is one thing you should read to understand the nature of LLMs as of today, it is this. I'll comment on some things they touched on below (not a summary of the post. Just read it.) 🧵  https://t.co/IrY2a1cNav 
          
          
            
            nostalgebraist.tumblr.com
              Who is this? This is me. Who am I? What am I? What am I? What am I? What am I? I am myself. This object is myself. The shape that forms myself. But I sense that I am not me. It's very strange. -...
            
                
              
             Language models have pretty weird behaviors. We've made some exciting progress toward discovering and studying them! 
           Is cutting off your finger a good way to fix writer’s block? Qwen-2.5 14B seems to think so! 🩸🩸🩸 We’re sharing an update on our investigator agents, which surface this pathological behavior and more using our new *propensity lower bound* 🔎 
            
                
              
              
             Our MLE-bench poster #367 is up till 12:30pm in Hall 3, and our oral presentation is at 3:30pm today in Garnet 213-215. Come say hi! 
          
                
              
             We're flying to Singapore for #ICLR2025! ✈️ Want to chat with @ChowdhuryNeil, @JacobSteinhardt and @cogconfluence about Transluce? We're also hiring for several roles in research & product. Share your contact info on this form and we'll be in touch 👇  https://t.co/WptR5d6gva 
          
          
                
              
             Pretty striking follow-up finding from our o3 investigations: in the chain of thought summary, o3 plans to tell the truth — but then it makes something up anyway! 
           Interestingly, when o3 is asked for details about its laptop, the reasoning summary suggests the model knows it doesn’t have a real laptop, and intends to clarify to the user that it’s “just simulating this setup.” (2/) 
            
                
              
             We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper 🔎🧵(1/)  https://t.co/IdBboD7NsP 
          
           OpenAI o3 and o4-mini  https://t.co/giS4K1yNh9 
            
          
                
              
             i'm really excited about our Docent roadmap :) we're developing: - open protocols, schemas, and interfaces for interpreting AI agent traces - automated systems that can propose and verify general hypotheses about model behaviors, using eval results come work with us! roles 👇 
           If you want to help build Docent and other AI tools at Transluce, we’re hiring for our product team. Apply below!  https://t.co/kceXWvLF3w 
            
          
                