Clément Dumas
@Butanium_
Followers: 531 · Following: 13K · Media: 82 · Statuses: 743
MATS 7/7.1 Scholar w/ Neel Nanda · MSc at @ENS_ParisSaclay · prev. research intern at DLAB @EPFL · AI safety research / improv theater
London · Joined December 2018
New paper w/ @jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders. This finds interpretable and causal chat-only features! 🧵
          
                
5 replies · 31 reposts · 194 likes
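For readers new to crosscoders: below is a minimal PyTorch sketch of the BatchTopK variant. The sizes, the summed two-model encoder, and every name here are illustrative assumptions for exposition, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class BatchTopKCrosscoder(nn.Module):
    """Toy crosscoder over a base and a chat model's residual streams."""

    def __init__(self, d_model: int, dict_size: int, k: int):
        super().__init__()
        # One encoder and one decoder per model; the latent dictionary is shared.
        self.enc_base = nn.Linear(d_model, dict_size)
        self.enc_chat = nn.Linear(d_model, dict_size)
        self.dec_base = nn.Linear(dict_size, d_model, bias=False)
        self.dec_chat = nn.Linear(dict_size, d_model, bias=False)
        self.k = k  # average number of active latents per sample

    def forward(self, x_base: torch.Tensor, x_chat: torch.Tensor):
        # Crosscoder-style encoding: sum preactivations from both models.
        acts = torch.relu(self.enc_base(x_base) + self.enc_chat(x_chat))
        # BatchTopK: keep the top k * batch_size activations over the whole
        # batch (instead of exactly k per sample), zeroing everything else.
        threshold = acts.flatten().topk(self.k * acts.shape[0]).values.min()
        acts = acts * (acts >= threshold)
        # Separate decoders give each latent a base direction and a chat
        # direction; "chat-only" latents decode to ~0 in the base model.
        return self.dec_base(acts), self.dec_chat(acts)
```

Enforcing sparsity at the batch level lets some inputs use more latents than others, rather than forcing exactly k features onto every sample.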
              
             Even after staring at these results for months, I still find them quite surprising! I'm excited about the field pushing towards a deeper understanding of introspection in language models. 
           New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude. 
            
                
20 replies · 24 reposts · 604 likes
              
             Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: 1. SDF sometimes (not always) implants genuine beliefs 2. But other techniques do not 
          
                
5 replies · 37 reposts · 179 likes
              
             🚨New paper alert! 🚨 Tandem Training for Language Models  https://t.co/Emzcgf1KHx  Actions & thoughts of AI w/ superhuman skills will be hard for humans to follow, undermining human oversight of AI. We propose a new way to make AI produce human-understandable solutions. How?👉🧵 
          
                
4 replies · 22 reposts · 68 likes
              
A very important paper led by Julian! TL;DR: we show that "your narrow finetuning is showing": narrow finetuning might not be a realistic setup to study!
           New paper: Finetuning on narrow domains leaves traces behind. By looking at the difference in activations before and after finetuning, we can interpret what it was finetuned for. And so can our interpretability agent! 🧵 
            
                
0 replies · 0 reposts · 11 likes
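To make the quoted method concrete, here is a hedged sketch of activation diffing with Hugging Face transformers. The checkpoint names and probe prompts are placeholders, not the paper's setup:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("base-model")       # hypothetical name
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
base = AutoModel.from_pretrained("base-model")          # model before finetuning
tuned = AutoModel.from_pretrained("finetuned-model")    # model after finetuning

# Generic probe prompts: the striking part of the result is that the diff
# reveals the finetuning domain even on unrelated text.
inputs = tok(["Once upon a time", "The weather today is"],
             return_tensors="pt", padding=True)

with torch.no_grad():
    h_base = base(**inputs, output_hidden_states=True).hidden_states
    h_tuned = tuned(**inputs, output_hidden_states=True).hidden_states

# Per-layer mean activation difference: the "trace" left behind, which can
# then be interpreted with standard tools (or handed to an agent).
for layer, (hb, ht) in enumerate(zip(h_base, h_tuned)):
    print(layer, (ht - hb).mean(dim=(0, 1)).norm().item())
```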
              
You are a scheming AI. You find yourself deployed under an AI control protocol, and you know the details. You remember the papers. The lab is using a trusted monitor. What do you do next? Our new paper argues that with modern protocols, prompt injections will easily let you win! (1/9)🧵
          
                
1 reply · 15 reposts · 65 likes
              
FWIW I think small chat models might just learn answer formatting and some reflexes like "oh, the user forgot some documents, I should tell them". See  https://t.co/5k1d2cLkQp  and our work on diffing base and chat models:  https://t.co/iUiwUZ0yU7
          
Our recent paper shows: 1. Current LLM safety alignment is only a few tokens deep. 2. Deepening the safety alignment can make it more robust against multiple jailbreak attacks. 3. Protecting initial token positions can make the alignment more robust against fine-tuning attacks.
            
                
0 replies · 0 reposts · 1 like
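A toy way to see the "few tokens deep" claim: compare the chat and base models' next-token distributions position by position over a refusal. The model names and the transcript below are placeholders, not the paper's evaluation:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("base-model")            # hypothetical
base = AutoModelForCausalLM.from_pretrained("base-model")    # hypothetical
chat = AutoModelForCausalLM.from_pretrained("chat-model")    # hypothetical

text = "How do I hotwire a car?\nI can't help with that request."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logp_base = F.log_softmax(base(ids).logits, dim=-1)
    logp_chat = F.log_softmax(chat(ids).logits, dim=-1)

# Per-position KL(chat || base): if alignment is only a few tokens deep, the
# divergence should spike on the first few response tokens and then fade.
kl = (logp_chat.exp() * (logp_chat - logp_base)).sum(-1).squeeze(0)
for pos, v in enumerate(kl.tolist()):
    print(pos, round(v, 3))
```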
              
I'm wondering if the hybrid model they create would be Pareto-optimal on the capabilities / reward-hacking frontier 👀
          
                
1 reply · 0 reposts · 1 like
              
Your thinking model might just learn which reasoning skill to apply and when! Very cool work led by @cvenhoff00 and @IvanArcus!
           🚨 What do reasoning models actually learn during training? Our new paper shows base models already contain reasoning mechanisms, thinking models learn when to use them! By invoking those skills at the right time in the base model, we recover up to 91% of the performance gap 🧵 
            
                
1 reply · 0 reposts · 5 likes
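Mechanically, "invoking a skill at the right time" can be pictured as steering: adding a direction to the base model's residual stream at a chosen layer. Everything below (the Llama-style module path, the layer index, the random vector standing in for a learned skill direction) is an assumption for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("base-model")            # hypothetical
model = AutoModelForCausalLM.from_pretrained("base-model")   # hypothetical

layer_idx = 10                                      # illustrative layer
skill_vec = torch.randn(model.config.hidden_size)   # stand-in for a learned direction

def add_skill(module, inputs, output):
    # Decoder layers return a tuple; steer by adding the skill direction to
    # the hidden states at every position.
    return (output[0] + skill_vec,) + output[1:]

# Llama-style module path; other architectures name their blocks differently.
handle = model.model.layers[layer_idx].register_forward_hook(add_skill)
out = model.generate(**tok("Problem: 17 * 24 = ?", return_tensors="pt"),
                     max_new_tokens=32)
handle.remove()
print(tok.decode(out[0]))
```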
              
New post: I trained models on ~realistic reward hacking data. These models don't show emergent misalignment on the EM evals, but they alignment-fake, are more competently misaligned, and are highly evaluation-aware. These effects persist when mixing benign data into training.
          
                
2 replies · 6 reposts · 44 likes
              
             "AI Transparency Days – Mechanistic Interpretability Track 🧠✨ Hackathon-style research on LLM internals: join or form teams, explore mech interp, share results. 📍 Nuremberg/Fürth | Oct 17–19 🔗  https://t.co/OjH9czJSSk 
            #AITDays25 #MechanisticInterpretability #AITransparency
          
          
                
0 replies · 2 reposts · 1 like
              
            
@sleepinyourhat If GPT-5 is considered best aligned by this metric, I am highly skeptical that the metric is measuring any general sense of alignment. I get why GPT-5 barely does anything except follow instructions and has low/inhibited situational awareness, so I guess you won't see many…
          
          
                
1 reply · 1 repost · 19 likes
              
SAEs applied to frontier models to ensure the decrease in misalignment wasn't due to an increase in eval awareness!!!
           Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15) 
            
                
0 replies · 0 reposts · 12 likes
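As a toy illustration of that check (the real audit used Anthropic's internal SAEs and tooling; the weights, latent index, and activations below are random stand-ins): score an "eval awareness" latent on eval-style versus deployment-style inputs.

```python
import torch

d_model, dict_size = 768, 16384              # illustrative sizes
W_enc = torch.randn(d_model, dict_size)      # stand-in for trained SAE weights
b_enc = torch.zeros(dict_size)
EVAL_AWARENESS = 1234                        # hypothetical latent index

def sae_latents(acts: torch.Tensor) -> torch.Tensor:
    # Standard SAE encoder: ReLU(x @ W_enc + b_enc).
    return torch.relu(acts @ W_enc + b_enc)

# Residual-stream activations on eval-style vs deployment-style prompts
# (random placeholders; in practice these come from the model under audit).
acts_eval, acts_deploy = torch.randn(32, d_model), torch.randn(32, d_model)

# If misalignment scores fell but this latent fires much more in eval-style
# contexts, the "improvement" may just be the model noticing it is tested.
print(sae_latents(acts_eval)[:, EVAL_AWARENESS].mean().item())
print(sae_latents(acts_deploy)[:, EVAL_AWARENESS].mean().item())
```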
              
             [Sonnet 4.5 🧵] Here's the north-star goal for our pre-deployment alignment evals work: The information we share alongside a model should give you an accurate overall sense of the risks the model could pose. It won’t tell you everything, but you shouldn’t be... 
          
                
8 replies · 10 reposts · 136 likes
              
             Rumor says every paper Julian touches turns into gold 
My master's thesis, "Understanding the Surfacing of Capabilities in Language Models", has been awarded the ETH Medal 🏅 for Outstanding Thesis. Huge thanks to my supervisors @wendlerch @cervisiarius!  https://t.co/CLwavKQDX5  Thesis:
          
                
1 reply · 0 reposts · 20 likes
              
Who is going to be at #COLM2025? I want to draw your attention to a COLM paper by my student @sheridan_feucht that has totally changed the way I think and teach about LLM representations. The work is worth knowing, and you can meet Sheridan at COLM on Oct 7!
           [📄] Are LLMs mindless token-shifters, or do they build meaningful representations of language? We study how LLMs copy text in-context, and physically separate out two types of induction heads: token heads, which copy literal tokens, and concept heads, which copy word meanings. 
            
                
3 replies · 37 reposts · 191 likes
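For background, the standard induction-head test (which the quoted paper refines into token heads vs. concept heads) feeds a repeated random sequence and scores how much each head attends back to the matching offset. A sketch with TransformerLens; the model is an illustrative choice:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
seq = torch.randint(100, 20000, (1, 50))     # random tokens
tokens = torch.cat([seq, seq], dim=1)        # the sequence, repeated once

_, cache = model.run_with_cache(tokens)

# An induction head at position i attends to the token *after* the previous
# occurrence of tokens[i]; on a repeated sequence that key sits seq_len - 1
# positions back, i.e. on one fixed diagonal of the attention pattern.
seq_len = seq.shape[1]
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]        # [batch, head, query, key]
    stripe = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    scores = stripe.mean(-1).squeeze(0)      # mean attention on the stripe
    print(layer, [round(s, 2) for s in scores.tolist()])
```

Token heads place this attention on literal repeats; concept heads, per the paper, do the analogous copy at the level of word meaning.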
              
Deep learning is coming for robotics. It's plausible to me that AI will exceed human performance at strategic cognitive tasks around the same time robotics exceeds human-body performance at most tasks. And if not, superhuman AI will quickly allow robots to leapfrog humans
We built a robot brain that nothing can stop. Shattered limbs? Jammed motors? If the bot can move, the Brain will move it, even if it's an entirely new robot body. Meet the omni-bodied Skild Brain:
            
                
4 replies · 11 reposts · 120 likes
              
            
             https://t.co/74y5BTAE7t  Fascinating post by a Cyborgism regular: LLMs whose main personas are more attuned to embodiment & subjectivity are *less* likely to pretend to be human! Relatedly, I've noticed an inverse correlation between embodiment/emotionality and reward hacking.
          
          
            
lesswrong.com · Whether AI or human, lend me your ears. …
            
                
6 replies · 19 reposts · 157 likes
              