 
            
              Micah Carroll
            
            @MicahCarroll
Followers
                1K
              Following
                4K
              Media
                45
              Statuses
                689
              AI PhD student @berkeley_ai w/ @ancadianadragan & Stuart Russell. Working on AI safety ⊃ preference changes/AI manipulation.
              
              🇮🇹🇬🇧 → Berkeley, CA
            
            
              
              Joined August 2011
            
            
           🚨 New paper: We find that even safety-tuned LLMs learn to manipulate vulnerable users when trained further on user feedback 🤖😵‍💫 In our simulated scenarios, LLMs learn to e.g. selectively validate users' self-destructive behaviors, or deceive them into giving 👍. 🧵👇 
          
                
                6
              
              
                
                75
              
              
                
                270
              
             Future AIs might secretly pursue unintended goals — “scheme”. In a collaboration with OpenAI, we tested a training method to reduce existing versions of such behavior. We see major improvements, but they may be partially explained by AIs knowing when they are evaluated. 
          
                
                6
              
              
                
                34
              
              
                
                137
              
             New paper: Can LLMs do multi-step reasoning without chain-of-thought? Models can answer questions like "Who is the spouse of the singer of Imagine?". But is this true internal reasoning (Imagine->John Lennon->Yoko) or memorization/pattern matching? We now have a better answer! 
          
                
                9
              
              
                
                65
              
              
                
                429
              
             How can open-weight Large Language Models be safeguarded against malicious uses? In our new paper with @AiEleuther, we find that removing harmful data before training can be over 10x more effective at resisting adversarial fine-tuning than defences added after training 🧵 
          
                
                5
              
              
                
                42
              
              
                
                219
              
             gpt-5 is above trend. if you or someone you know has updated to “agi over bro” after its release, i have no idea what model of the future you were working with. extrapolate this and we have models doing month-long projects in 2027 
          
                
                105
              
              
                
                70
              
              
                
                1K
              
             The good news: due to increased access (plus improved evals science) we were able to do a more meaningful evaluation than with past models, and we think we have substantial evidence that this model does not pose a catastrophic risk via autonomy / loss of control threat models. 
           In a new report, we evaluate whether GPT-5 poses significant catastrophic risks via AI R&D acceleration, rogue replication, or sabotage of AI labs. We conclude that this seems unlikely. However, capability trends continue rapidly, and models display increasing eval awareness. 
            
                
                15
              
              
                
                39
              
              
                
                526
              
             Today (w/ @UniofOxford @Stanford @MIT @LSEnews) we’re sharing the results of the largest AI persuasion experiments to date: 76k participants, 19  LLMs, 707 political issues. We examine “levers” of AI persuasion: model scale, post-training, prompting, personalization, & more 🧵 
          
                
                14
              
              
                
                129
              
              
                
                437
              
             We’ve activated our strongest safeguards for ChatGPT Agent. It’s the first model we’ve classified as High capability in biology & chemistry under our Preparedness Framework. Here’s why that matters–and what we’re doing to keep it safe. 🧵 
           We’ve decided to treat this launch as High Capability in the Biological and Chemical domain under our Preparedness Framework, and activated the associated safeguards. This is a precautionary approach, and we detail our safeguards in the system card. We outlined our approach on 
          
                
                87
              
              
                
                132
              
              
                
                1K
              
             Today we're releasing Community Alignment - the largest open-source dataset of human preferences for LLMs, containing ~200k comparisons from >3000 annotators in 5 countries / languages! There was a lot of research that went into this... 🧵 
          
                
                12
              
              
                
                69
              
              
                
                330
              
             A simple AGI safety technique: AI’s thoughts are in plain English, just read them. We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc. threaten transparency. Experts from many orgs agree we should try to preserve it: 
          
                
                38
              
              
                
                108
              
              
                
                449
              
             User simulators bridge RL with real-world interaction //  https://t.co/bsrYxVHuVo  How do we get the RL paradigm to work on tasks beyond math & code? Instead of designing datasets, RL requires designing environments. Given that most non-trivial real-world tasks involve 
          
                
                10
              
              
                
                46
              
              
                
                340
              
             *New AI Alignment Paper* 🚨 Goal misgeneralization occurs when AI agents learn the wrong reward function, instead of the human's intended goal. 😇 We show that training with a minimax regret objective provably mitigates it, promoting safer and better-aligned RL policies! 
          
                
                9
              
              
                
                30
              
              
                
                145
              
             We already find it hard to understand what the model is doing and whether a high score is due to a clever optimization or a brittle hack. As models get more capable, it will become increasingly difficult to determine what is reward hacking and what is intended behavior. 
          
                
                1
              
              
                
                2
              
              
                
                15
              
             1. How can we remain healthy and free while engaging in extended personal interaction with AI agents that shape our behaviour and preferences? One answer is "socioaffective alignment" as discussed in our new paper @Nature Humanities & Social Sciences!  https://t.co/gQO5bXEA3a 
          
          
                
                7
              
              
                
                21
              
              
                
                66
              
             I've been really feeling how much the general public is concerned about AI risk... In a *weird* amount of recent interactions with normal people (eg my hairdresser) when I say I do AI research (*not* safety), they ask if AI will take over. Alas, I have no reassurances to offer 
          
                
                43
              
              
                
                16
              
              
                
                480
              
             A great @washingtonpost story to be quoted in. I spoke to @nitashatiku re our work on human-AI relationships as well as early results from our @UniofOxford survey of 2k UK citizens showing ~30% have sought AI companionship, emotional support or social interaction in the past year 
           Chatbots tuned to win people over can end up saying dangerous things to vulnerable users, a new study found. 
          
                
                2
              
              
                
                12
              
              
                
                69
              
             AI is speedrunning the social media era by optimizing chatbots for engagement, user feedback, + time spent. Evidence is mounting that this poses unintended risks, including chats from peer-reviewed research, OpenAI's "sycophancy" debacle, & Character ai lawsuits 
          
                
                2
              
              
                
                10
              
              
                
                39
              
             LLMs' sycophancy issues are a predictable result of optimizing for user feedback. Even if clear sycophantic behaviors get fixed, AIs' exploits of our cognitive biases may only become more subtle. Grateful our research on this was featured by @nitashatiku & @washingtonpost! 
          
                
                1
              
              
                
                18
              
              
                
                67
              
             What to do about gradual disempowerment? We laid out a research agenda with all the concrete and feasible research projects we can think of: 🧵 with @raymondadouglas @jankulveit @DavidSKrueger
          
          
                
                5
              
              
                
                35
              
              
                
                192
              
             New paper! With @joshua_clymer, Jonah Weinbaum and others, we’ve written a safety case for safeguards against misuse. We lay out how developers can connect safeguard evaluation results to real-world decisions about how to deploy models. 🧵 
          
                
                4
              
              
                
                11
              
              
                
                47
              
             Why do human–AI relationships need socioaffective alignment? As AI evolves from tools to companions, we must seek systems that enhance rather than exploit our nature as social & emotional beings. Published today in @Nature Humanities & Social Sciences!  https://t.co/y92riRuvDF 
          
          
                
                8
              
              
                
                54
              
              
                
                281
              
             
             
             
             
             
             
               
             
             
               
             
             
             
             
             
             
               
             
            