 
            
              Harry Coppock
            
            @HarryCoppock
Followers
                197
              Following
                1K
              Media
                18
              Statuses
                195
              No. 10 Downing Street Innovation Fellow | Research Scientist at AISI | Visiting Lecturer at Imperial College London Working on AI Evaluation and AI for Medicine
              
              London 
            
            
              
              Joined March 2021
            
            
           Very excited that this systematic analysis is out! We found a bunch of failure modes, as well as interesting and surprising behaviours. Theres a lot more insight we can get from looking carefully at how models are solving evaluation tasks! 
           Measuring how often an AI agent succeeds at a task can help us assess its capabilities – but it doesn’t tell the whole story. We’ve been experimenting with transcript analysis to better understand not just how often agents succeed, but why they fail 🧵 
          
                
                1
              
              
                
                2
              
              
                
                3
              
             New research with the UK @AISecurityInst and the @turinginst: We found that just a few malicious documents can produce vulnerabilities in an LLM—regardless of the size of the model or its training data. Data-poisoning attacks might be more practical than previously believed. 
          
                
                88
              
              
                
                245
              
              
                
                2K
              
             We at @AISecurityInst recently did our first pre-deployment 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 evaluation of @AnthropicAI's Claude Sonnet 4.5! This was a first attempt – and we plan to work on this more! – but we still found some interesting results, and some learnings for next time 🧵 
          
                
                3
              
              
                
                12
              
              
                
                49
              
             Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6 
          
                
                8
              
              
                
                62
              
              
                
                300
              
             🔎 People are increasingly using chatbots to seek out new information, raising concerns about how they could misinform voters or distort public opinion. But how is AI actually influencing real-world political beliefs? Our new study explores this question 👇 
          
                
                2
              
              
                
                6
              
              
                
                21
              
             Since I started working on safeguards, we've seen substantial progress in defending certain hosted models, but less progress in measuring & managing misuse risks from open weight models. Three directions I want explored more, drawn from our @AISecurityInst post today 🧵 
          
                
                1
              
              
                
                7
              
              
                
                36
              
             This is great news for the UK. Having worked with Jade over the past 2 years, setting up @AISecurityInst, I am confident that there are very few, if any, who are better placed to take on this role. 
           Absolutely delighted about this - major upgrade on the last AI adviser! Jade brings a tonne of experience in frontier labs, VC and government and will do an amazing job of ensuring the UK is an AI winner. Excellent news. 
            
                
                0
              
              
                
                0
              
              
                
                4
              
             How can open-weight Large Language Models be safeguarded against malicious uses? In our new paper with @AiEleuther, we find that removing harmful data before training can be over 10x more effective at resisting adversarial fine-tuning than defences added after training 🧵 
          
                
                5
              
              
                
                43
              
              
                
                219
              
             We at @AISecurityInst worked with @OpenAI to test GPT-5's safeguards. We identified multiple jailbreaks, including a universal jailbreak that evades all layers of mitigations and is being patched. Excited to continue partnering with OpenAI to test & strengthen safeguards. 
          
                
                17
              
              
                
                24
              
              
                
                128
              
             We deployed 44 AI agents and offered the internet $170K to attack them. 1.8M attempts, 62K breaches, including data leakage and financial loss. 🚨 Concerningly, the same exploits transfer to live production agents… (example: exfiltrating emails through calendar event) 🧵 
          
                
                70
              
              
                
                393
              
              
                
                2K
              
             📢Introducing the Alignment Project: A new fund for research on urgent challenges in AI alignment and control, backed by over £15 million. ▶️ Up to £1 million per project ▶️ Compute access, venture capital investment, and expert support Learn more and apply ⬇️ 
          
                
                7
              
              
                
                64
              
              
                
                191
              
             We at @AISecurityInst worked with @OpenAI to test & improve Agent’s safeguards prior to release. A few notes on our experience🧵 1/4 
          
                
                3
              
              
                
                29
              
              
                
                152
              
             🧵 AI Systems are developing advanced cyber capabilities. This means they’re helping strengthen defences - but can also be used as threats. To keep on top of these risks, we need more rigorous evaluations of agentic AI, which is why we’re releasing Inspect Cyber 🔍 
          
                
                1
              
              
                
                13
              
              
                
                58
              
             My team is hiring @AISecurityInst! I think this is one of the most important times in history to have strong technical expertise in government. Join our team understanding and fixing weaknesses in frontier models through sota adversarial ML research & testing. 🧵 1/4 
          
                
                4
              
              
                
                37
              
              
                
                172
              
             Brace Yourself: Our Biggest AI Jailbreaking Arena Yet We’re launching a next-level Agent Red-Teaming Challenge—not just chatbots anymore. Think direct & indirect attacks on anonymous frontier models. $100K+ in prizes and raffle giveaways supported by UK @AISecurityInst
          
          
                
                3
              
              
                
                13
              
              
                
                48
              
             Jailbreaking evals ~always focus on simple chatbots—excited to announce AgentHarm, a dataset for measuring harmfulness of LLM 𝑎𝑔𝑒𝑛𝑡𝑠 developed at @AISafetyInst in collaboration with @GraySwanAI! 🧵 1/N 
          
                
                5
              
              
                
                40
              
              
                
                190
              
             AISI is co-hosting DEF CON's generative red teaming challenge this year! Huge thanks to @comathematician @aivillage_dc @defcon for making this happen. (1/6) 
          
                
                1
              
              
                
                6
              
              
                
                29
              
            
            @AISafetyInst will be at @defcon! If you'd like to chat abt attacking, defending, & evaling frontier models, DM me or fill out our form (in 🧵)
          
          
                
                1
              
              
                
                5
              
              
                
                18
              
             We're at #icml2024. If you want to chat about our work or roles, message @herbiebradley (predictive evals) @tomekkorbak (safety cases) @jelennal_ (agents) @CUdudec (testing) @HarryCoppock (cyber evals + AI for med) @oliviagjimenez (recruiting) 
          
                
                1
              
              
                
                4
              
              
                
                11
              
             Is your AI-enabled diagnostic tool accurate, or does your dataset have confounding bias? Our Turing-RSS Health Data Lab paper, published today in Nature Machine Intelligence, investigates audio-based AI classifiers for COVID-19 screening.  https://t.co/ZgexfmyKRX 
          
          
                
                1
              
              
                
                4
              
              
                
                5
              
             
             
               
             
             
             
               
             
            