
Rohin Shah
@rohinmshah
Followers
9K
Following
327
Media
24
Statuses
346
AGI Safety & Alignment @ Google DeepMind
London, UK
Joined October 2017
Chain of thought monitoring looks valuable enough that we’ve put it in our Frontier Safety Framework to address deceptive alignment. This paper is a good explanation of why we’re optimistic – but also why it may be fragile, and what to do to preserve it. https://t.co/Q6B0XXqmPq
A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:
4
9
85
Of course, chain-of-thought monitoring won't scale arbitrarily far, so we're continuing to research additional alignment and control techniques that can reduce risks further. Read more in our blog post:
deepmindsafetyresearch.medium.com
By Victoria Krakovna, Scott Emmons, Erik Jenner, Mary Phuong, Lewis Ho, and Rohin Shah
0
1
19
Eventually models will become capable of stealth and situational awareness. At that point we expect to rely on chain-of-thought monitoring. Our second paper suggests that it will be difficult for models to evade such monitoring to cause severe harm. https://t.co/0Hgjc8xDPJ
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models
7
15
109
Two new papers that elaborate on our approach to deceptive alignment! First paper: we evaluate the model's *stealth* and *situational awareness* -- if they don't have these capabilities, they likely can't cause severe harm. https://t.co/kywkwyH0Ld
As models advance, a key AI safety concern is deceptive alignment / "scheming" – where AI might covertly pursue unintended goals. Our paper "Evaluating Frontier Models for Stealth and Situational Awareness" assesses whether current models can scheme. https://t.co/PrfZcIVuEw
2
14
107
Per our Frontier Safety Framework, we continue to test our models for critical capabilities. Here’s the updated model card for Gemini 2.5Pro with frontier safety evaluations + explanation of how our safety buffer / alert thresholds approach applies to 2.0, 2.5, and what’s coming.
2
13
79
What if LLMs are sometimes capable of doing a task but don't try hard enough to do it? In a new paper, we use subtasks to assess capabilities. Perhaps surprisingly, LLMs often fail to fully employ their capabilities, i.e. they are not fully *goal-directed* 🧵
22
45
236
We’re excited to contribute to the broader conversation on AGI safety and hope to collaborate with others on these topics. Check out the blog post: https://t.co/7Z0ZrP5lMb Paper (with a 10 page summary at the start):
arxiv.org
Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to...
1
2
21
Misalignment measures are much more multifaceted – our many pages are too big to fit in this margin. The approach aims to have two lines of defense – (1) train the model not to pursue misaligned goals and (2) mitigate harm even if the model pursues misaligned goals.
1
1
16
For misuse, the key idea is to block access of those capabilities for bad actors, via safety training, monitoring, and model weight security. The Safety and Security teams have worked especially closely on misuse, and we put it in practice with our Frontier Safety Framework.
1
0
10
There are many kinds of risks with many kinds of mitigations. To guide our thinking, we’ve grouped together different risks based on similarity of mitigations, leading to four areas: misuse, misalignment, mistakes and structural risks.
3
0
20
Just released GDM’s 100+ page approach to AGI safety & security! (Don’t worry, there’s a 10 page summary.) AGI will be transformative. It enables massive benefits, but could also pose risks. Responsible development means proactively preparing for severe harms before they arise.
AGI could revolutionize many fields - from healthcare to education - but it's crucial that it’s developed responsibly. Today, we’re sharing how we’re thinking about safety and security on the path to AGI. → https://t.co/tS21BCo8Er
14
71
361
More details on one of the roles we're hiring for on the GDM safety team https://t.co/EbdGq3uG4R
We are hiring Applied Interpretability researchers on the GDM Mech Interp Team!🧵 If interpretability is ever going to be useful, we need it to be applied at the frontier. Come work with @NeelNanda5, the @GoogleDeepMind AGI Safety team, and me: apply by 28th February as a
1
1
10
We're looking for strong ML researchers and software engineers. You *don't* need to be an expert on AGI safety; we're happy to train you. Learn more: https://t.co/osOxFWneFs Research Engineer role: https://t.co/MWdEZkczCL Research Scientist role:
12
12
204
We're hiring! Join an elite team that sets an AGI safety approach for all of Google -- both through development and implementation of the Frontier Safety Framework (FSF), and through research that enables a future stronger FSF.
11
37
299
New release! Great for a short, high-level overview of a variety of different areas within AGI safety that we're excited about. https://t.co/GZB0OVB6Fd
We are excited to release a short course on AGI safety! The course offers a concise and accessible introduction to AI alignment problems and our technical & governance approaches, consisting of short recorded talks and exercises (75 minutes total).
1
5
30
Big call for proposals from Open Phil! I'd love to see great safety research come out of this that we can import at GDM. I could wish there was more of a focus on *building aligned AI systems* but the areas they do list are important! https://t.co/LWEll4kcsS
🧵 Announcing @open_phil's Technical AI Safety RFP! We're seeking proposals across 21 research areas to help make AI systems more trustworthy, rule-following, and aligned, even as they become more capable.
2
4
44
Now with an approach to deceptive alignment -- first such policy to do so! https://t.co/lE6MpPYnU6
As we make progress towards AGI, developing AI needs to be both innovative and safe. ⚖️ To help ensure this, we’ve made updates to our Frontier Safety Framework - our set of protocols to help us stay ahead of possible severe risks. Find out more → https://t.co/YwtVDqQWW9
2
12
113
New AI safety paper! Introduces MONA, which avoids incentivizing alien long-term plans. This also implies “no long-term RL-induced steganography” (see the loans environment). So you can also think of this as a project about legible chains of thought. https://t.co/xCj4vo3Qn5
New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in🧵
0
13
89
Some nice lessons from our Rater Assist team about how to leverage AI assistance for rating tasks! https://t.co/FpvKoHNMTH
How do we ensure humans can still effectively oversee increasingly powerful AI systems? In our blog, we argue that achieving Human-AI complementarity is an underexplored yet vital piece of this puzzle! And, it’s hard, but we achieved it. 🧵(1/10)
0
2
11
Update: we’re hiring for multiple positions! Join GDM to shape the frontier of AI safety, governance, and strategy. Priority areas: forecasting AI, geopolitics and AGI efforts, FSF risk management, agents, global governance. More details below: 🧵
We are hiring! Google DeepMind's Frontier Safety and Governance team is dedicated to mitigating frontier AI risks; we work closely with technical safety, policy, responsibility, security, and GDM leadership. Please encourage great people to apply! 1/
2
20
144