Rohin Shah @rohinmshah X Profile

Rohin Shah

@rohinmshah

Followers

9K

Following

327

Media

24

Statuses

346

AGI Safety & Alignment @ Google DeepMind

https://t.co/bFwHovcNUC

London, UK

Joined October 2017

Don't wanna be here? Send us removal request.

Rohin Shah

@rohinmshah

2 months

Chain of thought monitoring looks valuable enough that we’ve put it in our Frontier Safety Framework to address deceptive alignment. This paper is a good explanation of why we’re optimistic – but also why it may be fragile, and what to do to preserve it. https://t.co/Q6B0XXqmPq

Mikita Balesni 🇺🇦

@balesni

2 months

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:

4

9

85

Rohin Shah

@rohinmshah

2 months

Of course, chain-of-thought monitoring won't scale arbitrarily far, so we're continuing to research additional alignment and control techniques that can reduce risks further. Read more in our blog post:

deepmindsafetyresearch.medium.com

By Victoria Krakovna, Scott Emmons, Erik Jenner, Mary Phuong, Lewis Ho, and Rohin Shah

0

1

19

Rohin Shah

@rohinmshah

2 months

Eventually models will become capable of stealth and situational awareness. At that point we expect to rely on chain-of-thought monitoring. Our second paper suggests that it will be difficult for models to evade such monitoring to cause severe harm. https://t.co/0Hgjc8xDPJ

Scott Emmons

@emmons_scott

2 months

Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models

7

15

109

Rohin Shah

@rohinmshah

2 months

Two new papers that elaborate on our approach to deceptive alignment! First paper: we evaluate the model's *stealth* and *situational awareness* -- if they don't have these capabilities, they likely can't cause severe harm. https://t.co/kywkwyH0Ld

Victoria Krakovna

@vkrakovna

2 months

As models advance, a key AI safety concern is deceptive alignment / "scheming" – where AI might covertly pursue unintended goals. Our paper "Evaluating Frontier Models for Stealth and Situational Awareness" assesses whether current models can scheme. https://t.co/PrfZcIVuEw

2

14

107

Anca Dragan

@ancadianadragan

4 months

Per our Frontier Safety Framework, we continue to test our models for critical capabilities. Here’s the updated model card for Gemini 2.5Pro with frontier safety evaluations + explanation of how our safety buffer / alert thresholds approach applies to 2.0, 2.5, and what’s coming.

2

13

79

Tom Everitt

@tom4everitt

5 months

What if LLMs are sometimes capable of doing a task but don't try hard enough to do it? In a new paper, we use subtasks to assess capabilities. Perhaps surprisingly, LLMs often fail to fully employ their capabilities, i.e. they are not fully *goal-directed* 🧵

22

45

236

Rohin Shah

@rohinmshah

5 months

We’re excited to contribute to the broader conversation on AGI safety and hope to collaborate with others on these topics. Check out the blog post: https://t.co/7Z0ZrP5lMb Paper (with a 10 page summary at the start):

arxiv.org

Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to...

1

2

21

Rohin Shah

@rohinmshah

5 months

Misalignment measures are much more multifaceted – our many pages are too big to fit in this margin. The approach aims to have two lines of defense – (1) train the model not to pursue misaligned goals and (2) mitigate harm even if the model pursues misaligned goals.

1

16

Rohin Shah

@rohinmshah

5 months

For misuse, the key idea is to block access of those capabilities for bad actors, via safety training, monitoring, and model weight security. The Safety and Security teams have worked especially closely on misuse, and we put it in practice with our Frontier Safety Framework.

1

0

10

Rohin Shah

@rohinmshah

5 months

There are many kinds of risks with many kinds of mitigations. To guide our thinking, we’ve grouped together different risks based on similarity of mitigations, leading to four areas: misuse, misalignment, mistakes and structural risks.

3

0

20

Rohin Shah

@rohinmshah

5 months

Just released GDM’s 100+ page approach to AGI safety & security! (Don’t worry, there’s a 10 page summary.) AGI will be transformative. It enables massive benefits, but could also pose risks. Responsible development means proactively preparing for severe harms before they arise.

Google DeepMind

@GoogleDeepMind

5 months

AGI could revolutionize many fields - from healthcare to education - but it's crucial that it’s developed responsibly. Today, we’re sharing how we’re thinking about safety and security on the path to AGI. → https://t.co/tS21BCo8Er

14

71

361

Rohin Shah

@rohinmshah

7 months

More details on one of the roles we're hiring for on the GDM safety team https://t.co/EbdGq3uG4R

Arthur Conmy

@ArthurConmy

7 months

We are hiring Applied Interpretability researchers on the GDM Mech Interp Team!🧵 If interpretability is ever going to be useful, we need it to be applied at the frontier. Come work with @NeelNanda5, the @GoogleDeepMind AGI Safety team, and me: apply by 28th February as a

1

10

Rohin Shah

@rohinmshah

7 months

We're looking for strong ML researchers and software engineers. You *don't* need to be an expert on AGI safety; we're happy to train you. Learn more: https://t.co/osOxFWneFs Research Engineer role: https://t.co/MWdEZkczCL Research Scientist role:

12

204

Rohin Shah

@rohinmshah

7 months

We're hiring! Join an elite team that sets an AGI safety approach for all of Google -- both through development and implementation of the Frontier Safety Framework (FSF), and through research that enables a future stronger FSF.

11

37

299

Rohin Shah

@rohinmshah

7 months

New release! Great for a short, high-level overview of a variety of different areas within AGI safety that we're excited about. https://t.co/GZB0OVB6Fd

Victoria Krakovna

@vkrakovna

7 months

We are excited to release a short course on AGI safety! The course offers a concise and accessible introduction to AI alignment problems and our technical & governance approaches, consisting of short recorded talks and exercises (75 minutes total).

1

5

30

Rohin Shah

@rohinmshah

7 months

Big call for proposals from Open Phil! I'd love to see great safety research come out of this that we can import at GDM. I could wish there was more of a focus on *building aligned AI systems* but the areas they do list are important! https://t.co/LWEll4kcsS

Max Nadeau

@MaxNadeau_

7 months

🧵 Announcing @open_phil's Technical AI Safety RFP! We're seeking proposals across 21 research areas to help make AI systems more trustworthy, rule-following, and aligned, even as they become more capable.

2

4

44

Rohin Shah

@rohinmshah

7 months

Now with an approach to deceptive alignment -- first such policy to do so! https://t.co/lE6MpPYnU6

Google DeepMind

@GoogleDeepMind

7 months

As we make progress towards AGI, developing AI needs to be both innovative and safe. ⚖️ To help ensure this, we’ve made updates to our Frontier Safety Framework - our set of protocols to help us stay ahead of possible severe risks. Find out more → https://t.co/YwtVDqQWW9

2

12

113

Rohin Shah

@rohinmshah

8 months

New AI safety paper! Introduces MONA, which avoids incentivizing alien long-term plans. This also implies “no long-term RL-induced steganography” (see the loans environment). So you can also think of this as a project about legible chains of thought. https://t.co/xCj4vo3Qn5

David Lindner

@davlindner

8 months

New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in🧵

0

13

89

Rohin Shah

@rohinmshah

9 months

Some nice lessons from our Rater Assist team about how to leverage AI assistance for rating tasks! https://t.co/FpvKoHNMTH

Rishub Jain

@shubadubadub

9 months

How do we ensure humans can still effectively oversee increasingly powerful AI systems? In our blog, we argue that achieving Human-AI complementarity is an underexplored yet vital piece of this puzzle! And, it’s hard, but we achieved it. 🧵(1/10)

0

2

11

Allan Dafoe

@AllanDafoe

11 months

Update: we’re hiring for multiple positions! Join GDM to shape the frontier of AI safety, governance, and strategy. Priority areas: forecasting AI, geopolitics and AGI efforts, FSF risk management, agents, global governance. More details below: 🧵

Allan Dafoe

@AllanDafoe

1 year

We are hiring! Google DeepMind's Frontier Safety and Governance team is dedicated to mitigating frontier AI risks; we work closely with technical safety, policy, responsibility, security, and GDM leadership. Please encourage great people to apply! 1/

2

20

144