rohinmshah Profile Banner
Rohin Shah Profile
Rohin Shah

@rohinmshah

Followers
9K
Following
327
Media
24
Statuses
346

AGI Safety & Alignment @ Google DeepMind

London, UK
Joined October 2017
Don't wanna be here? Send us removal request.
@rohinmshah
Rohin Shah
2 months
Chain of thought monitoring looks valuable enough that we’ve put it in our Frontier Safety Framework to address deceptive alignment. This paper is a good explanation of why we’re optimistic – but also why it may be fragile, and what to do to preserve it. https://t.co/Q6B0XXqmPq
@balesni
Mikita Balesni 🇺🇦
2 months
A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:
Tweet media one
4
9
85
@rohinmshah
Rohin Shah
2 months
Of course, chain-of-thought monitoring won't scale arbitrarily far, so we're continuing to research additional alignment and control techniques that can reduce risks further. Read more in our blog post:
Tweet card summary image
deepmindsafetyresearch.medium.com
By Victoria Krakovna, Scott Emmons, Erik Jenner, Mary Phuong, Lewis Ho, and Rohin Shah
0
1
19
@rohinmshah
Rohin Shah
2 months
Eventually models will become capable of stealth and situational awareness. At that point we expect to rely on chain-of-thought monitoring. Our second paper suggests that it will be difficult for models to evade such monitoring to cause severe harm. https://t.co/0Hgjc8xDPJ
@emmons_scott
Scott Emmons
2 months
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models
Tweet media one
7
15
109
@rohinmshah
Rohin Shah
2 months
Two new papers that elaborate on our approach to deceptive alignment! First paper: we evaluate the model's *stealth* and *situational awareness* -- if they don't have these capabilities, they likely can't cause severe harm. https://t.co/kywkwyH0Ld
@vkrakovna
Victoria Krakovna
2 months
As models advance, a key AI safety concern is deceptive alignment / "scheming" – where AI might covertly pursue unintended goals. Our paper "Evaluating Frontier Models for Stealth and Situational Awareness" assesses whether current models can scheme. https://t.co/PrfZcIVuEw
Tweet media one
2
14
107
@ancadianadragan
Anca Dragan
4 months
Per our Frontier Safety Framework, we continue to test our models for critical capabilities. Here’s the updated model card for Gemini 2.5Pro with frontier safety evaluations + explanation of how our safety buffer / alert thresholds approach applies to 2.0, 2.5, and what’s coming.
2
13
79
@tom4everitt
Tom Everitt
5 months
What if LLMs are sometimes capable of doing a task but don't try hard enough to do it? In a new paper, we use subtasks to assess capabilities. Perhaps surprisingly, LLMs often fail to fully employ their capabilities, i.e. they are not fully *goal-directed* 🧵
Tweet media one
22
45
236
@rohinmshah
Rohin Shah
5 months
We’re excited to contribute to the broader conversation on AGI safety and hope to collaborate with others on these topics. Check out the blog post: https://t.co/7Z0ZrP5lMb Paper (with a 10 page summary at the start):
Tweet card summary image
arxiv.org
Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to...
1
2
21
@rohinmshah
Rohin Shah
5 months
Misalignment measures are much more multifaceted – our many pages are too big to fit in this margin. The approach aims to have two lines of defense – (1) train the model not to pursue misaligned goals and (2) mitigate harm even if the model pursues misaligned goals.
Tweet media one
1
1
16
@rohinmshah
Rohin Shah
5 months
For misuse, the key idea is to block access of those capabilities for bad actors, via safety training, monitoring, and model weight security. The Safety and Security teams have worked especially closely on misuse, and we put it in practice with our Frontier Safety Framework.
Tweet media one
1
0
10
@rohinmshah
Rohin Shah
5 months
There are many kinds of risks with many kinds of mitigations. To guide our thinking, we’ve grouped together different risks based on similarity of mitigations, leading to four areas: misuse, misalignment, mistakes and structural risks.
Tweet media one
3
0
20
@rohinmshah
Rohin Shah
5 months
Just released GDM’s 100+ page approach to AGI safety & security! (Don’t worry, there’s a 10 page summary.) AGI will be transformative. It enables massive benefits, but could also pose risks. Responsible development means proactively preparing for severe harms before they arise.
Tweet media one
@GoogleDeepMind
Google DeepMind
5 months
AGI could revolutionize many fields - from healthcare to education - but it's crucial that it’s developed responsibly. Today, we’re sharing how we’re thinking about safety and security on the path to AGI. → https://t.co/tS21BCo8Er
14
71
361
@rohinmshah
Rohin Shah
7 months
More details on one of the roles we're hiring for on the GDM safety team https://t.co/EbdGq3uG4R
@ArthurConmy
Arthur Conmy
7 months
We are hiring Applied Interpretability researchers on the GDM Mech Interp Team!🧵 If interpretability is ever going to be useful, we need it to be applied at the frontier. Come work with @NeelNanda5, the @GoogleDeepMind AGI Safety team, and me: apply by 28th February as a
1
1
10
@rohinmshah
Rohin Shah
7 months
We're looking for strong ML researchers and software engineers. You *don't* need to be an expert on AGI safety; we're happy to train you. Learn more: https://t.co/osOxFWneFs Research Engineer role: https://t.co/MWdEZkczCL Research Scientist role:
12
12
204
@rohinmshah
Rohin Shah
7 months
We're hiring! Join an elite team that sets an AGI safety approach for all of Google -- both through development and implementation of the Frontier Safety Framework (FSF), and through research that enables a future stronger FSF.
Tweet media one
11
37
299
@rohinmshah
Rohin Shah
7 months
New release! Great for a short, high-level overview of a variety of different areas within AGI safety that we're excited about. https://t.co/GZB0OVB6Fd
@vkrakovna
Victoria Krakovna
7 months
We are excited to release a short course on AGI safety! The course offers a concise and accessible introduction to AI alignment problems and our technical & governance approaches, consisting of short recorded talks and exercises (75 minutes total).
1
5
30
@rohinmshah
Rohin Shah
7 months
Big call for proposals from Open Phil! I'd love to see great safety research come out of this that we can import at GDM. I could wish there was more of a focus on *building aligned AI systems* but the areas they do list are important! https://t.co/LWEll4kcsS
@MaxNadeau_
Max Nadeau
7 months
🧵 Announcing @open_phil's Technical AI Safety RFP! We're seeking proposals across 21 research areas to help make AI systems more trustworthy, rule-following, and aligned, even as they become more capable.
Tweet media one
2
4
44
@rohinmshah
Rohin Shah
7 months
Now with an approach to deceptive alignment -- first such policy to do so! https://t.co/lE6MpPYnU6
@GoogleDeepMind
Google DeepMind
7 months
As we make progress towards AGI, developing AI needs to be both innovative and safe. ⚖️ To help ensure this, we’ve made updates to our Frontier Safety Framework - our set of protocols to help us stay ahead of possible severe risks. Find out more → https://t.co/YwtVDqQWW9
2
12
113
@rohinmshah
Rohin Shah
8 months
New AI safety paper! Introduces MONA, which avoids incentivizing alien long-term plans. This also implies “no long-term RL-induced steganography” (see the loans environment). So you can also think of this as a project about legible chains of thought. https://t.co/xCj4vo3Qn5
@davlindner
David Lindner
8 months
New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in🧵
Tweet media one
0
13
89
@rohinmshah
Rohin Shah
9 months
Some nice lessons from our Rater Assist team about how to leverage AI assistance for rating tasks! https://t.co/FpvKoHNMTH
@shubadubadub
Rishub Jain
9 months
How do we ensure humans can still effectively oversee increasingly powerful AI systems? In our blog, we argue that achieving Human-AI complementarity is an underexplored yet vital piece of this puzzle! And, it’s hard, but we achieved it. 🧵(1/10)
Tweet media one
0
2
11
@AllanDafoe
Allan Dafoe
11 months
Update: we’re hiring for multiple positions! Join GDM to shape the frontier of AI safety, governance, and strategy. Priority areas: forecasting AI, geopolitics and AGI efforts, FSF risk management, agents, global governance. More details below: 🧵
@AllanDafoe
Allan Dafoe
1 year
We are hiring! Google DeepMind's Frontier Safety and Governance team is dedicated to mitigating frontier AI risks; we work closely with technical safety, policy, responsibility, security, and GDM leadership. Please encourage great people to apply! 1/
2
20
144