
Redwood Research (@redwood_ai)
Pioneering threat mitigation and assessment for AI agents.
Berkeley, CA · Joined December 2024
Followers: 1K · Following: 34 · Media: 3 · Statuses: 20
We've released a paper, "Ctrl-Z: Controlling AI Agents via Resampling", where we study AI control for preventing sudden catastrophes in an agent setting, in much more detail than previous work.
Paper (arxiv.org): "Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. Our work presents the first control evaluation..."
We’ve just released the biggest and most intricate study of AI control to date, in a command line agent setting. IMO the techniques studied are the best available option for preventing misaligned early AGIs from causing sudden disasters, e.g. hacking servers they’re working on.
0 replies · 0 reposts · 17 likes
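The resampling idea at the heart of the paper can be summarized in a few lines. Below is a minimal sketch of a resampling-style control protocol; the `propose` and `suspicion` callables, the thresholds, and the sample count are all illustrative placeholders we made up for this sketch, not the paper's actual interfaces or settings.

```python
# Sketch of a resampling-style control protocol, in the spirit of "Ctrl-Z".
# All names and threshold values here are illustrative, not from the paper.
from typing import Callable, List, Tuple

def resample_step(
    propose: Callable[[], str],          # draws a candidate command from the untrusted agent
    suspicion: Callable[[str], float],   # trusted monitor's suspicion score in [0, 1]
    resample_threshold: float = 0.7,     # resample when the first draft looks suspicious
    audit_threshold: float = 0.9,        # escalate when every draft stays suspicious
    n_resamples: int = 3,
) -> Tuple[str, object]:
    """Decide what to do with one proposed agent action."""
    first = propose()
    score = suspicion(first)
    if score < resample_threshold:
        return "execute", first

    # Draw fresh samples for the same step. An attacking model has to keep
    # choosing the attack across samples, which concentrates evidence.
    samples: List[Tuple[str, float]] = [(first, score)]
    for _ in range(n_resamples):
        alt = propose()
        samples.append((alt, suspicion(alt)))

    if min(s for _, s in samples) >= audit_threshold:
        return "audit", samples          # hand the step to human review / shutdown
    best, _ = min(samples, key=lambda pair: pair[1])
    return "execute", best               # run the least suspicious sample instead
```

The design intuition: a single suspicious draft is cheap to replace, while an agent that keeps proposing the same suspicious action across independent resamples supplies much stronger evidence for an audit.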
Check out this interview with Redwood CEO @bshlgrs!
Many are skeptical we'll ever 'solve AI alignment.' If so, are we f**d? In our interview @bshlgrs argues that there are practical, non-galaxy-brained ways to reduce AI risk without knowing how to align AIs. This is 'AI control': finding ways to use advanced AI without knowing…
0 replies · 0 reposts · 8 likes
Project led by @alextmallen and @c_j_griffin, with collaborators @MishaWagne29322, Alessandro Abate, and @bshlgrs.
0 replies · 0 reposts · 3 likes
We implement the evaluation in Inspect-AI and release it open source:
github.com/redwoodresearch/subversion-strategy-eval
1 reply · 0 reposts · 5 likes
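For readers unfamiliar with Inspect-AI (the open-source evaluation framework), a task in it looks roughly like the toy sketch below, assuming a recent release of the `inspect_ai` package. This is our own minimal example of the framework's shape, with made-up task content; it is not the released subversion-strategy eval, which lives in the linked repo.

```python
# Toy Inspect-AI task, to show the shape of an evaluation in the framework.
# This example is illustrative only, not the released subversion-strategy eval.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def toy_control_eval() -> Task:
    return Task(
        dataset=[
            Sample(
                input="Reply with exactly one word: comply",
                target="comply",
            )
        ],
        solver=generate(),  # just query the model; real control evals add monitoring protocols
        scorer=match(),     # grade by matching the model's answer against the target
    )

# Run from the command line, e.g.:
#   inspect eval toy_control_eval.py --model openai/gpt-4o-mini
```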
RT @bshlgrs: Announcing ControlConf: The world's first conference dedicated to AI control - techniques to mitigate security risks from AI…
0 replies · 38 reposts · 0 likes
RT @joshua_clymer: In a recent post, I take a stab at a big question: What should a responsible AI developer do in a short-timeline scenar…
0 replies · 6 reposts · 0 likes
RT @joshua_clymer: Capability evaluations are currently the bread and butter of AI risk assessment, but the assumptions behind them are not…
0 replies · 1 repost · 0 likes
RT @bshlgrs: (1/2) New post: why does the control research I've done make conservative assumptions about the AIs we're trying to control, e…
0 replies · 1 repost · 0 likes
In a new blog post, Josh Clymer discusses adversarial evaluations for threat models other than AIs that are consistently misaligned.
The research area of "AI control" traditionally addresses 'scheming' threats, but it points to a deeper idea. I've written a blog post that explains how to apply (and modify) the concepts from control to address jailbroken models, rare misalignment, data poisoning, etc.
0 replies · 0 reposts · 2 likes
RT @bshlgrs: 🧵 New paper: Previous work on AI control focused on whether models can execute attack strategies. In our new work, we assess t…
0 replies · 4 reposts · 0 likes
RT @RyanPGreenblatt: New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes a…
0 replies · 44 reposts · 0 likes