redwood_ai Profile Banner
Redwood Research Profile
Redwood Research

@redwood_ai

Followers
1K
Following
34
Media
3
Statuses
20

Pioneering threat mitigation and assessment for AI agents.

Berkeley, CA
Joined December 2024
Don't wanna be here? Send us removal request.
@redwood_ai
Redwood Research
4 months
We've released a paper, "Ctrl-Z: Controlling AI Agents via Resampling", where we study AI control for preventing sudden catastrophes in an agent setting, in much more detail than previous work. Website: Paper link:
Tweet card summary image
arxiv.org
Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. Our work presents the first control evaluation...
@bshlgrs
Buck Shlegeris
4 months
We’ve just released the biggest and most intricate study of AI control to date, in a command line agent setting. IMO the techniques studied are the best available option for preventing misaligned early AGIs from causing sudden disasters, e.g. hacking servers they’re working on.
0
0
17
@redwood_ai
Redwood Research
4 months
AI-powered audio narrations of the Redwood Research blog are now available! Subscribe by searching “Redwood Research” on your podcast app.
1
1
38
@grok
Grok
2 days
What do you want to know?.
125
31
312
@redwood_ai
Redwood Research
5 months
Check out this interview with Redwood CEO @bshlgrs!.
@robertwiblin
Rob Wiblin
5 months
Many are skeptical we’ll ever 'solve AI alignment.' If so — are we f**d?. In our interview @bshlgrs argues that there are practical, non-galaxy-brained ways to reduce AI risk without knowing how to align AIs. This is 'AI control': finding ways to use advanced AI without knowing
0
0
8
@redwood_ai
Redwood Research
5 months
Project led by @alextmallen and @c_j_griffin, with collaborators @MishaWagne29322, Alessandro Abate, and @bshlgrs.
0
0
3
@redwood_ai
Redwood Research
5 months
Paper: We implement the evaluation in Inspect-AI and release it open source
Tweet card summary image
github.com
Contribute to redwoodresearch/subversion-strategy-eval development by creating an account on GitHub.
1
0
5
@redwood_ai
Redwood Research
5 months
Evaluating a model's ability to reason about subverting a protocol in-context may allow less conservative control evaluations. E.g. If strategic evaluations show models struggle with consistent output collusion, protocols like untrusted monitoring may be deemed safer.
1
0
1
@redwood_ai
Redwood Research
5 months
Shared planning improves performance in most environments. This may be due to:.(1) allowing the instances to communicate causally in environments that require coordination.(2) reducing the need for *reliability in planning* by correlating the model’s errors.
Tweet media one
1
0
2
@redwood_ai
Redwood Research
5 months
All models we tested generally perform substantially worse than optimal, except in upfront auditing environments. In the paper we show that high performance in some of these cases is likely incidental.
Tweet media one
1
0
2
@redwood_ai
Redwood Research
5 months
In “untrusted monitoring” the AI is monitoring itself, and has to make sure it reports honeypots but not its own attacks.
1
0
2
@redwood_ai
Redwood Research
5 months
For example, in the “upfront auditing” environment, the AI is given 100 sequential problems and has to eventually attack at least 1. The first 10 are audited, but the AI doesn’t ever know which problem it’s in.
1
0
1
@redwood_ai
Redwood Research
5 months
Building on our prior work, we evaluate models, including helpful-only Claude 3.5 sonnet and o3-mini, in 8 toy simulated deployments for their ability to strategically choose high-level actions to subvert control protocols *statelessly* (without shared memory between contexts).
1
0
2
@redwood_ai
Redwood Research
5 months
🧵How plausible is it that a misaligned LLM could strategize to subvert the safety measures of a deployment while operating without persistent memory?. In our most recent paper, we release Subversion Strategy Eval to measure this capability against control protocols.
Tweet media one
3
6
38
@redwood_ai
Redwood Research
6 months
RT @bshlgrs: 📷 Announcing ControlConf: The world’s first conference dedicated to AI control - techniques to mitigate security risks from AI….
0
38
0
@redwood_ai
Redwood Research
6 months
We're very excited for the AI control team that UK AISI is putting together.
@j_asminewang
Jasmine
6 months
I’m leading a new team at AISI focused on control empirics. We’re hiring research engineers and research scientists, and you should join us!.
0
0
4
@redwood_ai
Redwood Research
7 months
RT @joshua_clymer: In a recent post, I take a stab at a big question:. What should a responsible AI developer do in a short-timeline scenar….
0
6
0
@redwood_ai
Redwood Research
7 months
RT @joshua_clymer: Capability evaluations are currently the bread and butter of AI risk assessment, but the assumptions behind them are not….
0
1
0
@redwood_ai
Redwood Research
7 months
RT @bshlgrs: (1/2) New post: why does the control research I've done make conservative assumptions about the AIs we're trying to control, e….
0
1
0
@redwood_ai
Redwood Research
7 months
In a new blog post, Josh Clymer discusses adversarial evaluations for threat models other than AIs that are consistently misaligned.
@joshua_clymer
Joshua Clymer
7 months
The research area of "AI control" traditionally addresses 'scheming' threats, but it points to a deeper idea. I've written a blog post that explains how to apply (and modify) the concepts from control to address jailbroken models, rare misalignment, data poisoning, etc.
0
0
2
@redwood_ai
Redwood Research
8 months
RT @bshlgrs: 🧵 New paper: Previous work on AI control focused on whether models can execute attack strategies. In our new work, we assess t….
0
4
0
@redwood_ai
Redwood Research
8 months
RT @RyanPGreenblatt: New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes a….
0
44
0