
Redwood Research (@redwood_ai)
Pioneering threat mitigation and assessment for AI agents.
Berkeley, CA · Joined December 2024
Followers: 1K · Following: 34 · Media: 3 · Statuses: 20
We've released a paper, "Ctrl-Z: Controlling AI Agents via Resampling", where we study AI control for preventing sudden catastrophes in an agent setting, in much more detail than previous work.
Paper (arxiv.org): "Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. Our work presents the first control evaluation..."
We’ve just released the biggest and most intricate study of AI control to date, in a command line agent setting. IMO the techniques studied are the best available option for preventing misaligned early AGIs from causing sudden disasters, e.g. hacking servers they’re working on.
0 replies · 0 reposts · 17 likes
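The resampling idea at the heart of the paper can be summarized in a few lines. Below is a minimal sketch of a resampling-style control protocol; the `propose` and `suspicion` callables, the thresholds, and the sample count are all illustrative placeholders we made up for this sketch, not the paper's actual interfaces or settings.

```python
# Sketch of a resampling-style control protocol, in the spirit of "Ctrl-Z".
# All names and threshold values here are illustrative, not from the paper.
from typing import Callable, List, Tuple

def resample_step(
    propose: Callable[[], str],          # draws a candidate command from the untrusted agent
    suspicion: Callable[[str], float],   # trusted monitor's suspicion score in [0, 1]
    resample_threshold: float = 0.7,     # resample when the first draft looks suspicious
    audit_threshold: float = 0.9,        # escalate when every draft stays suspicious
    n_resamples: int = 3,
) -> Tuple[str, object]:
    """Decide what to do with one proposed agent action."""
    first = propose()
    score = suspicion(first)
    if score < resample_threshold:
        return "execute", first

    # Draw fresh samples for the same step. An attacking model has to keep
    # choosing the attack across samples, which concentrates evidence.
    samples: List[Tuple[str, float]] = [(first, score)]
    for _ in range(n_resamples):
        alt = propose()
        samples.append((alt, suspicion(alt)))

    if min(s for _, s in samples) >= audit_threshold:
        return "audit", samples          # hand the step to human review / shutdown
    best, _ = min(samples, key=lambda pair: pair[1])
    return "execute", best               # run the least suspicious sample instead
```

The design intuition: a single suspicious draft is cheap to replace, while an agent that keeps proposing the same suspicious action across independent resamples supplies much stronger evidence for an audit.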
Check out this interview with Redwood CEO @bshlgrs!
Many are skeptical we'll ever 'solve AI alignment.' If so, are we f**d? In our interview @bshlgrs argues that there are practical, non-galaxy-brained ways to reduce AI risk without knowing how to align AIs. This is 'AI control': finding ways to use advanced AI without knowing…
0 replies · 0 reposts · 8 likes
Project led by @alextmallen and @c_j_griffin, with collaborators @MishaWagne29322, Alessandro Abate, and @bshlgrs.
0 replies · 0 reposts · 3 likes
We implement the evaluation in Inspect-AI and release it open source:
github.com/redwoodresearch/subversion-strategy-eval
1 reply · 0 reposts · 5 likes
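For readers unfamiliar with Inspect-AI (the open-source evaluation framework), a task in it looks roughly like the toy sketch below, assuming a recent release of the `inspect_ai` package. This is our own minimal example of the framework's shape, with made-up task content; it is not the released subversion-strategy eval, which lives in the linked repo.

```python
# Toy Inspect-AI task, to show the shape of an evaluation in the framework.
# This example is illustrative only, not the released subversion-strategy eval.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def toy_control_eval() -> Task:
    return Task(
        dataset=[
            Sample(
                input="Reply with exactly one word: comply",
                target="comply",
            )
        ],
        solver=generate(),  # just query the model; real control evals add monitoring protocols
        scorer=match(),     # grade by matching the model's answer against the target
    )

# Run from the command line, e.g.:
#   inspect eval toy_control_eval.py --model openai/gpt-4o-mini
```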
RT @bshlgrs: Announcing ControlConf: The world's first conference dedicated to AI control - techniques to mitigate security risks from AI…
0 replies · 38 reposts · 0 likes
RT @joshua_clymer: In a recent post, I take a stab at a big question: What should a responsible AI developer do in a short-timeline scenar…
0 replies · 6 reposts · 0 likes
RT @joshua_clymer: Capability evaluations are currently the bread and butter of AI risk assessment, but the assumptions behind them are not…
0 replies · 1 repost · 0 likes
RT @bshlgrs: (1/2) New post: why does the control research I've done make conservative assumptions about the AIs we're trying to control, e…
0 replies · 1 repost · 0 likes
In a new blog post, Josh Clymer discusses adversarial evaluations for threat models other than AIs that are consistently misaligned.
The research area of "AI control" traditionally addresses 'scheming' threats, but it points to a deeper idea. I've written a blog post that explains how to apply (and modify) the concepts from control to address jailbroken models, rare misalignment, data poisoning, etc.
0 replies · 0 reposts · 2 likes
RT @bshlgrs: 🧵 New paper: Previous work on AI control focused on whether models can execute attack strategies. In our new work, we assess t…
0 replies · 4 reposts · 0 likes
RT @RyanPGreenblatt: New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes a…
0 replies · 44 reposts · 0 likes