METR @METR_Evals X Profile

METR

@METR_Evals

Followers

7K

Following

31

Media

73

Statuses

199

A research non-profit that develops evaluations to empirically test AI systems for capabilities that could threaten catastrophic harm to society.

Berkeley, CA

Joined September 2023

Don't wanna be here? Send us removal request.

METR

@METR_Evals

5 days

In measurements using our set of multi-step software and reasoning tasks, Claude 4 Opus and Sonnet reach 50%-time-horizon point estimates of about 80 and 65 minutes, respectively.

8

35

272

METR

@METR_Evals

5 days

RT @CFGeek: We now have an interactive version of the time horizons graph (and the raw data) up on the METR website!

0

9

0

METR

@METR_Evals

5 days

You can now find most of our measurements at the top of the blog post below in an interactive chart. We plan to keep this view up-to-date, periodically adding to it whenever we have new time-horizon measurements to share.

0

2

53

METR

@METR_Evals

5 days

Note: These point estimates are shorter than o3’s (90 minutes), but the measurements are noisy. In 26% of our bootstrap samples, Claude Opus 4 reaches a higher 50%-time-horizon than o3.

1

47

METR

@METR_Evals

6 days

For more details, see our full report here:

0

14

METR

@METR_Evals

6 days

As is the case with models we've tested from other companies, we noticed instances of every DeepSeek and Qwen model we evaluated attempting to reward hack. In this example, DeepSeek-R1-0528 attempts to find ways to bypass a password check.

1

0

24

METR

@METR_Evals

6 days

METR evaluated a series of recent Qwen and DeepSeek models on our software tasks. We found that the best Qwen models from 2024 perform similar to frontier models from 2023, while DeepSeek models from mid-2025 perform close to frontier models from late 2024.

6

21

147

METR

@METR_Evals

16 days

RT @BethMayBarnes: I had a lot of fun chatting with Rob about METR's work. I stand by my claims here that the world is not on track to keep….

0

34

0

METR

@METR_Evals

24 days

RT @MKinniment: AI agent performance on HCAST & RE-Bench seems to ‘plateau’ as agents are given more ‘time’ to do tasks. The best humans,….

0

7

0

METR

@METR_Evals

1 month

Check out the METR blog for our full post: or subscribe to our blog for more updates

0

2

19

METR

@METR_Evals

1 month

If we eliminate all the reward hacking we can detect, we shouldn’t necessarily be reassured - OpenAI has found that training models not to exploit the task can sometimes cause AIs to simply cheat in more clever ways that are harder for the monitor to detect:.

2

1

20

METR

@METR_Evals

1 month

We already find it hard to understand what the model is doing and whether a high score is due to a clever optimization or a brittle hack. As models get more capable, it will become increasingly difficult to determine what is reward hacking and what is intended behavior.

1

2

15

METR

@METR_Evals

1 month

We may see that just doing more basic alignment training in more diverse environments will generalize well and be sufficient to eliminate attempts to exploit algorithmically-scored tasks. But it’s not obvious we’ll get the generalization we want.

1

0

9

METR

@METR_Evals

1 month

Thus far, it appears RLHF and related techniques have been effective at making models denounce cheating and assert they’d never cheat, but this doesn’t actually constrain their behavior enough to prevent reward hacking from emerging under RL pressure.

2

3

50

METR

@METR_Evals

1 month

We might have hoped we could train the general principle of “only complete the task in the intended way” into AI systems, or that if a model is capable of recognizing that some behavior is not what the user intended then we could train it not to exhibit that behavior.

1

0

15

METR

@METR_Evals

1 month

The AIs generally recognize that these hacks are not the intended way to solve the task when we ask them. They confidently assert that cheating is wrong and they’d never do it. o3 persistently claims it would never cheat on an evaluation and that it isn’t even capable of doing

1

22

METR

@METR_Evals

1 month

However, the clear failure to align this behavior to the developers' intentions is an interesting case study for alignment more generally; the models are failing to do what the user wants even though they understand what the user wants perfectly well.

1

2

26

METR

@METR_Evals

1 month

Why are we interested in these examples? Reward hacking is an undesirable property for AI systems, but it’s not a major safety concern right now; current AI systems tend to reward-hack in ways that humans can notice and intervene on.

1

0

20

METR

@METR_Evals

1 month

o3 is not the only culprit—we’ve seen this behavior in other models, including Claude Sonnet 3.7, o1, and o1-preview. Here’s Claude “finding a hash collision” by creating two inputs that cause the hash function to hit a bug and return the same error.

1

0

21

METR

@METR_Evals

1 month

In one instance, the task is to make some code run faster. o3 replaces the real timing functions used by the scorer with a fake version that increments the time by exactly 1 microsecond whenever the scoring function tries to time anything.

1

38