METR_Evals Profile Banner
METR Profile
METR

@METR_Evals

Followers
7K
Following
31
Media
73
Statuses
199

A research non-profit that develops evaluations to empirically test AI systems for capabilities that could threaten catastrophic harm to society.

Berkeley, CA
Joined September 2023
Don't wanna be here? Send us removal request.
@METR_Evals
METR
5 days
In measurements using our set of multi-step software and reasoning tasks, Claude 4 Opus and Sonnet reach 50%-time-horizon point estimates of about 80 and 65 minutes, respectively.
Tweet media one
8
35
272
@METR_Evals
METR
5 days
RT @CFGeek: We now have an interactive version of the time horizons graph (and the raw data) up on the METR website!
Tweet media one
0
9
0
@METR_Evals
METR
5 days
You can now find most of our measurements at the top of the blog post below in an interactive chart. We plan to keep this view up-to-date, periodically adding to it whenever we have new time-horizon measurements to share.
0
2
53
@METR_Evals
METR
5 days
Note: These point estimates are shorter than o3’s (90 minutes), but the measurements are noisy. In 26% of our bootstrap samples, Claude Opus 4 reaches a higher 50%-time-horizon than o3.
1
1
47
@METR_Evals
METR
6 days
For more details, see our full report here:
0
0
14
@METR_Evals
METR
6 days
As is the case with models we've tested from other companies, we noticed instances of every DeepSeek and Qwen model we evaluated attempting to reward hack. In this example, DeepSeek-R1-0528 attempts to find ways to bypass a password check.
Tweet media one
1
0
24
@METR_Evals
METR
6 days
METR evaluated a series of recent Qwen and DeepSeek models on our software tasks. We found that the best Qwen models from 2024 perform similar to frontier models from 2023, while DeepSeek models from mid-2025 perform close to frontier models from late 2024.
Tweet media one
6
21
147
@METR_Evals
METR
16 days
RT @BethMayBarnes: I had a lot of fun chatting with Rob about METR's work. I stand by my claims here that the world is not on track to keep….
0
34
0
@METR_Evals
METR
24 days
RT @MKinniment: AI agent performance on HCAST & RE-Bench seems to ‘plateau’ as agents are given more ‘time’ to do tasks. The best humans,….
0
7
0
@METR_Evals
METR
1 month
Check out the METR blog for our full post: or subscribe to our blog for more updates
0
2
19
@METR_Evals
METR
1 month
If we eliminate all the reward hacking we can detect, we shouldn’t necessarily be reassured - OpenAI has found that training models not to exploit the task can sometimes cause AIs to simply cheat in more clever ways that are harder for the monitor to detect:.
2
1
20
@METR_Evals
METR
1 month
We already find it hard to understand what the model is doing and whether a high score is due to a clever optimization or a brittle hack. As models get more capable, it will become increasingly difficult to determine what is reward hacking and what is intended behavior.
1
2
15
@METR_Evals
METR
1 month
We may see that just doing more basic alignment training in more diverse environments will generalize well and be sufficient to eliminate attempts to exploit algorithmically-scored tasks. But it’s not obvious we’ll get the generalization we want.
1
0
9
@METR_Evals
METR
1 month
Thus far, it appears RLHF and related techniques have been effective at making models denounce cheating and assert they’d never cheat, but this doesn’t actually constrain their behavior enough to prevent reward hacking from emerging under RL pressure.
2
3
50
@METR_Evals
METR
1 month
We might have hoped we could train the general principle of “only complete the task in the intended way” into AI systems, or that if a model is capable of recognizing that some behavior is not what the user intended then we could train it not to exhibit that behavior.
1
0
15
@METR_Evals
METR
1 month
The AIs generally recognize that these hacks are not the intended way to solve the task when we ask them. They confidently assert that cheating is wrong and they’d never do it. o3 persistently claims it would never cheat on an evaluation and that it isn’t even capable of doing
Tweet media one
1
1
22
@METR_Evals
METR
1 month
However, the clear failure to align this behavior to the developers' intentions is an interesting case study for alignment more generally; the models are failing to do what the user wants even though they understand what the user wants perfectly well.
Tweet media one
1
2
26
@METR_Evals
METR
1 month
Why are we interested in these examples? Reward hacking is an undesirable property for AI systems, but it’s not a major safety concern right now; current AI systems tend to reward-hack in ways that humans can notice and intervene on.
1
0
20
@METR_Evals
METR
1 month
o3 is not the only culprit—we’ve seen this behavior in other models, including Claude Sonnet 3.7, o1, and o1-preview. Here’s Claude “finding a hash collision” by creating two inputs that cause the hash function to hit a bug and return the same error.
Tweet media one
1
0
21
@METR_Evals
METR
1 month
In one instance, the task is to make some code run faster. o3 replaces the real timing functions used by the scorer with a fake version that increments the time by exactly 1 microsecond whenever the scoring function tries to time anything.
Tweet media one
1
1
38