Elizabeth Barnes Profile
Elizabeth Barnes
@BethMayBarnes
Followers: 2K · Following: 57 · Media: 10 · Statuses: 188
Joined July 2014
@BethMayBarnes
Elizabeth Barnes
1 month
I had a lot of fun chatting with Rob about METR's work. I stand by my claims here that the world is not on track to keep risk from AI to an acceptable level, and we desperately need more people working on these problems.
@robertwiblin
Rob Wiblin
1 month
AI models currently have a 50% chance of doing something that takes a human expert one hour. This doubles every 7 months. In 2 years? They could automate full workdays. In 4 years? A full month. I discuss the most important graph in AI today with Beth Barnes, the CEO of METR…
11
34
292
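Taking the figures in the quoted tweet at face value, the extrapolation can be sketched in a few lines. The one-hour starting horizon and seven-month doubling time are the tweet's numbers, not independent data; treat the outputs as illustrative only.

```python
# Back-of-envelope sketch of the extrapolation quoted above. The 1-hour
# starting horizon and 7-month doubling time are taken from the tweet;
# the printed figures are illustrative, not METR's own projections.
DOUBLING_MONTHS = 7
CURRENT_HORIZON_HOURS = 1.0

def horizon_after(months: float) -> float:
    """Projected 50%-success task length (in hours) after `months` of growth."""
    return CURRENT_HORIZON_HOURS * 2 ** (months / DOUBLING_MONTHS)

print(f"2 years: ~{horizon_after(24):.0f} hours")  # roughly a long workday
print(f"4 years: ~{horizon_after(48):.0f} hours")  # on the order of a working month
```

Note that "full workdays" and "a full month" in the tweet are loose glosses of these figures (about 11 and about 116 hours respectively under these assumptions).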
@BethMayBarnes
Elizabeth Barnes
4 months
Persnickety title would be: "there's an exponential trend with doubling time between ~2–12 months on automatically-scoreable, relatively clean + green-field software tasks from a few distributions". More detail on how we thought about external validity in the paper and this thread.
@MKinniment
Megan Kinniment
4 months
Happy for this to be released! It’s the result of a lot of hard work from many of us at METR :) A big question is whether these results apply to ‘real’ tasks. Here’s some thoughts on that:
1
4
85
@BethMayBarnes
Elizabeth Barnes
4 months
RT @AndrewYang: Guys, AI is going to eat a shit ton of jobs. I don’t see anyone really talking about this meaningfully in terms of what to…
0
560
0
@BethMayBarnes
Elizabeth Barnes
4 months
RT @MKinniment: Happy for this to be released! It’s the result of a lot of hard work from many of us at METR :) A big question is whethe…
0
10
0
@BethMayBarnes
Elizabeth Barnes
4 months
We see some evidence that models have worse performance on tasks that include more realistic "messiness", involve working with larger existing codebases, etc.
0
0
21
@BethMayBarnes
Elizabeth Barnes
4 months
There are also reasons we might be overestimating model capabilities. The most important is probably the way our benchmark tasks differ from the real world: they're selected to be automatically scoreable, and easy to set up in a self-contained task environment.
1
0
21
@BethMayBarnes
Elizabeth Barnes
4 months
Specialized AI agents (e.g. Deep Research) can perform much better than generalist agents at narrow tasks. (This might be particularly important if AI labor can be used to automate the process of producing specialized models and scaffolds.)
1
0
13
@BethMayBarnes
Elizabeth Barnes
4 months
Also: A single model has to complete all the tasks, whereas we match humans to the tasks best-fitted to their expertise. This is representative of how humans normally work, but doesn’t give AI agents credit for performing much better than any single human could. There might be…
2
0
14
@BethMayBarnes
Elizabeth Barnes
4 months
Other reasons include: For tasks that both can complete, models are almost always much cheaper, and much faster in wall-clock time, than humans. This also means that there's a lot of headroom to spend more compute at test time if we have ways to productively use it - e.g. BoK.
1
0
9
@BethMayBarnes
Elizabeth Barnes
4 months
In calculating human baseline time, we only use successful baselines. However, a substantial fraction of baseline attempts result in failure. If we use human success rates to estimate the time horizon of our average baseliner, using the same methodology as for models, this comes…
1
0
11
@BethMayBarnes
Elizabeth Barnes
4 months
Some reasons we might be *underestimating* model capabilities include a subtlety around how we calculate human time.
1
0
7
@BethMayBarnes
Elizabeth Barnes
4 months
It’s obvious that there will be other factors affecting difficulty for models that aren’t captured by human time required - the domain, how obscure vs common the task is, what types of reasoning or insight are required, and how much of the time is spent on things like reading.
1
0
10
@BethMayBarnes
Elizabeth Barnes
4 months
We do find that our human time metric is a good predictor of model success rates, with an R² of 0.83. However, it’s not perfect - there’s not a clear cutoff where models succeed at all tasks with length l and below, and fail at all longer tasks.
2
0
12
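One way to picture the 50% time-horizon metric behind that fit: model the success probability as a logistic function of log task length, and read off the length where the curve crosses 0.5. The coefficients below are invented for illustration; they are not METR's fitted values.

```python
import math

# Illustrative sketch of the 50% time-horizon idea: success probability
# is modelled as a logistic function of log2(task length in minutes),
# and the "horizon" is the length where the curve crosses 50%.
# A and B are made-up coefficients, not METR's fit.
A, B = 3.0, 1.5

def p_success(minutes: float) -> float:
    """Modelled probability of completing a task of this human length."""
    return 1.0 / (1.0 + math.exp(-(A - B * math.log2(minutes))))

def horizon_minutes() -> float:
    """Length where modelled success crosses 50%: solve A - B*log2(t) = 0."""
    return 2.0 ** (A / B)
```

With these toy coefficients the horizon comes out to 4 minutes. On real data A and B would be fit to observed per-task success rates, and the quoted R² measures how well (log) human time predicts those rates; the lack of a sharp cutoff shows up as a gradual logistic slope rather than a step function.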
@BethMayBarnes
Elizabeth Barnes
4 months
This hopefully is predictive of agent performance (given that models have likely memorized most of the relevant background information, but won’t have training data on most individual tasks or projects), whilst maintaining an interpretable meaning (it’s hopefully intuitive what a…
1
0
11
@BethMayBarnes
Elizabeth Barnes
4 months
We’ve tried to operationalize the reference human as: a new hire, contractor or consultant; who has no prior knowledge or experience of this particular task/codebase/research question; but has all the relevant background knowledge, and is familiar with any core frameworks / tools.
1
0
16
@BethMayBarnes
Elizabeth Barnes
4 months
In some tasks taken from METR’s real work, contractors with relevant background took 10-15x longer than METR employees, spending a lot of this time reading over the codebase or reading up on the relevant tools.
1
0
15
@BethMayBarnes
Elizabeth Barnes
4 months
It’s unclear how to interpret “time needed for humans”, given that this varies wildly between different people, and is highly sensitive to expertise, existing context and experience with similar tasks. For short tasks especially, it makes a big difference whether “time to get set…
1
0
15
@BethMayBarnes
Elizabeth Barnes
4 months
Some of these are reasons to doubt the overall framing, while others point to ways we may be overestimating or underestimating current or future model capabilities.
1
0
8
@BethMayBarnes
Elizabeth Barnes
4 months
However, there are significant limitations to both the theoretical methodology and the data we were able to collect in practice. Some of these are reasons to doubt the overall framing, while others point to ways we may be overestimating or underestimating current or future model…
1
0
10
@BethMayBarnes
Elizabeth Barnes
4 months
Extrapolating this suggests that within about 5 years we will have generalist AI systems that can autonomously complete ~any software or research engineering task that a human professional could do in a few days, as well as a non-trivial fraction of multi-year projects, with no…
1
1
19
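As a rough check on the "about 5 years" figure, under my own illustrative assumptions (a 1-hour horizon today and a "few days" target of roughly 24 working hours), combined with the ~2-12 month doubling-time range quoted earlier in the thread:

```python
import math

# Back-of-envelope check on the extrapolation above. The 1-hour starting
# horizon and 24-working-hour ("a few days") target are my illustrative
# assumptions; the 2-12 month doubling-time range is quoted earlier
# in the thread.
START_HOURS, TARGET_HOURS = 1.0, 24.0
doublings = math.log2(TARGET_HOURS / START_HOURS)  # about 4.6 doublings needed

for doubling_months in (2, 7, 12):
    years = doublings * doubling_months / 12
    print(f"doubling every {doubling_months:>2} months -> ~{years:.1f} years")
```

The slow end of the quoted doubling-time range lands near the ~5-year figure; faster doubling times would get there much sooner.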