Elizabeth Barnes Profile
Elizabeth Barnes
@BethMayBarnes
Followers: 2K · Following: 57 · Media: 10 · Statuses: 188
Joined July 2014
@BethMayBarnes
Elizabeth Barnes
1 month
I had a lot of fun chatting with Rob about METR's work. I stand by my claims here that the world is not on track to keep risk from AI to an acceptable level, and we desperately need more people working on these problems.
@robertwiblin
Rob Wiblin
1 month
AI models currently have a 50% chance of doing something that takes a human expert one hour. This doubles every 7 months. In 2 years? They could automate full workdays. In 4 years? A full month. I discuss the most important graph in AI today with Beth Barnes, the CEO of METR…
11
34
292
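Taking the figures in the quoted tweet at face value, the extrapolation can be sketched in a few lines. The one-hour starting horizon and seven-month doubling time are the tweet's numbers, not independent data; treat the outputs as illustrative only.

```python
# Back-of-envelope sketch of the extrapolation quoted above. The 1-hour
# starting horizon and 7-month doubling time are taken from the tweet;
# the printed figures are illustrative, not METR's own projections.
DOUBLING_MONTHS = 7
CURRENT_HORIZON_HOURS = 1.0

def horizon_after(months: float) -> float:
    """Projected 50%-success task length (in hours) after `months` of growth."""
    return CURRENT_HORIZON_HOURS * 2 ** (months / DOUBLING_MONTHS)

print(f"2 years: ~{horizon_after(24):.0f} hours")  # roughly a long workday
print(f"4 years: ~{horizon_after(48):.0f} hours")  # on the order of a working month
```

Note that "full workdays" and "a full month" in the tweet are loose glosses of these figures (about 11 and about 116 hours respectively under these assumptions).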
@BethMayBarnes
Elizabeth Barnes
4 months
Persnickety title would be: "there's an exponential trend with doubling time between ~2–12 months on automatically-scoreable, relatively clean + green-field software tasks from a few distributions". More detail on how we thought about external validity in the paper and this thread.
@MKinniment
Megan Kinniment
4 months
Happy for this to be released! It’s the result of a lot of hard work from many of us at METR :) A big question is whether these results apply to ‘real’ tasks. Here’s some thoughts on that:
1
4
85
@BethMayBarnes
Elizabeth Barnes
4 months
RT @AndrewYang: Guys, AI is going to eat a shit ton of jobs. I don’t see anyone really talking about this meaningfully in terms of what to…
0
560
0
@BethMayBarnes
Elizabeth Barnes
4 months
RT @MKinniment: Happy for this to be released! It’s the result of a lot of hard work from many of us at METR :) A big question is whethe…
0
10
0
@BethMayBarnes
Elizabeth Barnes
4 months
We see some evidence that models have worse performance on tasks that include more realistic "messiness", involve working with larger existing codebases, etc.
0
0
21
@BethMayBarnes
Elizabeth Barnes
4 months
There are also reasons we might be overestimating model capabilities. The most important is probably the way our benchmark tasks differ from the real world: they're selected to be automatically scoreable, and easy to set up in a self-contained task environment.
1
0
21
@BethMayBarnes
Elizabeth Barnes
4 months
Specialized AI agents (e.g. Deep Research) can perform much better than generalist agents at narrow tasks. (This might be particularly important if AI labor can be used to automate the process of producing specialized models and scaffolds.)
1
0
13
@BethMayBarnes
Elizabeth Barnes
4 months
Also: A single model has to complete all the tasks, whereas we match humans to the tasks best-fitted to their expertise. This is representative of how humans normally work, but doesn’t give AI agents credit for performing much better than any single human could. There might be…
2
0
14
@BethMayBarnes
Elizabeth Barnes
4 months
Other reasons include: For tasks that both can complete, models are almost always much cheaper, and much faster in wall-clock time, than humans. This also means that there's a lot of headroom to spend more compute at test time if we have ways to productively use it - e.g. BoK.
1
0
9
@BethMayBarnes
Elizabeth Barnes
4 months
In calculating human baseline time, we only use successful baselines. However, a substantial fraction of baseline attempts result in failure. If we use human success rates to estimate the time horizon of our average baseliner, using the same methodology as for models, this comes…
1
0
11
@BethMayBarnes
Elizabeth Barnes
4 months
Some reasons we might be *underestimating* model capabilities include a subtlety around how we calculate human time.
1
0
7
@BethMayBarnes
Elizabeth Barnes
4 months
It’s obvious that there will be other factors affecting difficulty for models that aren’t captured by human time required - the domain, how obscure vs common the task is, what types of reasoning or insight are required, and how much of the time is spent on things like reading.
1
0
10
@BethMayBarnes
Elizabeth Barnes
4 months
We do find that our human time metric is a good predictor of model success rates, with an R² of 0.83. However, it’s not perfect - there’s not a clear cutoff where models succeed at all tasks with length l and below, and fail at all longer tasks.
2
0
12
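One way to picture the 50% time-horizon metric behind that fit: model the success probability as a logistic function of log task length, and read off the length where the curve crosses 0.5. The coefficients below are invented for illustration; they are not METR's fitted values.

```python
import math

# Illustrative sketch of the 50% time-horizon idea: success probability
# is modelled as a logistic function of log2(task length in minutes),
# and the "horizon" is the length where the curve crosses 50%.
# A and B are made-up coefficients, not METR's fit.
A, B = 3.0, 1.5

def p_success(minutes: float) -> float:
    """Modelled probability of completing a task of this human length."""
    return 1.0 / (1.0 + math.exp(-(A - B * math.log2(minutes))))

def horizon_minutes() -> float:
    """Length where modelled success crosses 50%: solve A - B*log2(t) = 0."""
    return 2.0 ** (A / B)
```

With these toy coefficients the horizon comes out to 4 minutes. On real data A and B would be fit to observed per-task success rates, and the quoted R² measures how well (log) human time predicts those rates; the lack of a sharp cutoff shows up as a gradual logistic slope rather than a step function.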
@BethMayBarnes
Elizabeth Barnes
4 months
This hopefully is predictive of agent performance (given that models have likely memorized most of the relevant background information, but won’t have training data on most individual tasks or projects), whilst maintaining an interpretable meaning (it’s hopefully intuitive what a…
1
0
11
@BethMayBarnes
Elizabeth Barnes
4 months
We’ve tried to operationalize the reference human as: a new hire, contractor or consultant; who has no prior knowledge or experience of this particular task/codebase/research question; but has all the relevant background knowledge, and is familiar with any core frameworks / tools.
1
0
16
@BethMayBarnes
Elizabeth Barnes
4 months
In some tasks taken from METR’s real work, contractors with relevant background took 10-15x longer than METR employees, spending a lot of this time reading over the codebase or reading up on the relevant tools.
1
0
15
@BethMayBarnes
Elizabeth Barnes
4 months
It’s unclear how to interpret “time needed for humans”, given that this varies wildly between different people, and is highly sensitive to expertise, existing context and experience with similar tasks. For short tasks especially, it makes a big difference whether “time to get set…
1
0
15
@BethMayBarnes
Elizabeth Barnes
4 months
Some of these are reasons to doubt the overall framing, while others point to ways we may be overestimating or underestimating current or future model capabilities.
1
0
8
@BethMayBarnes
Elizabeth Barnes
4 months
However, there are significant limitations to both the theoretical methodology and the data we were able to collect in practice. Some of these are reasons to doubt the overall framing, while others point to ways we may be overestimating or underestimating current or future model…
1
0
10
@BethMayBarnes
Elizabeth Barnes
4 months
Extrapolating this suggests that within about 5 years we will have generalist AI systems that can autonomously complete ~any software or research engineering task that a human professional could do in a few days, as well as a non-trivial fraction of multi-year projects, with no…
1
1
19
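As a rough check on the "about 5 years" figure, under my own illustrative assumptions (a 1-hour horizon today and a "few days" target of roughly 24 working hours), combined with the ~2-12 month doubling-time range quoted earlier in the thread:

```python
import math

# Back-of-envelope check on the extrapolation above. The 1-hour starting
# horizon and 24-working-hour ("a few days") target are my illustrative
# assumptions; the 2-12 month doubling-time range is quoted earlier
# in the thread.
START_HOURS, TARGET_HOURS = 1.0, 24.0
doublings = math.log2(TARGET_HOURS / START_HOURS)  # about 4.6 doublings needed

for doubling_months in (2, 7, 12):
    years = doublings * doubling_months / 12
    print(f"doubling every {doubling_months:>2} months -> ~{years:.1f} years")
```

The slow end of the quoted doubling-time range lands near the ~5-year figure; faster doubling times would get there much sooner.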