Jacob Pfau
@jacob_pfau
2K Followers · 28K Following · 57 Media · 786 Statuses
Alignment at UKAISI and PhD student at NYU
London
Joined June 2019
An interesting background question here is whether, for internal deployment purposes, capability improvements are becoming more or less continuous.
If they also know when to revisit their previous work, they can then work over arbitrary time horizons.
Ofc the METR methodology will break down before this, so empirically it's not very useful. But conceptually, there will be some point where models can reliably take over R&D work--training, new infra (MCPs, caching protocols...).
Then by modeling some ability to regenerate, or continuously deploy model generations, you can predict this point. Surprised I haven't seen this mentioned before, has someone written about this? The only thing that comes to mind is @TomDavidsonX 's SWE intelligence explosion.
I've never been compelled by 'continual learning' as a bottleneck, but I do like thinking in terms of time horizons. Here's a time-horizons spin on continual learning: Escape velocity: the point at which models improve at more than unit dh/dt, i.e. gain more than one hour of task horizon h per hour of wall-clock time t.
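A minimal sketch of this escape-velocity condition, assuming horizons grow exponentially with a fixed doubling time (the 2-hour horizon and 7-month doubling time in the example are hypothetical, not the tweet's claims):

```python
import math

def escape_velocity_time(h0_hours, doubling_hours):
    """Wall-clock hours until dh/dt >= 1, assuming exponential growth
    h(t) = h0 * 2**(t / T) with doubling time T. Since
    dh/dt = (ln 2 / T) * h(t), escape velocity is reached once
    h(t) >= T / ln 2."""
    h_escape = doubling_hours / math.log(2)
    if h0_hours >= h_escape:
        return 0.0  # already past escape velocity
    return doubling_hours * math.log2(h_escape / h0_hours)

# e.g. a 2-hour horizon doubling every ~7 months (~5110 hours)
t_escape = escape_velocity_time(2, 7 * 730)
```

One design note: under pure exponential growth the condition dh/dt >= 1 reduces to a threshold on h itself (h >= T / ln 2), so a shorter doubling time lowers the horizon needed for escape.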
It's difficult to correct for this and get the SotA scaffold for all datapoints. Ideally @METR_Evals could plot both the fixed- and SotA-scaffold points going forwards so we can check the significance of this methodological choice.
Labs are now putting substantial effort into optimizing (multi-)agentic scaffolds. I expect METR's fixed-scaffold time horizon estimates will increasingly underestimate capabilities.
We estimate that Claude Sonnet 4.5 has a 50%-time-horizon of around 1 hr 53 min (95% confidence interval of 50 to 235 minutes) on our agentic multi-step software engineering tasks. This estimate is lower than the current highest time-horizon point estimate of around 2 hr 15 min.
The application link: https://t.co/zFs4GrERli The MATS page has more info on our stream: https://t.co/cdma5D63Zg 3/3
matsprogram.org
A (non-exhaustive) list of topics I'm interested in supervising on is here https://t.co/YgQy1Nf4Hw 2/3
alignmentproject.aisi.gov.uk
Stress-test AI agents and prove when they can’t game, sandbag or exploit rewards.
Apply to work with me and @ihsgnef through MATS this Winter! Deadline is Oct 2. We're focused on scalable oversight: methods, evals, safety-casing. 1/3
Type of guy who excitedly shows his friends the posterior over possible photos taken during his vacation
I'm curious what the right constraint is to minimize this unwanted effect. The desiderata feel similar to those for distribution-shift-robust methods--but AFAIK these don't work very well. Probably we just have to wait for scale to solve this. In the meantime:
Historically, pixels were inferred--but of me! I.e. the algorithm was (almost) independent of the distribution of humans in iPhone photos. Not so for approximate, learned denoisers. The resulting photo is no longer purely of me; it is also of others.
The debate around whether every pixel in a photo from your phone's camera is "real" misses a fundamental fact about how digital cameras have worked for the last 20 years. The camera sensor only captures ONE color (red, green, or blue) per pixel. The rest are made up 1/4
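The "almost content-independent" inference the tweet describes is classical demosaicing. As a toy sketch, here is bilinear interpolation of the green channel from an RGGB Bayer mosaic; the function name and pattern layout are illustrative, not any phone's actual pipeline:

```python
import numpy as np

def demosaic_green_bilinear(bayer):
    """Recover a full green channel from an RGGB Bayer mosaic by
    averaging the measured green neighbours at each red/blue site.
    The rule is fixed arithmetic, (almost) independent of what the
    photo depicts--unlike a learned denoiser."""
    h, w = bayer.shape
    green = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            if (y + x) % 2 == 1:           # green sites in RGGB
                green[y, x] = bayer[y, x]
            else:                           # red/blue sites: interpolate
                nbrs = [bayer[y + dy, x + dx]
                        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                        if 0 <= y + dy < h and 0 <= x + dx < w]
                green[y, x] = sum(nbrs) / len(nbrs)
    return green
```

A learned denoiser would replace the fixed neighbour average with a function fit to a photo distribution, which is exactly where the "photo of others" concern enters.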
🧵 New paper from @AISecurityInst x @AiEleuther that I led with Kyle O’Brien: Open-weight LLM safety is both important & neglected. But we show that filtering dual-use knowledge from pre-training data improves tamper resistance *>10x* over post-training baselines.
I am very excited that AISI is announcing over £15M in funding for AI alignment and control, in partnership with other governments, industry, VCs, and philanthropists! Here is a 🧵 about why it is important to bring more independent ideas and expertise into this space.
📢Introducing the Alignment Project: A new fund for research on urgent challenges in AI alignment and control, backed by over £15 million. ▶️ Up to £1 million per project ▶️ Compute access, venture capital investment, and expert support Learn more and apply ⬇️
There's an ongoing societal phase transition in consumption: technology CEVs away the externalities associated with sugar, alcohol, tobacco,... via aspartame, non-alcoholic beer, vapes. I'd like a nice term for this trend--is there one? What else, like Ozempic, rhymes with this?
Short background note about relativisation in debate protocols: if we want to model AI training protocols, we need results that hold even if our source of truth (humans for instance) is a black box that can't be introspected. 🧵