Jacob Pfau
@jacob_pfau
2K Followers · 28K Following · 57 Media · 786 Statuses
Alignment at UKAISI and PhD student at NYU
London
Joined June 2019
An interesting background question here is whether, for internal deployment purposes, capability improvements are becoming more or less continuous.
If they also know when to revisit their previous work, they can then work over arbitrary time horizons.
Ofc the METR methodology will break down before this, so empirically it's not very useful. But conceptually, there will be some point where models can reliably take over R&D work--training, new infra (MCPs, caching protocols...).
Then by modeling some ability to regenerate, or continuously deploy model generations, you can predict this point. Surprised I haven't seen this mentioned before, has someone written about this? The only thing that comes to mind is @TomDavidsonX 's SWE intelligence explosion.
I've never been compelled by 'continual learning' as a bottleneck, but I do like thinking in terms of time horizons. Here's a time-horizons spin on continual learning: Escape velocity: the point at which models improve at more than unit dh/dt, i.e. gain more than one hour of task horizon h per hour of wall-clock time t.
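A minimal sketch of this escape-velocity condition, assuming horizons grow exponentially with a fixed doubling time (the 2-hour horizon and 7-month doubling time in the example are hypothetical, not the tweet's claims):

```python
import math

def escape_velocity_time(h0_hours, doubling_hours):
    """Wall-clock hours until dh/dt >= 1, assuming exponential growth
    h(t) = h0 * 2**(t / T) with doubling time T. Since
    dh/dt = (ln 2 / T) * h(t), escape velocity is reached once
    h(t) >= T / ln 2."""
    h_escape = doubling_hours / math.log(2)
    if h0_hours >= h_escape:
        return 0.0  # already past escape velocity
    return doubling_hours * math.log2(h_escape / h0_hours)

# e.g. a 2-hour horizon doubling every ~7 months (~5110 hours)
t_escape = escape_velocity_time(2, 7 * 730)
```

One design note: under pure exponential growth the condition dh/dt >= 1 reduces to a threshold on h itself (h >= T / ln 2), so a shorter doubling time lowers the horizon needed for escape.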
It's difficult to correct for this and get the SotA scaffold for all datapoints. Ideally @METR_Evals could plot both the fixed- and SotA-scaffold points going forwards so we can check the significance of this methodological choice.
Labs are now putting substantial effort into optimizing (multi-)agentic scaffolds. I expect METR's fixed-scaffold time horizon estimates will increasingly underestimate capabilities.
We estimate that Claude Sonnet 4.5 has a 50%-time-horizon of around 1 hr 53 min (95% confidence interval of 50 to 235 minutes) on our agentic multi-step software engineering tasks. This estimate is lower than the current highest time-horizon point estimate of around 2 hr 15 min.
The application link: https://t.co/zFs4GrERli The MATS page has more info on our stream: https://t.co/cdma5D63Zg 3/3
matsprogram.org
A (non-exhaustive) list of topics I'm interested in supervising on is here https://t.co/YgQy1Nf4Hw 2/3
alignmentproject.aisi.gov.uk
Stress-test AI agents and prove when they can’t game, sandbag or exploit rewards.
Apply to work with me and @ihsgnef through MATS this Winter! Deadline is Oct 2. We're focused on scalable oversight: methods, evals, safety-casing. 1/3
Type of guy who excitedly shows his friends the posterior over possible photos taken during his vacation
I'm curious what the right constraint is to minimize this unwanted effect. The desiderata feel similar to those for distribution-shift-robust methods--but AFAIK these don't work very well. Probably we just have to wait for scale to solve this. In the meantime:
Historically, pixels were inferred--but of me! I.e. the algorithm was (almost) independent of the distribution of humans in iPhone photos. Not so for approximate, learned denoisers. The resulting photo is no longer purely of me; it is also of others.
The debate around whether every pixel in a photo from your phone's camera is "real" misses a fundamental fact about how digital cameras have worked for the last 20 years. The camera sensor only captures ONE color (red, green, or blue) per pixel. The rest are made up 1/4
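The "almost content-independent" inference the tweet describes is classical demosaicing. As a toy sketch, here is bilinear interpolation of the green channel from an RGGB Bayer mosaic; the function name and pattern layout are illustrative, not any phone's actual pipeline:

```python
import numpy as np

def demosaic_green_bilinear(bayer):
    """Recover a full green channel from an RGGB Bayer mosaic by
    averaging the measured green neighbours at each red/blue site.
    The rule is fixed arithmetic, (almost) independent of what the
    photo depicts--unlike a learned denoiser."""
    h, w = bayer.shape
    green = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            if (y + x) % 2 == 1:           # green sites in RGGB
                green[y, x] = bayer[y, x]
            else:                           # red/blue sites: interpolate
                nbrs = [bayer[y + dy, x + dx]
                        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                        if 0 <= y + dy < h and 0 <= x + dx < w]
                green[y, x] = sum(nbrs) / len(nbrs)
    return green
```

A learned denoiser would replace the fixed neighbour average with a function fit to a photo distribution, which is exactly where the "photo of others" concern enters.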
🧵 New paper from @AISecurityInst x @AiEleuther that I led with Kyle O’Brien: Open-weight LLM safety is both important & neglected. But we show that filtering dual-use knowledge from pre-training data improves tamper resistance *>10x* over post-training baselines.
I am very excited that AISI is announcing over £15M in funding for AI alignment and control, in partnership with other governments, industry, VCs, and philanthropists! Here is a 🧵 about why it is important to bring more independent ideas and expertise into this space.
📢Introducing the Alignment Project: A new fund for research on urgent challenges in AI alignment and control, backed by over £15 million. ▶️ Up to £1 million per project ▶️ Compute access, venture capital investment, and expert support Learn more and apply ⬇️
There's an ongoing societal phase transition in consumption: technology CEVs away the externalities associated with sugar, alcohol, tobacco,... via aspartame, non-alcoholic beer, vapes. I'd like a nice term for this trend--is there one? What else, like Ozempic, rhymes with this?
Short background note about relativisation in debate protocols: if we want to model AI training protocols, we need results that hold even if our source of truth (humans for instance) is a black box that can't be introspected. 🧵