Jacob Pfau
@jacob_pfau
Followers: 2K · Following: 28K · Media: 57 · Statuses: 786
Alignment at UKAISI and PhD student at NYU
London · Joined June 2019
@jacob_pfau
Jacob Pfau
20 days
An interesting background question here is whether, for internal deployment purposes, capability improvements are becoming more or less continuous.
0
0
0
@jacob_pfau
Jacob Pfau
21 days
If they also know when to revisit their previous work, they can then work over arbitrary time horizons.
1
0
0
@jacob_pfau
Jacob Pfau
21 days
Ofc the METR methodology will break down before this, so empirically it's not very useful. But conceptually there will be some point where models can reliably take over R&D work--training, new infra (MCPs, caching protocols...).
1
0
1
@jacob_pfau
Jacob Pfau
21 days
Then by modeling some ability to regenerate, or continuously deploy, model generations, you can predict this point. Surprised I haven't seen this mentioned before; has someone written about this? The only thing that comes to mind is @TomDavidsonX's SWE intelligence explosion.
1
0
2
@jacob_pfau
Jacob Pfau
21 days
I've never been compelled by 'continual learning' as a bottleneck, but I do like thinking in terms of time horizons. Here's a time-horizons spin on continual learning. Escape velocity: the point at which models' horizon grows faster than wall-clock time, dh/dt > 1, where h is the time horizon and t is wall-clock time.
2
0
2
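A minimal sketch of this escape-velocity condition, assuming an exponential horizon trend h(t) = h0 · 2^(t / T_double) fitted to made-up measurements (not real METR data). Setting dh/dt = h(t) · ln 2 / T_double = 1 gives the horizon at which a model gains more than an hour of horizon per hour of wall clock:

```python
import numpy as np

# Made-up horizon measurements (illustrative only, not real METR data):
# months since a reference date -> 50% time horizon in hours.
t_obs = np.array([0.0, 7.0, 14.0, 21.0])   # months
h_obs = np.array([0.25, 0.5, 1.0, 2.0])    # hours

# Fit log2 h = log2 h0 + t / T_double by least squares.
slope, intercept = np.polyfit(t_obs, np.log2(h_obs), 1)
T_double = 1.0 / slope       # months per doubling
h0 = 2.0 ** intercept        # horizon at t = 0, in hours

# h(t) = h0 * 2**(t / T_double), so dh/dt = h(t) * ln(2) / T_double.
# Escape velocity (dh/dt = 1 horizon-hour per wall-clock hour) is hit
# when h = T_double / ln(2), with T_double expressed in hours.
HOURS_PER_MONTH = 730.0
h_escape = T_double * HOURS_PER_MONTH / np.log(2)
t_escape = T_double * np.log2(h_escape / h0)   # months until dh/dt > 1

print(f"doubling time: {T_double:.1f} months")
print(f"escape-velocity horizon: {h_escape:.0f} hours")
print(f"reached in ~{t_escape:.0f} months")
```

Under a constant doubling time the trend always crosses dh/dt = 1 eventually; the interesting question in the thread is whether the METR-style measurement survives long enough to see it.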
@jacob_pfau
Jacob Pfau
1 month
It's difficult to correct for this and get the SotA scaffold for all datapoints. Ideally @METR_Evals could plot both the fixed-scaffold and SotA-scaffold points going forward so we can check the significance of this methodological choice.
0
0
7
@jacob_pfau
Jacob Pfau
1 month
Labs are now putting substantial effort into optimizing (multi-)agentic scaffolds. I expect METR's fixed-scaffold time horizon estimates will increasingly underestimate capabilities.
@METR_Evals
METR
1 month
We estimate that Claude Sonnet 4.5 has a 50%-time-horizon of around 1 hr 53 min (95% confidence interval of 50 to 235 minutes) on our agentic multi-step software engineering tasks. This estimate is lower than the current highest time-horizon point estimate of around 2 hr 15 min.
1
0
7
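For reference, a 50%-time-horizon estimate of this kind comes from fitting success probability against task length. A minimal sketch, assuming a logistic fit over log2 human-baseline duration with made-up outcome data (METR's actual pipeline differs in its details):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up task outcomes (illustrative only, not METR's data):
# human-baseline duration in minutes, and whether the agent succeeded.
duration = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
success = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

# Fit P(success) as a logistic function of log2(duration).
X = np.log2(duration).reshape(-1, 1)
clf = LogisticRegression().fit(X, success)

# The 50%-time-horizon is where the decision function crosses zero:
# w * log2(h50) + b = 0  =>  h50 = 2 ** (-b / w).
w, b = clf.coef_[0, 0], clf.intercept_[0]
h50 = 2.0 ** (-b / w)
print(f"50%-time-horizon ~= {h50:.0f} minutes")
```

A confidence interval like the 50-235 minutes quoted above would then come from bootstrapping this fit over tasks.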
@jacob_pfau
Jacob Pfau
1 month
The application link: https://t.co/zFs4GrERli
The MATS page has more info on our stream: https://t.co/cdma5D63Zg
3/3
matsprogram.org
0
0
0
@jacob_pfau
Jacob Pfau
1 month
A (non-exhaustive) list of topics I'm interested in supervising on is here https://t.co/YgQy1Nf4Hw 2/3
alignmentproject.aisi.gov.uk
Stress-test AI agents and prove when they can’t game, sandbag or exploit rewards.
1
0
1
@jacob_pfau
Jacob Pfau
1 month
Apply to work with me and @ihsgnef through MATS this Winter! Deadline is Oct 2. We're focused on scalable oversight: methods, evals, safety-casing. 1/3
1
1
2
@jacob_pfau
Jacob Pfau
3 months
Type of guy who excitedly shows his friends the posterior over possible photos taken during his vacation
0
0
3
@jacob_pfau
Jacob Pfau
3 months
I'm curious about what the right constraint is to minimize this unwanted effect. The desiderata feel similar to those for distribution-shift-robust methods--but AFAIK these don't work very well. Probably we just have to wait for scale to solve this. In the meantime:
1
0
1
@jacob_pfau
Jacob Pfau
3 months
Historically, pixels were inferred, but of me! I.e. the algorithm was (almost) independent of the distribution of humans in iPhone photos. Not so for approximate, learned denoisers. The resulting photo is no longer purely of me; the photo is also of others.
@docmilanfar
Peyman Milanfar
3 months
The debate around whether every pixel in a photo from your phone's camera is "real" misses a fundamental fact about how digital cameras have always worked for the last 20 years. The camera sensor only captures ONE color (red, green, or blue) per pixel. The rest are made up 1/4
1
2
5
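The contrast here is between interpolation rules like classical demosaicing, which fill in a missing pixel value from its measured neighbours alone, and learned denoisers that also draw on a prior over what photos tend to contain. A minimal sketch of the former, assuming an RGGB Bayer layout (a toy stand-in, not any camera vendor's actual pipeline):

```python
import numpy as np

def demosaic_green_bilinear(bayer):
    """Interpolate the green channel of an RGGB Bayer mosaic: each
    missing green value is the mean of its measured green neighbours.
    The rule uses only nearby sensor readings -- no learned prior over
    what photos tend to contain."""
    h, w = bayer.shape
    # In RGGB, green is measured at (even row, odd col) and (odd row, even col).
    gmask = np.zeros((h, w), dtype=bool)
    gmask[0::2, 1::2] = True
    gmask[1::2, 0::2] = True
    green = np.where(gmask, bayer, 0.0)
    # Fill red/blue sites with the mean of their (up to 4) green neighbours;
    # in RGGB every in-bounds 4-neighbour of a non-green site is green.
    for y, x in zip(*np.nonzero(~gmask)):
        neighbours = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
        vals = [bayer[yy, xx] for yy, xx in neighbours
                if 0 <= yy < h and 0 <= xx < w]
        green[y, x] = np.mean(vals)
    return green

mosaic = np.random.default_rng(0).random((6, 6))  # toy sensor readout
full_green = demosaic_green_bilinear(mosaic)      # (6, 6) green channel
```

Nothing in this rule depends on the distribution of photo subjects, which is the sense in which the inferred pixels were still "of me".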
@StephenLCasper
Cas (Stephen Casper)
3 months
🧵 New paper from @AISecurityInst x @AiEleuther that I led with Kyle O’Brien: Open-weight LLM safety is both important & neglected. But we show that filtering dual-use knowledge from pre-training data improves tamper resistance *>10x* over post-training baselines.
7
40
200
@geoffreyirving
Geoffrey Irving
3 months
I am very excited that AISI is announcing over £15M in funding for AI alignment and control, in partnership with other governments, industry, VCs, and philanthropists! Here is a 🧵 about why it is important to bring more independent ideas and expertise into this space.
@AISecurityInst
AI Security Institute
3 months
📢Introducing the Alignment Project: A new fund for research on urgent challenges in AI alignment and control, backed by over £15 million. ▶️ Up to £1 million per project ▶️ Compute access, venture capital investment, and expert support Learn more and apply ⬇️
9
28
165
@jacob_pfau
Jacob Pfau
3 months
Apotheosis of the simulacrum, clean hedonism, ...?
0
0
1
@jacob_pfau
Jacob Pfau
3 months
There's an ongoing societal phase transition in consumption: technology CEVs away the externalities associated with sugar, alcohol, tobacco,... via aspartame, nonalcoholic beer, vapes. I'd like a nice term for this trend, is there one? What else, like ozempic, rhymes with this?
3
0
5
@geoffreyirving
Geoffrey Irving
5 months
Short background note about relativisation in debate protocols: if we want to model AI training protocols, we need results that hold even if our source of truth (humans for instance) is a black box that can't be introspected. 🧵
1
2
8