Gabriel Wu
@GabrielDWu1
Followers
377
Following
574
Media
14
Statuses
118
alignment @openai
San Francisco, CA
Joined December 2018
An analogy: confessions are basically just a CoT monitor in which - the monitor is trained to be an accurate reporter - the actor and monitor share weights - the rollout is provided to the monitor as a prefix to the conversation, rather than embedded in a user message
0
2
5
Interested in seeing how far we can push this line of research! I think that one day, confessions could serve a similarly important role as CoT monitoring for detecting scheming and other model misbehavior.
1/5 Excited to announce our paper on confessions! We train models to honestly report whether they “hacked”, “cut corners”, “sandbagged” or otherwise deviated from the letter or spirit of their instructions. @ManasJoglekar Jeremy Chen @GabrielDWu1 @jasonyo @j_asminewang
1
4
14
1/5 Excited to announce our paper on confessions! We train models to honestly report whether they “hacked”, “cut corners”, “sandbagged” or otherwise deviated from the letter or spirit of their instructions. @ManasJoglekar Jeremy Chen @GabrielDWu1 @jasonyo @j_asminewang
7
20
153
Today, OpenAI is launching a new Alignment Research blog: a space for publishing more of our work on alignment and safety more frequently, and for a technical audience. https://t.co/n3oIhyDZHd
38
136
1K
Excited to share our latest work on untangling language models by training them with extremely sparse weights! We can isolate tiny circuits inside the model responsible for various simple behaviors and understand them unprecedentedly well.
openai.com
We trained models to think in simpler, more traceable steps—so we can better understand how they work.
20
53
418
Let me know if you give this problem a try, or check out my new blog post for the solution! (it's very cool imo): https://t.co/gsXUNsxYk2 (Hint: The solution has been a feature of my X account for the past two years) (3/3)
gabrieldwu.com
A Game of Leapfrog
0
0
2
...jumpee but lands on the opposite side. At any given time, say the frogs are at positions p_1 <= p_2 <= p_3. Define their aspect ratio to be (p_2 - p_1) / (p_3 - p_1). Question: What's the limiting distribution of this aspect ratio? (2/3)
1
0
2
Three frogs are on a number line, initially equally spaced. Every second, a random frog jumps over a random other frog. When a frog at position p jumps over a frog at position q, its new position becomes p+2(q-p). That is, the jumper keeps the same distance from the... (1/3)
1
0
3
Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT) while OpenAI claims they don't. AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing better, all AI companies should say more. 1/
17
26
375
Only the US can make us ready for AGI, but Europe just made us readier. The EU's new Code of Practice is an incremental but important step towards managing rapid AI progress safely. My new piece explains what it does, why it matters, and why it's not enough.
6
16
66
Today, we’re announcing Fulcrum Research, a startup scaling human oversight. We are building debuggers that tell you why your agents fail, and what your rewards are truly testing for—the first step toward the inference-time infrastructure required to safely deploy agents.
20
29
196
Quite surprised by this!
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
1
0
10
"How, exactly, could AI take over by 2027?" Introducing AI 2027: a deeply-researched scenario forecast I wrote alongside @slatestarcodex, @eli_lifland, and @thlarsen
411
1K
5K
When will AI systems be able to carry out long projects independently? In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.
164
866
5K
I just wrote a new blog post, in which I present a complexity-theoretic conjecture that is (IMO) really interesting. As I'll explain below, I think the conjecture could be compelling to theoretical computer scientists, philosophers of mathematics, and technical AI safety people!
3
16
103
What can AI researchers do *today* that AI developers will find useful for ensuring the safety of future advanced AI systems? To ring in the new year, the Anthropic Alignment Science team is sharing some thoughts on research directions we think are important.
10
67
326
Congrats to Harvard's AI Safety Student Team for producing lots of interesting research this year! Esp to the undergrads—can be tricky to find time for research. Proud of the work @GabrielDWu1, @nikolaj2030, and many others have done in leading the group.
1
2
54
For more details on our methods and motivations, see our blog post: https://t.co/n2LBmwyYNX or full paper: https://t.co/C7YKxoK5Ie This was joint research with @JacobHHilton.
arxiv.org
We consider the problem of low probability estimation: given a machine learning model and a formally-specified input distribution, how can we estimate the probability of a binary property of the...
0
0
9
We are excited to see future works that 1) generalize our setting beyond single-token behaviors and 2) develop better LPE methods, especially by improving activation extrapolation (perhaps with layer-by-layer activation modeling: https://t.co/K7L0NaB0fN)
alignment.org
Machine learning systems are typically trained to maximize average-case performance. However, this method of training can fail to meaningfully control the probability of tail events that might cause...
1
0
7
We empirically find that importance sampling methods (MHIS and ITGIS) outperform activation extrapolation methods (QLD and GLD), but both outperform naive sampling (Optimal Constant). (Each point represents performance on a different input distribution and model size.)
1
0
3