GabrielDWu1 Profile Banner
Gabriel Wu Profile
Gabriel Wu

@GabrielDWu1

Followers
377
Following
574
Media
14
Statuses
118

alignment @openai

San Francisco, CA
Joined December 2018
Don't wanna be here? Send us removal request.
@GabrielDWu1
Gabriel Wu
3 days
An analogy: confessions are basically just a CoT monitor in which - the monitor is trained to be an accurate reporter - the actor and monitor share weights - the rollout is provided to the monitor as a prefix to the conversation, rather than embedded in a user message
0
2
5
@GabrielDWu1
Gabriel Wu
3 days
Interested in seeing how far we can push this line of research! I think that one day, confessions could serve a similarly important role as CoT monitoring for detecting scheming and other model misbehavior.
@boazbaraktcs
Boaz Barak
3 days
1/5 Excited to announce our paper on confessions! We train models to honestly report whether they “hacked”, “cut corners”, “sandbagged” or otherwise deviated from the letter or spirit of their instructions. @ManasJoglekar Jeremy Chen @GabrielDWu1 @jasonyo @j_asminewang
1
4
14
@boazbaraktcs
Boaz Barak
3 days
1/5 Excited to announce our paper on confessions! We train models to honestly report whether they “hacked”, “cut corners”, “sandbagged” or otherwise deviated from the letter or spirit of their instructions. @ManasJoglekar Jeremy Chen @GabrielDWu1 @jasonyo @j_asminewang
7
20
153
@j_asminewang
Jasmine @ NeurIPS 11/30-12/6
5 days
Today, OpenAI is launching a new Alignment Research blog: a space for publishing more of our work on alignment and safety more frequently, and for a technical audience. https://t.co/n3oIhyDZHd
38
136
1K
@nabla_theta
Leo Gao
23 days
Excited to share our latest work on untangling language models by training them with extremely sparse weights! We can isolate tiny circuits inside the model responsible for various simple behaviors and understand them unprecedentedly well.
Tweet card summary image
openai.com
We trained models to think in simpler, more traceable steps—so we can better understand how they work.
20
53
418
@GabrielDWu1
Gabriel Wu
2 months
Let me know if you give this problem a try, or check out my new blog post for the solution! (it's very cool imo): https://t.co/gsXUNsxYk2 (Hint: The solution has been a feature of my X account for the past two years) (3/3)
gabrieldwu.com
A Game of Leapfrog
0
0
2
@GabrielDWu1
Gabriel Wu
2 months
...jumpee but lands on the opposite side. At any given time, say the frogs are at positions p_1 <= p_2 <= p_3. Define their aspect ratio to be (p_2 - p_1) / (p_3 - p_1). Question: What's the limiting distribution of this aspect ratio? (2/3)
1
0
2
@GabrielDWu1
Gabriel Wu
2 months
Three frogs are on a number line, initially equally spaced. Every second, a random frog jumps over a random other frog. When a frog at position p jumps over a frog at position q, its new position becomes p+2(q-p). That is, the jumper keeps the same distance from the... (1/3)
1
0
3
@RyanPGreenblatt
Ryan Greenblatt
2 months
Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT) while OpenAI claims they don't. AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing better, all AI companies should say more. 1/
17
26
375
@Miles_M_K
Miles Kodama
3 months
Only the US can make us ready for AGI, but Europe just made us readier. The EU's new Code of Practice is an incremental but important step towards managing rapid AI progress safely. My new piece explains what it does, why it matters, and why it's not enough.
6
16
66
@fulcrumML
Fulcrum
3 months
Today, we’re announcing Fulcrum Research, a startup scaling human oversight. We are building debuggers that tell you why your agents fail, and what your rewards are truly testing for—the first step toward the inference-time infrastructure required to safely deploy agents.
20
29
196
@GabrielDWu1
Gabriel Wu
5 months
Quite surprised by this!
@METR_Evals
METR
5 months
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
1
0
10
@DKokotajlo
Daniel Kokotajlo
8 months
"How, exactly, could AI take over by 2027?" Introducing AI 2027: a deeply-researched scenario forecast I wrote alongside @slatestarcodex, @eli_lifland, and @thlarsen
411
1K
5K
@METR_Evals
METR
9 months
When will AI systems be able to carry out long projects independently? In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.
164
866
5K
@ericneyman
Eric Neyman
10 months
I just wrote a new blog post, in which I present a complexity-theoretic conjecture that is (IMO) really interesting. As I'll explain below, I think the conjecture could be compelling to theoretical computer scientists, philosophers of mathematics, and technical AI safety people!
3
16
103
@saprmarks
Samuel Marks
11 months
What can AI researchers do *today* that AI developers will find useful for ensuring the safety of future advanced AI systems? To ring in the new year, the Anthropic Alignment Science team is sharing some thoughts on research directions we think are important.
10
67
326
@alxndrdavies
Xander Davies
11 months
Congrats to Harvard's AI Safety Student Team for producing lots of interesting research this year! Esp to the undergrads—can be tricky to find time for research. Proud of the work @GabrielDWu1, @nikolaj2030, and many others have done in leading the group.
1
2
54
@GabrielDWu1
Gabriel Wu
1 year
We are excited to see future works that 1) generalize our setting beyond single-token behaviors and 2) develop better LPE methods, especially by improving activation extrapolation (perhaps with layer-by-layer activation modeling: https://t.co/K7L0NaB0fN)
alignment.org
Machine learning systems are typically trained to maximize average-case performance. However, this method of training can fail to meaningfully control the probability of tail events that might cause...
1
0
7
@GabrielDWu1
Gabriel Wu
1 year
We empirically find that importance sampling methods (MHIS and ITGIS) outperform activation extrapolation methods (QLD and GLD), but both outperform naive sampling (Optimal Constant). (Each point represents performance on a different input distribution and model size.)
1
0
3