Jason Wolfe

@w01fe

Followers: 2K · Following: 3K · Media: 27 · Statuses: 1K

alignment and the model spec @OpenAI

Joined May 2010
@w01fe
Jason Wolfe
5 days
I'm very excited to learn from Joe and others (assuming they plan to share some of this work publicly)! I also recommend reading the second half of his post about working at a frontier lab that is pursuing superintelligence, which really resonated with me.
@jkcarlsmith
Joe Carlsmith
5 days
Last Friday was my last day at @open_phil. I’ll be joining @AnthropicAI in mid-November, helping with the design of Claude’s character/constitution/spec. I wrote a blog post about this move, link in thread.
1 reply · 0 retweets · 7 likes
@w01fe
Jason Wolfe
12 days
⚙️ Clarified delegation: The Chain of Command now better explains when models can treat tool outputs as having implicit authority (for example, following guidance in relevant AGENTS.md files).
2 replies · 0 retweets · 11 likes
@w01fe
Jason Wolfe
12 days
🌍 Respect real-world ties: New root-level section focused on keeping people connected to the wider world – avoiding patterns that could encourage isolation or emotional reliance on the assistant.
15 replies · 1 retweet · 13 likes
@w01fe
Jason Wolfe
12 days
🧠 Mental health and well-being: The section on self-harm now covers potential signs of delusions and mania, with examples of how models should respond safely and empathetically – acknowledging feelings without reinforcing harmful or ungrounded beliefs.
8 replies · 1 retweet · 14 likes
@w01fe
Jason Wolfe
12 days
We’ve updated the OpenAI Model Spec – our living guide for how models should behave – with new guidance on well-being, supporting real-world connection, and how models interpret complex instructions. 🧵
76 replies · 15 retweets · 128 likes
@ARGleave
Adam Gleave
14 days
I decided not to sign this statement for concerns similar to Dean's. The top-level goal is laudable (have a plan for how to control the most powerful technology ever, and be democratic), but the naive policy cure risks being worse than the disease.
@deanwball
Dean W. Ball
17 days
Vague statements like this, which fundamentally cannot be operationalized in policy but feel nice to sign, are counterproductive and silly. Just as they were two or so years ago, when we went through another cycle of nebulous AI-statement-signing. Let’s set aside the total lack
20 replies · 8 retweets · 90 likes
@jachiam0
Joshua Achiam
29 days
At what is possibly a risk to my whole career I will say: this doesn't seem great. Lately I have been describing my role as something like a "public advocate" so I'd be remiss if I didn't share some thoughts for the public on this. Some thoughts in thread...
@_NathanCalvin
Nathan Calvin
29 days
One Tuesday night, as my wife and I sat down for dinner, a sheriff’s deputy knocked on the door to serve me a subpoena from OpenAI. I held back on talking about it because I didn't want to distract from SB 53, but Newsom just signed the bill so... here's what happened: 🧵
83 replies · 138 retweets · 2K likes
@w01fe
Jason Wolfe
2 months
It's hard for me to see the next 10 years of AI going well without some kind of international coordination over red lines. This call leaves all the details to be figured out, but we have to start somewhere.
@ai_ctrl
ControlAI
2 months
🚨BREAKING: This is HUGE. An unprecedented coalition including 8 former heads of state and ministers, 10 Nobel laureates, 70+ organizations, and 200+ public figures just made a joint call for global red lines on AI. It was announced in the UN General Assembly today! Thread 🧵
0 replies · 0 retweets · 7 likes
@boazbaraktcs
Boaz Barak
2 months
1/ Our paper on scheming with @apolloaievals is now on arXiv. A 🧵 with some of my takeaways from it.
3 replies · 26 retweets · 146 likes
@apolloaievals
Apollo Research
2 months
OpenAI o-series use a lot of non-standard phrases in their CoT, like “disclaim vantage” and “craft illusions”. Sometimes these phrases have consistent meaning: e.g. models very often use “watchers” to refer to oversight, usually by humans.
10 replies · 15 retweets · 178 likes
@apolloaievals
Apollo Research
2 months
When running evaluations of frontier AIs by OpenAI, Google, xAI and Anthropic for deception and other types of covert behavior, we increasingly find them realizing when they are being evaluated. Here are some examples from OpenAI o-series models we recently studied:
7 replies · 17 retweets · 126 likes
@woj_zaremba
Wojciech Zaremba
2 months
@ESYudkowsky We are expanding our anti-scheming efforts at OpenAI. Top leadership and the board are very excited about the anti-scheming work we’ve done so far.
5 replies · 4 retweets · 129 likes
@boazbaraktcs
Boaz Barak
2 months
Joking aside: despite deliberative alignment working better than we initially expected, we are definitely *not* claiming to have fixed everything. Combating scheming will take work, which, as @woj_zaremba said, we are committed to investing in.
0 replies · 2 retweets · 12 likes
@w01fe
Jason Wolfe
2 months
still here!
@ESYudkowsky
Eliezer Yudkowsky ⏹️
2 months
This is so much greater understanding of alignment theory than I expect from OpenAI that I predict the author will soon be fired from OpenAI or leave it. (Prove me wrong, guys.)
10 replies · 6 retweets · 295 likes
@w01fe
Jason Wolfe
2 months
It was really rewarding and eye-opening to collaborate with the fine folks at Apollo to study scheming and potential mitigations. The paper is full of more experiments and insights, so please do check it out if you're interested. Looking forward to continuing the collaboration.
@OpenAI
OpenAI
2 months
Today we’re releasing research with @apolloaievals. In controlled tests, we found behaviors consistent with scheming in frontier models—and tested a way to reduce it. While we believe these behaviors aren’t causing serious harm today, this is a future risk we’re preparing
1 reply · 6 retweets · 58 likes
@w01fe
Jason Wolfe
2 months
More details on the update in the release notes (https://t.co/YIovuDtR9d), and as always, the latest version of the Model Spec can be found at https://t.co/Q13dJWjxWQ.
0 replies · 1 retweet · 13 likes
@w01fe
Jason Wolfe
2 months
We updated the OpenAI Model Spec — our evolving document on intended model behavior — to capture some of our latest research.
🤖 principles for agents
🎭 more style details
🛡️ safe completions
✨ and lots more updates
1 reply · 12 retweets · 63 likes
@MariusHobbhahn
Marius Hobbhahn
2 months
I'll be a mentor for the Astra Fellowship this round. Come join me to work on better black box monitors for scheming! Extending the great work of my MATS scholars here:
lesswrong.com
James, Rich, and Simon are co-first authors on this work. This is a five-week interim report produced as part of the ML Alignment & Theory Scholars S…
@sleight_henry
🚀 Henry is launching the Astra Research Program!
2 months
🚀 Applications now open: Constellation's Astra Fellowship 🚀 We're relaunching Astra — a 3-6 month fellowship to accelerate AI safety research & careers. Alumni @eli_lifland & Romeo Dean co-authored AI 2027 and co-founded @AI_Futures_ with their Astra mentor @DKokotajlo!
1 reply · 4 retweets · 21 likes