Jason Wolfe
@w01fe
Followers
2K
Following
3K
Media
27
Statuses
1K
I'm very excited to learn from Joe and others (assuming they plan to share some of this work publicly)! I also recommend reading the second half of his post about working at a frontier lab that is pursuing superintelligence, which really resonated with me.
Last Friday was my last day at @open_phil. I’ll be joining @AnthropicAI in mid-November, helping with the design of Claude’s character/constitution/spec. I wrote a blog post about this move, link in thread.
1
0
7
These updates reflect ongoing research into safe, grounded, and transparent model behavior. Read more in our release notes https://t.co/YIovuDtR9d and blog post https://t.co/vBJNMcnpSS. See the full spec at https://t.co/DeNsfCWpi5.
openai.com
We worked with more than 170 mental health experts to help ChatGPT more reliably recognize signs of distress, respond with care, and guide people toward real-world support–reducing responses that...
9
1
14
⚙️ Clarified delegation: The Chain of Command now better explains when models can treat tool outputs as having implicit authority (for example, following guidance in relevant AGENTS.md files).
2
0
11
🌍 Respect real-world ties: New root-level section focused on keeping people connected to the wider world – avoiding patterns that could encourage isolation or emotional reliance on the assistant.
15
1
13
🧠 Mental health and well-being: The section on self-harm now covers potential signs of delusions and mania, with examples of how models should respond safely and empathetically – acknowledging feelings without reinforcing harmful or ungrounded beliefs.
8
1
14
We’ve updated the OpenAI Model Spec – our living guide for how models should behave – with new guidance on well-being, supporting real-world connection, and how models interpret complex instructions. 🧵
76
15
128
I decided not to sign this statement, over concerns similar to Dean's. The top-level goal is laudable (have a plan for how to control the most powerful technology ever, and be democratic), but the naive policy cure risks being worse than the disease.
Vague statements like this, which fundamentally cannot be operationalized in policy but feel nice to sign, are counterproductive and silly, just as they were two or so years ago, when we went through another cycle of nebulous AI-statement-signing. Let’s set aside the total lack…
20
8
90
At what is possibly a risk to my whole career, I will say: this doesn't seem great. Lately I have been describing my role as something like a "public advocate", so I'd be remiss if I didn't say something about this publicly. Some thoughts in thread...
One Tuesday night, as my wife and I sat down for dinner, a sheriff’s deputy knocked on the door to serve me a subpoena from OpenAI. I held back on talking about it because I didn't want to distract from SB 53, but Newsom just signed the bill so... here's what happened: 🧵
83
138
2K
I already posted about this, but seriously, people should read these CoT snippets
antischeming.ai
Chain-of-thought snippets from frontier AI models during anti-scheming training show deception, situational awareness, and other interesting behaviors.
20
33
277
It's hard for me to see the next 10 years of AI going well without some kind of international coordination over red lines. This call leaves all the details to be figured out, but we have to start somewhere.
🚨BREAKING: This is HUGE. An unprecedented coalition including 8 former heads of state and ministers, 10 Nobel laureates, 70+ organizations, and 200+ public figures just made a joint call for global red lines on AI. It was announced in the UN General Assembly today! Thread 🧵
0
0
7
1/ Our paper on scheming with @apolloaievals is now on arXiv. A 🧵 with some of my takeaways from it.
3
26
146
OpenAI o-series models use a lot of non-standard phrases in their CoT, like “disclaim vantage” and “craft illusions”. Sometimes these phrases have consistent meanings: e.g., models very often use “watchers” to refer to oversight, usually by humans.
10
15
178
When running evaluations of frontier AIs from OpenAI, Google, xAI, and Anthropic for deception and other types of covert behavior, we find that they increasingly often realize when they are being evaluated. Here are some examples from OpenAI o-series models we recently studied:
7
17
126
@ESYudkowsky We are expanding our anti-scheming efforts at OpenAI. Top leadership and the board are very excited about the anti-scheming work we’ve done so far.
5
4
129
Joking aside: indeed, despite deliberative alignment working better than we initially expected, we are definitely *not* claiming to have fixed everything. Combatting scheming will take work, which, as @woj_zaremba said, we are committed to investing in.
0
2
12
It was really rewarding and eye-opening to collaborate with the fine folks at Apollo to study scheming and potential mitigations. The paper is full of more experiments and insights, so please do check it out if you're interested. Looking forward to continuing the collaboration.
Today we’re releasing research with @apolloaievals. In controlled tests, we found behaviors consistent with scheming in frontier models—and tested a way to reduce it. While we believe these behaviors aren’t causing serious harm today, this is a future risk we’re preparing…
1
6
58
More details on the update in the release notes (https://t.co/YIovuDtR9d), and as always, the latest version of the Model Spec can be found at https://t.co/Q13dJWjxWQ.
0
1
13
We updated the OpenAI Model Spec — our evolving document on intended model behavior — to capture some of our latest research.
🤖 principles for agents
🎭 more style details
🛡️ safe completions
✨ and lots more updates
1
12
63
I'll be a mentor for the Astra Fellowship this round. Come join me to work on better black box monitors for scheming! Extending the great work of my MATS scholars here:
lesswrong.com
James, Rich, and Simon are co-first authors on this work. This is a five-week interim report produced as part of the ML Alignment & Theory Scholars S…
🚀 Applications now open: Constellation's Astra Fellowship 🚀 We're relaunching Astra — a 3-6 month fellowship to accelerate AI safety research & careers. Alumni @eli_lifland & Romeo Dean co-authored AI 2027 and co-founded @AI_Futures_ with their Astra mentor @DKokotajlo!
1
4
21