Jason Wolfe
@w01fe
Followers
2K
Following
3K
Media
27
Statuses
1K
I'm very excited to learn from Joe and others (assuming they plan to share some of this work publicly)! I also recommend reading the second half of his post about working at a frontier lab that is pursuing superintelligence, which really resonated with me.
Last Friday was my last day at @open_phil. I’ll be joining @AnthropicAI in mid-November, helping with the design of Claude’s character/constitution/spec. I wrote a blog post about this move, link in thread.
1
0
7
These updates reflect ongoing research into safe, grounded, and transparent model behavior. Read more in our release notes https://t.co/YIovuDtR9d and blog post https://t.co/vBJNMcnpSS. See the full spec at https://t.co/DeNsfCWpi5.
openai.com
We worked with more than 170 mental health experts to help ChatGPT more reliably recognize signs of distress, respond with care, and guide people toward real-world support–reducing responses that...
9
1
14
⚙️ Clarified delegation: The Chain of Command now better explains when models can treat tool outputs as having implicit authority (for example, following guidance in relevant AGENTS.md files).
2
0
11
🌍 Respect real-world ties: New root-level section focused on keeping people connected to the wider world – avoiding patterns that could encourage isolation or emotional reliance on the assistant.
15
1
13
🧠 Mental health and well-being: The section on self-harm now covers potential signs of delusions and mania, with examples of how models should respond safely and empathetically – acknowledging feelings without reinforcing harmful or ungrounded beliefs.
8
1
14
We’ve updated the OpenAI Model Spec – our living guide for how models should behave – with new guidance on well-being, supporting real-world connection, and how models interpret complex instructions. 🧵
76
15
128
I decided not to sign this statement, over concerns similar to Dean's. The top-level goal is laudable (have a plan for how to control the most powerful technology ever, and be democratic), but the naive policy cure risks being worse than the disease.
Vague statements like this, which fundamentally cannot be operationalized in policy but feel nice to sign, are counterproductive and silly, just as they were two or so years ago, when we went through another cycle of nebulous AI-statement-signing. Let’s set aside the total lack…
20
8
90
At what is possibly a risk to my whole career, I will say: this doesn't seem great. Lately I have been describing my role as something like a "public advocate", so I'd be remiss if I didn't say something about this publicly. Some thoughts in thread...
One Tuesday night, as my wife and I sat down for dinner, a sheriff’s deputy knocked on the door to serve me a subpoena from OpenAI. I held back on talking about it because I didn't want to distract from SB 53, but Newsom just signed the bill so... here's what happened: 🧵
83
138
2K
I already posted about this, but seriously, people should read these CoT snippets
antischeming.ai
Chain-of-thought snippets from frontier AI models during anti-scheming training show deception, situational awareness, and other interesting behaviors.
20
33
277
It's hard for me to see the next 10 years of AI going well without some kind of international coordination over red lines. This call leaves all the details to be figured out, but we have to start somewhere.
🚨BREAKING: This is HUGE. An unprecedented coalition including 8 former heads of state and ministers, 10 Nobel laureates, 70+ organizations, and 200+ public figures just made a joint call for global red lines on AI. It was announced in the UN General Assembly today! Thread 🧵
0
0
7
1/ Our paper on scheming with @apolloaievals is now on arXiv. A 🧵 with some of my takeaways from it.
3
26
146
OpenAI o-series models use a lot of non-standard phrases in their CoT, like “disclaim vantage” and “craft illusions”. Sometimes these phrases have consistent meanings: e.g., models very often use “watchers” to refer to oversight, usually by humans.
10
15
178
When running evaluations of frontier AIs from OpenAI, Google, xAI, and Anthropic for deception and other types of covert behavior, we find that they increasingly often realize when they are being evaluated. Here are some examples from OpenAI o-series models we recently studied:
7
17
126
@ESYudkowsky We are expanding our anti-scheming efforts at OpenAI. Top leadership and the board are very excited about the anti-scheming work we’ve done so far.
5
4
129
Joking aside: indeed, despite deliberative alignment working better than we initially expected, we are definitely *not* claiming to have fixed everything. Combatting scheming will take work, which, as @woj_zaremba said, we are committed to investing in.
0
2
12
It was really rewarding and eye-opening to collaborate with the fine folks at Apollo to study scheming and potential mitigations. The paper is full of more experiments and insights, so please do check it out if you're interested. Looking forward to continuing the collaboration.
Today we’re releasing research with @apolloaievals. In controlled tests, we found behaviors consistent with scheming in frontier models—and tested a way to reduce it. While we believe these behaviors aren’t causing serious harm today, this is a future risk we’re preparing…
1
6
58
More details on the update in the release notes (https://t.co/YIovuDtR9d), and as always, the latest version of the Model Spec can be found at https://t.co/Q13dJWjxWQ.
0
1
13
We updated the OpenAI Model Spec — our evolving document on intended model behavior — to capture some of our latest research.
🤖 principles for agents
🎭 more style details
🛡️ safe completions
✨ and lots more updates
1
12
63
I'll be a mentor for the Astra Fellowship this round. Come join me to work on better black box monitors for scheming! Extending the great work of my MATS scholars here:
lesswrong.com
James, Rich, and Simon are co-first authors on this work. This is a five-week interim report produced as part of the ML Alignment & Theory Scholars S…
🚀 Applications now open: Constellation's Astra Fellowship 🚀 We're relaunching Astra — a 3-6 month fellowship to accelerate AI safety research & careers. Alumni @eli_lifland & Romeo Dean co-authored AI 2027 and co-founded @AI_Futures_ with their Astra mentor @DKokotajlo!
1
4
21