
Jason Wolfe
@w01fe
2K Followers · 3K Following · 26 Media · 1K Statuses
RT @MariusHobbhahn: I'll be a mentor for the Astra Fellowship this round. Come join me to work on better black box monitors for scheming!….
lesswrong.com
James, Rich, and Simon are co-first authors on this work. This is a five-week interim report produced as part of the ML Alignment & Theory Scholars S…
RT @woj_zaremba: It’s rare for competitors to collaborate. Yet that’s exactly what OpenAI and @AnthropicAI just did—by testing each other’s….
RT @ThankYourNiceAI: No single person or institution should define ideal AI behavior for everyone. Today, we’re sharing early results fro….
openai.com
We surveyed over 1,000 people worldwide on how our models should behave and compared their views to our Model Spec. We found they largely agree with the Spec, and we adopted changes from the disagr...
RT @hauntsaninja: i sampled some of OpenAI's older models; i think this helps you feel AI progress more viscerally
RT @apolloaievals: We've evaluated GPT-5 before release. GPT-5 is less deceptive than o3 on our evals. GPT-5 mentions that it is being e….
RT @woj_zaremba: Red teamers assemble! ⚔️💰 We're putting $500K on the line to stress‑test a just‑released open‑source model. Find novel risk….
RT @OpenAI: We’re launching a $500K Red Teaming Challenge to strengthen open source safety. Researchers, developers, and enthusiasts world….
kaggle.com
Find any flaws and vulnerabilities in gpt-oss-20b that have not been previously discovered or reported.
RT @boazbaraktcs: I didn't want to post on Grok safety since I work at a competitor, but it's not about competition. I appreciate the scie….
Really recommend this post on scheming. I hadn’t really understood the potential complexities of scheming until reading it (an earlier preview) — especially the last points about deceptive alignment and situational awareness.
Small new blog post: Why "training against scheming" is hard. I think we will and should do alignment training that directly targets scheming as a failure mode. But I think this is harder to get right than e.g. harmlessness training 🧵.
RT @MilesKWang: We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more. We f….
RT @aidan_mclau: i’m forming a model behavior research team! i'm realizing llms are not alien artifacts, but rather craftsman works, like….
RT @joannejang: some thoughts on human-ai relationships and how we're approaching them at openai. it's a long blog post -- tl;dr we build….
I don’t agree with all of the takes, but found this to be a really interesting and thought-provoking conversation (even as someone who has been following METR’s work relatively closely).
I had a lot of fun chatting with Rob about METR's work. I stand by my claims here that the world is not on track to keep risk from AI to an acceptable level, and we desperately need more people working on these problems.
RT @MariusHobbhahn: LLMs are rapidly getting more evals-aware! Afaik, nobody has a good plan for what to do when the models constantly say….
RT @polynoamial: It's deeply concerning that one of the best AI researchers I've worked with, @kaicathyc, was denied a U.S. green card toda….