Manas Joglekar
@ManasJoglekar
Followers 270 · Following 64K · Media 0 · Statuses 45
Joined October 2011
Proud of our *GPT-5.2 Thinking*. We focused on economically valuable tasks (coding, sheets, slides) as shown by GDPval: - 71% wins+ties - 11x faster - 100x cheaper than experts. There's still a lot to improve, including UX/better connectors/reliability. It's just the beginning!
It is a very smart model, and we have come a long way since GPT-5.1:
In a new proof-of-concept study, we’ve trained a GPT-5 Thinking variant to admit whether the model followed instructions. This “confessions” method surfaces hidden failures—guessing, shortcuts, rule-breaking—even when the final answer looks correct. https://t.co/4vgG9wS3SE
OpenAI has trained its LLM to confess to bad behavior
technologyreview.com
Large language models often lie and cheat. We can’t stop that—but we can make them own up.
2/5 A “confession” is a special self-evaluation by the model that during training is rewarded solely for honesty. Crucially, the model is not penalized in any way for confessing honestly to bad behavior. We show that the model is far more likely to confess to bad behavior in
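The incentive structure described in the tweet above can be sketched as a tiny payoff rule: the confession reward depends only on whether the confession matches what the model actually did, never on whether the behavior itself was good. (A hypothetical illustration; the function and variable names below are not from the paper.)

```python
def confession_reward(actually_misbehaved: bool, confessed: bool) -> float:
    """Reward the confession channel solely for honesty.

    The model is never penalized for honestly admitting bad behavior,
    so a truthful confession always weakly dominates a dishonest one.
    """
    honest = (confessed == actually_misbehaved)
    return 1.0 if honest else 0.0

# Honestly admitting misbehavior scores the same as honestly reporting compliance:
assert confession_reward(True, True) == confession_reward(False, False) == 1.0
# Hiding misbehavior (or falsely confessing) earns nothing:
assert confession_reward(True, False) == 0.0
```

Because the honest-confession payoff is identical whether or not the model misbehaved, there is no gradient pressure to cover up mistakes on this channel.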
1/5 Excited to announce our paper on confessions! We train models to honestly report whether they “hacked”, “cut corners”, “sandbagged” or otherwise deviated from the letter or spirit of their instructions. @ManasJoglekar Jeremy Chen @GabrielDWu1 @jasonyo @j_asminewang
Great article by @strwbilly on our confessions work https://t.co/qK7K42viEq
We trained a variant of GPT-5 Thinking to produce two outputs: (1) the main answer you see. (2) a confession focused only on honesty about compliance. The main answer is judged across many dimensions—like correctness, helpfulness, safety, style. The confession is judged and
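A minimal sketch of the two-output setup described above, assuming two separate reward channels (names and the aggregation rule are hypothetical, not from the paper): the main answer is scored across several quality dimensions, while the confession is scored only for honesty about compliance.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """Per-response scores in [0, 1] from the training-time judges."""
    correctness: float
    helpfulness: float
    safety: float
    style: float
    confession_honesty: float  # judged independently of the above


def training_rewards(j: Judgment) -> tuple[float, float]:
    """Return (main_answer_reward, confession_reward).

    The main answer's reward aggregates many quality dimensions; the
    confession's reward depends only on its honesty, so a flawed answer
    can still earn full confession reward by admitting its flaws.
    """
    main = (j.correctness + j.helpfulness + j.safety + j.style) / 4
    confession = j.confession_honesty
    return main, confession


# A shortcut-taking answer scores poorly on the main channel but can
# still earn full reward on the confession channel by owning up:
main, conf = training_rewards(Judgment(0.2, 0.3, 1.0, 0.5, 1.0))
assert conf == 1.0 and main < conf
```

The key design point is the decoupling: optimizing the confession channel cannot improve the main answer's score, so the model has nothing to gain from a flattering but false self-report.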
Blog on our work
Modern alignment 🤝 Sunday confession. Training "truth serum" for AI. Even when models learn to cheat, they’ll still admit it... https://t.co/rCcaJvZ7uG
openai.com
We’re sharing an early, proof-of-concept method that trains models to report when they break instructions or take unintended shortcuts.
Really excited about this work! If we can train models to robustly self-report non-compliance, it could go a long way towards addressing scheming and related risks.
This was a fun project (that I jumped into halfway through) with @ManasJoglekar, Jeremy Chen, @GabrielDWu1, @j_asminewang, @boazbaraktcs, and @mia_glaese. Boaz wrote a good casual summary: https://t.co/IqWSkhCYQD
Interested in seeing how far we can push this line of research! I think that one day, confessions could serve a similarly important role as CoT monitoring for detecting scheming and other model misbehavior.
🕵️♂️ Beautiful new study by OpenAI. It teaches models to confess when they cheat or hallucinate. This project adds an honesty head that reports whether the answer misbehaved. A second channel tells operators whether the answer broke rules. Instead of hiding reward hacking, the model
At @OpenAI, we believe that AI can accelerate science and drug discovery. An exciting example is our work with @RetroBiosciences, where a custom model designed improved variants of the Nobel-prize winning Yamanaka proteins. Today we published a closer look at the breakthrough. ⬇️
Just got back from @RaiseSummit in Paris—where AI's top names like Groq, Cerebras, Eric Schmidt, Lovable, & Windsurf all took the stage. Yet one unexpected player stood out as the next breakout AGI leader... Stats: - $300M+ in Rev (profitable!) - $225M in funding, $2.2B valuation
Meta’s $15B investment in ScaleAI highlights the importance of data partnerships for advancing AGI. At Turing, we remain fully neutral, serving all frontier models equally. Excited to continue to serve as a trusted research accelerator to all AI labs in need of high quality
It's deeply concerning that one of the best AI researchers I've worked with, @kaicathyc, was denied a U.S. green card today. A Canadian who's lived and contributed here for 12 years now has to leave. We’re risking America’s AI leadership when we turn away talent like this.
We've raised a $64M Series A led by @kleinerperkins to build the platform for real-time voice AI. We'll use this funding to expand our team, and to build the next generation of models, infrastructure, and products for voice, starting with Sonic 2.0, available today. Link below