
AI Safety Papers (@safe_paper)
2K Followers · 64 Following · 59 Media · 280 Statuses
Sharing the latest in AI safety and interpretability research.
Joined May 2023
Persona Features Control Emergent Misalignment. Miles Wang (@MilesKWang), Tom Dupré la Tour (@tomdlt10), Olivia Watkins, Alex Makelov (@AMakelov), Ryan A. Chi (@ryanandrewchi), Samuel Miserendino (@samuelp1002), Johannes Heidecke (@JoHeidecke), Tejal Patwardhan
1 · 0 · 6
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. Joel Becker (@joel_bkr), Nate Rush, Beth Barnes (@BethMayBarnes), David Rein (@idavidrein). @METR_Evals
2 · 0 · 19
Why Do Some Language Models Fake Alignment While Others Don't? Abhay Sheshadri (@abhayesian), John Hughes (@jplhughes), Julian Michael (@_julianmichael_), Alex Mallen (@alextmallen), Arun Jose (@jozdien), Janus (@repligate), Fabien Roger (@FabienDRoger)
3 · 12 · 100
Attestable Audits: Verifiable AI Safety Benchmarks Using Trusted Execution Environments. Christoph Schnabl (@inxoy_), Daniel Hugenroth (@lambdapioneer), Bill Marino, Alastair R. Beresford (@arberesford)
2 · 2 · 30
Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models. James Chua (@jameschua_sg), Jan Betley (@BetleyJan), Mia Taylor, Owain Evans (@OwainEvans_UK)
1 · 2 · 19
Because we have LLMs, we Can and Should Pursue Agentic Interpretability. Been Kim (@_beenkim), John Hewitt (@johnhewtt), Neel Nanda (@NeelNanda5), Noah Fiedel, Oyvind Tafjord
1 · 3 · 22
Distributional Training Data Attribution. Bruno Mlodozeniec (@kayembruno), Isaac Reid (@isaac_o_reid), Sam Power (@sp_monte_carlo), David Krueger (@DavidSKrueger), Murat Erdogdu (@MuratAErdogdu), Richard E. Turner, @RogerGrosse
1 · 2 · 12
Unsupervised Elicitation of Language Models. Jiaxin Wen (@jiaxinwen22), Zachary Ankner (@ZackAnkner), Arushi Somani (@amksomani), Peter Hase (@peterbhase), Samuel Marks (@saprmarks), Jacob Goldman-Wetzler, Linda Petrini (@petrini_linda), Henry Sleight (@sleight_henry), Collin
1 · 2 · 6
Avoiding Obfuscation with Prover-Estimator Debate. Jonah Brown-Cohen, @geoffreyirving, Georgios Piliouras. @GoogleDeepMind @AISecurityInst
1 · 1 · 11
Large Language Models Often Know When They Are Being Evaluated. Joe Needham, Giles Edkins (@gdedkins), Govind Pimpale (@GovindPimpale), Henning Bartsch, @MariusHobbhahn. @apolloaievals @matsprogram
3 · 23 · 173