AI Safety Papers
@safe_paper
Followers: 2K
Following: 92
Media: 75
Statuses: 312
Sharing the latest in AI safety research.
arXiv
Joined May 2023
Quantifying Elicitation of Latent Capabilities in Language Models Elizabeth Donoway, @HaileyJoren, Arushi Somani, Henry Sleight (@sleight_henry), @_julianmichael_, Michael R DeWeese, John Schulman (@johnschulman2), @EthanJPerez, @FabienDRoger, @janleike @AnthropicAI
Remote Labor Index: Measuring AI Automation of Remote Work Mantas Mazeika (@MantasMazeika96), Alice Gatti, Cristina Menghini (@CriMenghini), Udari Madhushani Sehwag, Shivam Singhal (@ShivamSinghal56), Yury Orlovskiy (@yvorlovskiy), [...], Summer Yue (@summeryue0),
LLMs Process Lists With General Filter Heads Arnab Sen Sharma (@arnab_api), Giordano Rogers, Natalie Shapira (@NatalieShapira), David Bau (@davidbau)
International AI Safety Report 2025: First Key Update: Capabilities and Risk Implications Yoshua Bengio (@Yoshua_Bengio), Stephen Clare (@stephenclare_), Carina Prunkl (@carinaprunkl), Shalaleh Rismani, Maksym Andriushchenko, Ben Bucknall, Philip Fox, Tiancheng Hu, Cameron
Agentic Misalignment: How LLMs Could Be Insider Threats Aengus Lynch (@aengus_lynch1), Benjamin Wright (@RightBenguin), Caleb Larson, @StuartJRitchie, Soren Mindermann (@sorenmind), Evan Hubinger (@EvanHub), @EthanJPerez, Kevin Troy @AnthropicAI
A Definition of AGI @DanHendrycks, Dawn Song (@dawnsongtweets), Christian Szegedy (@ChrSzegedy), @honglaklee, @yaringal, Erik Brynjolfsson (@erikbryn), Sharon Li (@SharonYixuanLi), Andy Zou (@andyzou_jiaming), @lionellevine, Bo Han, Jie Fu (@bigaidream), Ziwei Liu (@liuziwei7),
Survey on thresholds for advanced AI systems Jonas Freund (@jonasfreund_), Eunseo Choi, Kasumi Sugimoto, Bosco Hung (@boscohung_), Robert Trager (@RobertTrager), Karine Perset (@karinov) @GovAI_ @OECD
Eliciting Secret Knowledge from Language Models Bartosz Cywiński (@bartoszcyw), Emil Ryd (@emilaryd), Rowan Wang, Senthooran Rajamanoharan (@sen_r), Neel Nanda (@NeelNanda5), @ArthurConmy, Samuel Marks (@saprmarks) @MATSprogram
D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models Satyapriya Krishna (@SatyaScribbles), Andy Zou (@andyzou_jiaming), Rahul Gupta (@rahul1987iit), Eliot Krzysztof Jones (@eliotkjones), Nick Winter (@nwinter), @DanHendrycks, J. Zico Kolter
Stress Testing Deliberative Alignment for Anti-Scheming Training Bronson Schoen, Evgenia Nitishinskaya (@j_nitishinskaya), Mikita Balesni (@balesni), Axel Højmark (@AxelHojmark), Felix Hofstätter (@HofstatterFelix), @jeremy_scheurer, Alexander Meinke (@AlexMeinke), Jason Wolfe