
AI Safety Papers (@safe_paper)
2K Followers · 64 Following · 59 Media · 280 Statuses
Sharing the latest in AI safety and interpretability research.
Joined May 2023
Persona Features Control Emergent Misalignment. Miles Wang (@MilesKWang), Tom Dupré la Tour (@tomdlt10), Olivia Watkins, Alex Makelov (@AMakelov), Ryan A. Chi (@ryanandrewchi), Samuel Miserendino (@samuelp1002), Johannes Heidecke (@JoHeidecke), Tejal Patwardhan
1 · 0 · 6
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. Joel Becker (@joel_bkr), Nate Rush, Beth Barnes (@BethMayBarnes), David Rein (@idavidrein). @METR_Evals
2 · 0 · 19
Why Do Some Language Models Fake Alignment While Others Don't? Abhay Sheshadri (@abhayesian), John Hughes (@jplhughes), Julian Michael (@_julianmichael_), Alex Mallen (@alextmallen), Arun Jose (@jozdien), Janus (@repligate), Fabien Roger (@FabienDRoger)
3 · 12 · 100
Attestable Audits: Verifiable AI Safety Benchmarks Using Trusted Execution Environments. Christoph Schnabl (@inxoy_), Daniel Hugenroth (@lambdapioneer), Bill Marino, Alastair R. Beresford (@arberesford)
2 · 2 · 30
Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models. James Chua (@jameschua_sg), Jan Betley (@BetleyJan), Mia Taylor, Owain Evans (@OwainEvans_UK)
1 · 2 · 19
Because we have LLMs, we Can and Should Pursue Agentic Interpretability. Been Kim (@_beenkim), John Hewitt (@johnhewtt), Neel Nanda (@NeelNanda5), Noah Fiedel, Oyvind Tafjord
1 · 3 · 22
Distributional Training Data Attribution. Bruno Mlodozeniec (@kayembruno), Isaac Reid (@isaac_o_reid), Sam Power (@sp_monte_carlo), David Krueger (@DavidSKrueger), Murat Erdogdu (@MuratAErdogdu), Richard E. Turner, @RogerGrosse
1 · 2 · 12
Unsupervised Elicitation of Language Models. Jiaxin Wen (@jiaxinwen22), Zachary Ankner (@ZackAnkner), Arushi Somani (@amksomani), Peter Hase (@peterbhase), Samuel Marks (@saprmarks), Jacob Goldman-Wetzler, Linda Petrini (@petrini_linda), Henry Sleight (@sleight_henry), Collin
1 · 2 · 6
Avoiding Obfuscation with Prover-Estimator Debate. Jonah Brown-Cohen, @geoffreyirving, Georgios Piliouras. @GoogleDeepMind @AISecurityInst
1 · 1 · 11
Large Language Models Often Know When They Are Being Evaluated. Joe Needham, Giles Edkins (@gdedkins), Govind Pimpale (@GovindPimpale), Henning Bartsch, @MariusHobbhahn. @apolloaievals @matsprogram
3 · 23 · 173