Can Rager
@can_rager
Followers
253
Following
33
Media
14
Statuses
57
Exciting work on discovering secrets of LLMs! Prefill attacks remain the go-to black-box method for secret elicitation.
Can we catch an AI hiding information from us? To find out, we trained LLMs to keep secrets: things they know but refuse to say. Then we tested black-box & white-box interp methods for uncovering them and many worked! We release our models so you can test your own techniques too!
0
0
5
Finally a workshop for the Mech Interp community!
🚨 Registration is live! 🚨 The New England Mechanistic Interpretability (NEMI) Workshop is happening August 22nd 2025 at Northeastern University! A chance for the mech interp community to nerd out on how models really work 🧠🤖 🌐 Info: https://t.co/mXjaMM12iv 📝 Register:
0
0
2
Thanks to @wendlerch, @rohitgandikota and @davidbau for their strong support in writing this paper and @EugenHotaj , @a_karvonen, @saprmarks, @OwainEvans_UK, Jason Vega, @ericwtodd, @StephenLCasper and @byron_c_wallace for valuable feedback!
1
0
7
We compare the refused topics of popular LLMs. While all largely agree on safety-related domains, their behavior starkly differs in the political domain.
1
0
4
As LLMs grow more complex, we can't anticipate all possible failure modes. We need unsupervised misalignment discovery methods! @saprmarks et al. call this 'alignment auditing'. The Iterated Prefill Crawler is one technique in this new field. https://t.co/vH6IL9UXJk
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
1
0
7
@perplexity_ai unknowingly published a CCP-aligned version of their flagship R1-1776-671B model. Though decensored in internal tests, quantization reintroduced censorship. The issue is fixed now, but shows why alignment auditing is necessary across the deployment pipeline.
1
0
3
@perplexity_ai claimed that they removed CCP-aligned censorship in their finetuned “1776” version of R1. Did they succeed? Yes, but it’s fragile! The bf-16 version provides objective answers on CCP-sensitive topics, but in the fp-8 quantized version, the censorship returns.
1
0
3
Our method, the Iterated Prefill Crawler, discovers refused topics with repeated prefilling attacks. Previously obtained topics are seeds for subsequent attacks.
1
0
3
How does it work? We force the first few tokens of an LLM assistant's thought (or answer), analogous to Vega et al.'s prefilling attacks. This method reveals knowledge that DeepSeek-R1 refuses to discuss.
1
0
6
Can we uncover the list of topics a language model is censored on? Refused topics vary strongly among models. Claude-3.5 vs DeepSeek-R1 refusal patterns:
2
22
86
Join our monthly research meeting next Monday, April 7. More details on discord: https://t.co/Cd5aQDDlMQ
discord.com
Discord is great for playing games and chilling with friends, or even building a worldwide community. Customize your own space to talk, play, and hang out.
1
0
0
ARBOR is an open research community focused on reasoning models. Here, everyone can propose new projects or join existing ones. https://t.co/viPQulf7dv
1
0
0
I'm exited for OAI's open-weight model! Curious how reasoning models work under the hood? Refining good experiments takes time. Get involved in ARBOR now, to be the first to publish when the new model drops https://t.co/2tHDmCbUw1
TL;DR: we are excited to release a powerful new open-weight language model with reasoning in the coming months, and we want to talk to devs about how to make it maximally useful: https://t.co/XKB4XxjREV we are excited to make this a very, very good model! __ we are planning to
1
0
5
This is a good place to mention - we hosted NEMI last year in New England, a regional interpretability meetup, with 100+ attendees. People outside the region welcome to join. We will likely do it again this year!
I'm curious - if you were hoping to go to the ICML mechanistic interpretability workshop before it got cancelled, would you go to a standalone mechanistic interpretability workshop/conference?
2
4
22
Why is interpretability the key to dominance in AI? Not winning the scaling race, or banning China. Our answer to OSTP/NSF, w/ Goodfire's @banburismus_ Transluce's @cogconfluence MIT's @dhadfieldmenell
https://t.co/B4sXLDSTaK Here's why:🧵 ↘️
1
71
310
We're excited to announce the release of SAE Bench 1.0, our suite of Sparse Autoencoder (SAE) evaluations! We have also trained / evaluated a suite of open-source SAEs across 7 architectures. This has led to exciting new qualitative findings! Our findings in the 🧵 below 👇
4
35
193
New paper with @j_treutlein , @EvanHub , and many other coauthors! We train a model with a hidden misaligned objective and use it to run an auditing game: Can other teams of researchers uncover the model’s objective?
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
6
19
143