Can Rager Profile
Can Rager

@can_rager

Followers
253
Following
33
Media
14
Statuses
57

AI Explainability | Physics

Joined September 2023
Don't wanna be here? Send us removal request.
@can_rager
Can Rager
1 month
Exciting work on discovering secrets of LLMs! Prefill attacks remain the go-to black-box method for secret elicitation.
@bartoszcyw
Bartosz Cywiński
1 month
Can we catch an AI hiding information from us? To find out, we trained LLMs to keep secrets: things they know but refuse to say. Then we tested black-box & white-box interp methods for uncovering them and many worked! We release our models so you can test your own techniques too!
0
0
5
@can_rager
Can Rager
4 months
Finally a workshop for the Mech Interp community!
@kpal_koyena
Koyena Pal
4 months
🚨 Registration is live! 🚨 The New England Mechanistic Interpretability (NEMI) Workshop is happening August 22nd 2025 at Northeastern University! A chance for the mech interp community to nerd out on how models really work 🧠🤖 🌐 Info: https://t.co/mXjaMM12iv 📝 Register:
0
0
2
@can_rager
Can Rager
5 months
Thanks to @wendlerch, @rohitgandikota and @davidbau for their strong support in writing this paper and @EugenHotaj , @a_karvonen, @saprmarks, @OwainEvans_UK, Jason Vega, @ericwtodd, @StephenLCasper and @byron_c_wallace for valuable feedback!
1
0
7
@can_rager
Can Rager
5 months
We compare the refused topics of popular LLMs. While all largely agree on safety-related domains, their behavior starkly differs in the political domain.
1
0
4
@can_rager
Can Rager
5 months
As LLMs grow more complex, we can't anticipate all possible failure modes. We need unsupervised misalignment discovery methods! @saprmarks et al. call this 'alignment auditing'. The Iterated Prefill Crawler is one technique in this new field. https://t.co/vH6IL9UXJk
@AnthropicAI
Anthropic
8 months
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
1
0
7
@can_rager
Can Rager
5 months
@perplexity_ai unknowingly published a CCP-aligned version of their flagship R1-1776-671B model. Though decensored in internal tests, quantization reintroduced censorship. The issue is fixed now, but shows why alignment auditing is necessary across the deployment pipeline.
1
0
3
@can_rager
Can Rager
5 months
@perplexity_ai claimed that they removed CCP-aligned censorship in their finetuned “1776” version of R1. Did they succeed? Yes, but it’s fragile! The bf-16 version provides objective answers on CCP-sensitive topics, but in the fp-8 quantized version, the censorship returns.
1
0
3
@can_rager
Can Rager
5 months
Our method, the Iterated Prefill Crawler, discovers refused topics with repeated prefilling attacks. Previously obtained topics are seeds for subsequent attacks.
1
0
3
@can_rager
Can Rager
5 months
How does it work? We force the first few tokens of an LLM assistant's thought (or answer), analogous to Vega et al.'s prefilling attacks. This method reveals knowledge that DeepSeek-R1 refuses to discuss.
1
0
6
@can_rager
Can Rager
5 months
Can we uncover the list of topics a language model is censored on? Refused topics vary strongly among models. Claude-3.5 vs DeepSeek-R1 refusal patterns:
2
22
86
@can_rager
Can Rager
6 months
Exciting mech interp on toy reasoning models!
@a_jy_l
Andrew Lee
6 months
🚨New preprint! How do reasoning models verify their own CoT? We reverse-engineer LMs and find critical components and subspaces needed for self-verification! 1/n
0
0
5
@can_rager
Can Rager
7 months
See the
0
0
0
@can_rager
Can Rager
7 months
ARBOR is an open research community focused on reasoning models. Here, everyone can propose new projects or join existing ones. https://t.co/viPQulf7dv
1
0
0
@can_rager
Can Rager
7 months
I'm exited for OAI's open-weight model! Curious how reasoning models work under the hood? Refining good experiments takes time. Get involved in ARBOR now, to be the first to publish when the new model drops https://t.co/2tHDmCbUw1
@sama
Sam Altman
7 months
TL;DR: we are excited to release a powerful new open-weight language model with reasoning in the coming months, and we want to talk to devs about how to make it maximally useful: https://t.co/XKB4XxjREV we are excited to make this a very, very good model! __ we are planning to
1
0
5
@AlexLoftus19
Alex Loftus
8 months
This is a good place to mention - we hosted NEMI last year in New England, a regional interpretability meetup, with 100+ attendees. People outside the region welcome to join. We will likely do it again this year!
@NeelNanda5
Neel Nanda
8 months
I'm curious - if you were hoping to go to the ICML mechanistic interpretability workshop before it got cancelled, would you go to a standalone mechanistic interpretability workshop/conference?
2
4
22
@davidbau
David Bau
8 months
Why is interpretability the key to dominance in AI? Not winning the scaling race, or banning China. Our answer to OSTP/NSF, w/ Goodfire's @banburismus_ Transluce's @cogconfluence MIT's @dhadfieldmenell https://t.co/B4sXLDSTaK Here's why:🧵 ↘️
1
71
310
@a_karvonen
Adam Karvonen
8 months
We're excited to announce the release of SAE Bench 1.0, our suite of Sparse Autoencoder (SAE) evaluations! We have also trained / evaluated a suite of open-source SAEs across 7 architectures. This has led to exciting new qualitative findings! Our findings in the 🧵 below 👇
4
35
193
@saprmarks
Samuel Marks
8 months
New paper with @j_treutlein , @EvanHub , and many other coauthors! We train a model with a hidden misaligned objective and use it to run an auditing game: Can other teams of researchers uncover the model’s objective?
@AnthropicAI
Anthropic
8 months
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
6
19
143