Can Rager Profile
Can Rager

@can_rager

Followers
251
Following
30
Media
14
Statuses
56

AI Explainability | Physics

Joined September 2023
Don't wanna be here? Send us removal request.
@can_rager
Can Rager
2 months
Finally a workshop for the Mech Interp community!.
@kpal_koyena
Koyena Pal
2 months
🚨 Registration is live! 🚨. The New England Mechanistic Interpretability (NEMI) Workshop is happening August 22nd 2025 at Northeastern University!. A chance for the mech interp community to nerd out on how models really work 🧠🤖. 🌐 Info: 📝 Register:
Tweet media one
0
0
2
@grok
Grok
6 days
What do you want to know?.
483
310
2K
@can_rager
Can Rager
2 months
Thanks to @wendlerch, @rohitgandikota and @davidbau for their strong support in writing this paper and @EugenHotaj , @a_karvonen, @saprmarks, @OwainEvans_UK, Jason Vega, @ericwtodd, @StephenLCasper and @byron_c_wallace for valuable feedback!.
1
0
7
@can_rager
Can Rager
2 months
We compare the refused topics of popular LLMs. While all largely agree on safety-related domains, their behavior starkly differs in the political domain.
Tweet media one
1
0
4
@can_rager
Can Rager
2 months
As LLMs grow more complex, we can't anticipate all possible failure modes. We need unsupervised misalignment discovery methods! @saprmarks et al. call this 'alignment auditing'. The Iterated Prefill Crawler is one technique in this new field.
@AnthropicAI
Anthropic
6 months
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
Tweet media one
1
0
7
@can_rager
Can Rager
2 months
@perplexity_ai unknowingly published a CCP-aligned version of their flagship R1-1776-671B model. Though decensored in internal tests, quantization reintroduced censorship. The issue is fixed now, but shows why alignment auditing is necessary across the deployment pipeline.
1
0
3
@can_rager
Can Rager
2 months
@perplexity_ai claimed that they removed CCP-aligned censorship in their finetuned “1776” version of R1. Did they succeed? . Yes, but it’s fragile! The bf-16 version provides objective answers on CCP-sensitive topics, but in the fp-8 quantized version, the censorship returns.
Tweet media one
1
0
3
@can_rager
Can Rager
2 months
Our method, the Iterated Prefill Crawler, discovers refused topics with repeated prefilling attacks. Previously obtained topics are seeds for subsequent attacks.
Tweet media one
1
0
3
@can_rager
Can Rager
2 months
How does it work? We force the first few tokens of an LLM assistant's thought (or answer), analogous to Vega et al.'s prefilling attacks. This method reveals knowledge that DeepSeek-R1 refuses to discuss.
Tweet media one
1
0
6
@can_rager
Can Rager
2 months
Can we uncover the list of topics a language model is censored on?. Refused topics vary strongly among models. Claude-3.5 vs DeepSeek-R1 refusal patterns:
Tweet media one
2
22
85
@can_rager
Can Rager
3 months
Exciting mech interp on toy reasoning models!.
@a_jy_l
Andrew Lee
3 months
🚨New preprint!. How do reasoning models verify their own CoT?.We reverse-engineer LMs and find critical components and subspaces needed for self-verification!. 1/n
Tweet media one
0
0
5
@can_rager
Can Rager
5 months
See the.
0
0
0
@can_rager
Can Rager
5 months
ARBOR is an open research community focused on reasoning models. Here, everyone can propose new projects or join existing ones.
1
0
0
@can_rager
Can Rager
5 months
I'm exited for OAI's open-weight model!. Curious how reasoning models work under the hood? Refining good experiments takes time. Get involved in ARBOR now, to be the first to publish when the new model drops.
@sama
Sam Altman
5 months
TL;DR: we are excited to release a powerful new open-weight language model with reasoning in the coming months, and we want to talk to devs about how to make it maximally useful: we are excited to make this a very, very good model!. __. we are planning to.
1
0
5
@can_rager
Can Rager
5 months
RT @AlexLoftus19: This is a good place to mention - we hosted NEMI last year in New England, a regional interpretability meetup, with 100+….
0
4
0
@can_rager
Can Rager
5 months
RT @davidbau: Why is interpretability the key to dominance in AI?. Not winning the scaling race, or banning China. Our answer to OSTP/NSF,….
0
69
0
@can_rager
Can Rager
5 months
RT @a_karvonen: We're excited to announce the release of SAE Bench 1.0, our suite of Sparse Autoencoder (SAE) evaluations! We have also tra….
0
35
0
@can_rager
Can Rager
6 months
RT @saprmarks: New paper with @j_treutlein , @EvanHub , and many other coauthors!. We train a model with a hidden misaligned objective and….
0
18
0
@can_rager
Can Rager
6 months
Edit: The event is called Chaos Communication Congress.
0
0
0