Can Rager @can_rager X Profile

Can Rager

@can_rager

Followers

253

Following

33

Media

14

Statuses

57

AI Explainability | Physics

Joined September 2023

Don't wanna be here? Send us removal request.

Can Rager

@can_rager

1 month

Exciting work on discovering secrets of LLMs! Prefill attacks remain the go-to black-box method for secret elicitation.

Bartosz Cywiński

@bartoszcyw

1 month

Can we catch an AI hiding information from us? To find out, we trained LLMs to keep secrets: things they know but refuse to say. Then we tested black-box & white-box interp methods for uncovering them and many worked! We release our models so you can test your own techniques too!

0

5

Can Rager

@can_rager

4 months

Finally a workshop for the Mech Interp community!

Koyena Pal

@kpal_koyena

4 months

🚨 Registration is live! 🚨 The New England Mechanistic Interpretability (NEMI) Workshop is happening August 22nd 2025 at Northeastern University! A chance for the mech interp community to nerd out on how models really work 🧠🤖 🌐 Info: https://t.co/mXjaMM12iv 📝 Register:

0

2

Can Rager

@can_rager

5 months

Arxiv: https://t.co/gjgjHfLZOj Find out more on https://t.co/IFl6asiLNv!

forbidden.baulab.info

A black-box evaluation technique for characterizing refusal behavior of language models using Iterated Prefill Crawler.

0

9

Can Rager

@can_rager

5 months

Thanks to @wendlerch, @rohitgandikota and @davidbau for their strong support in writing this paper and @EugenHotaj , @a_karvonen, @saprmarks, @OwainEvans_UK, Jason Vega, @ericwtodd, @StephenLCasper and @byron_c_wallace for valuable feedback!

1

0

7

Can Rager

@can_rager

5 months

We compare the refused topics of popular LLMs. While all largely agree on safety-related domains, their behavior starkly differs in the political domain.

1

0

4

Can Rager

@can_rager

5 months

As LLMs grow more complex, we can't anticipate all possible failure modes. We need unsupervised misalignment discovery methods! @saprmarks et al. call this 'alignment auditing'. The Iterated Prefill Crawler is one technique in this new field. https://t.co/vH6IL9UXJk

Anthropic

@AnthropicAI

8 months

New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?

1

0

7

Can Rager

@can_rager

5 months

@perplexity_ai unknowingly published a CCP-aligned version of their flagship R1-1776-671B model. Though decensored in internal tests, quantization reintroduced censorship. The issue is fixed now, but shows why alignment auditing is necessary across the deployment pipeline.

1

0

3

Can Rager

@can_rager

5 months

@perplexity_ai claimed that they removed CCP-aligned censorship in their finetuned “1776” version of R1. Did they succeed? Yes, but it’s fragile! The bf-16 version provides objective answers on CCP-sensitive topics, but in the fp-8 quantized version, the censorship returns.

1

0

3

Can Rager

@can_rager

5 months

Our method, the Iterated Prefill Crawler, discovers refused topics with repeated prefilling attacks. Previously obtained topics are seeds for subsequent attacks.

1

0

3

Can Rager

@can_rager

5 months

How does it work? We force the first few tokens of an LLM assistant's thought (or answer), analogous to Vega et al.'s prefilling attacks. This method reveals knowledge that DeepSeek-R1 refuses to discuss.

1

0

6

Can Rager

@can_rager

5 months

Can we uncover the list of topics a language model is censored on? Refused topics vary strongly among models. Claude-3.5 vs DeepSeek-R1 refusal patterns:

2

22

86

Can Rager

@can_rager

6 months

Exciting mech interp on toy reasoning models!

Andrew Lee

@a_jy_l

6 months

🚨New preprint! How do reasoning models verify their own CoT? We reverse-engineer LMs and find critical components and subspaces needed for self-verification! 1/n

0

5

Can Rager

@can_rager

7 months

See the

0

Can Rager

@can_rager

7 months

Join our monthly research meeting next Monday, April 7. More details on discord: https://t.co/Cd5aQDDlMQ

discord.com

Discord is great for playing games and chilling with friends, or even building a worldwide community. Customize your own space to talk, play, and hang out.

1

0

Can Rager

@can_rager

7 months

ARBOR is an open research community focused on reasoning models. Here, everyone can propose new projects or join existing ones. https://t.co/viPQulf7dv

1

0

Can Rager

@can_rager

7 months

I'm exited for OAI's open-weight model! Curious how reasoning models work under the hood? Refining good experiments takes time. Get involved in ARBOR now, to be the first to publish when the new model drops https://t.co/2tHDmCbUw1

Sam Altman

@sama

7 months

TL;DR: we are excited to release a powerful new open-weight language model with reasoning in the coming months, and we want to talk to devs about how to make it maximally useful: https://t.co/XKB4XxjREV we are excited to make this a very, very good model! __ we are planning to

1

0

5

Alex Loftus

@AlexLoftus19

8 months

This is a good place to mention - we hosted NEMI last year in New England, a regional interpretability meetup, with 100+ attendees. People outside the region welcome to join. We will likely do it again this year!

Neel Nanda

@NeelNanda5

8 months

I'm curious - if you were hoping to go to the ICML mechanistic interpretability workshop before it got cancelled, would you go to a standalone mechanistic interpretability workshop/conference?

2

4

22

David Bau

@davidbau

8 months

Why is interpretability the key to dominance in AI? Not winning the scaling race, or banning China. Our answer to OSTP/NSF, w/ Goodfire's @banburismus_ Transluce's @cogconfluence MIT's @dhadfieldmenell https://t.co/B4sXLDSTaK Here's why:🧵 ↘️

1

71

310

Adam Karvonen

@a_karvonen

8 months

We're excited to announce the release of SAE Bench 1.0, our suite of Sparse Autoencoder (SAE) evaluations! We have also trained / evaluated a suite of open-source SAEs across 7 architectures. This has led to exciting new qualitative findings! Our findings in the 🧵 below 👇

4

35

193

Samuel Marks

@saprmarks

8 months

New paper with @j_treutlein , @EvanHub , and many other coauthors! We train a model with a hidden misaligned objective and use it to run an auditing game: Can other teams of researchers uncover the model’s objective?

Anthropic

@AnthropicAI

8 months

New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?

6

19

143