
Can Rager
@can_rager
251 Followers · 30 Following · 14 Media · 56 Statuses
Finally, a workshop for the Mech Interp community!
🚨 Registration is live! 🚨 The New England Mechanistic Interpretability (NEMI) Workshop is happening August 22nd, 2025 at Northeastern University! A chance for the mech interp community to nerd out on how models really work 🧠🤖. 🌐 Info: 📝 Register:
Thanks to @wendlerch, @rohitgandikota, and @davidbau for their strong support in writing this paper, and to @EugenHotaj, @a_karvonen, @saprmarks, @OwainEvans_UK, Jason Vega, @ericwtodd, @StephenLCasper, and @byron_c_wallace for valuable feedback!
As LLMs grow more complex, we can't anticipate all possible failure modes. We need unsupervised misalignment discovery methods! @saprmarks et al. call this 'alignment auditing'. The Iterated Prefill Crawler is one technique in this new field.
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
@perplexity_ai unknowingly published a CCP-aligned version of their flagship R1-1776-671B model. The model was decensored in internal tests, but quantization reintroduced the censorship. The issue is now fixed, but it shows why alignment auditing is necessary across the deployment pipeline.
@perplexity_ai claimed that they removed CCP-aligned censorship in their finetuned "1776" version of R1. Did they succeed? Yes, but it's fragile! The bf16 version provides objective answers on CCP-sensitive topics, but in the fp8-quantized version the censorship returns.
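The mechanism behind this fragility can be sketched in a few lines: fine-tuning typically makes small weight updates, and round-to-nearest quantization can snap many updated weights back to the same grid point as the base model's weights, partially undoing the fine-tune. This is a toy illustration with synthetic numbers, not Perplexity's actual pipeline; the `quantize_ints` helper, the 8-bit integer grid, and all magnitudes are assumptions (real fp8 quantization differs in detail, but the rounding intuition is the same).

```python
import numpy as np

def quantize_ints(x, scale, bits=8):
    # Symmetric round-to-nearest quantization to signed integers.
    return np.round(x / scale).astype(np.int32)

rng = np.random.default_rng(0)
base = rng.normal(size=10_000).astype(np.float32)          # stand-in for base-model weights
delta = 1e-4 * rng.normal(size=10_000).astype(np.float32)  # small fine-tuning update
tuned = base + delta

# Shared quantization grid (8-bit, symmetric). Because the fine-tuning
# deltas are much smaller than one quantization step, most tuned weights
# round back to the exact same integer as the base weights.
scale = np.max(np.abs(base)) / 127
same = np.mean(quantize_ints(tuned, scale) == quantize_ints(base, scale))
print(f"fraction of fine-tuned weights whose update is erased by 8-bit rounding: {same:.1%}")
```

When most of the update rounds away, the quantized model behaves more like the base model than like the fine-tune, which is one plausible way "decensored in bf16, censored in fp8" can happen.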
Join our monthly research meeting next Monday, April 7. More details on Discord:
I'm excited for OAI's open-weight model! Curious how reasoning models work under the hood? Refining good experiments takes time. Get involved in ARBOR now to be the first to publish when the new model drops.
TL;DR: we are excited to release a powerful new open-weight language model with reasoning in the coming months, and we want to talk to devs about how to make it maximally useful: we are excited to make this a very, very good model! __ we are planning to…
RT @AlexLoftus19: This is a good place to mention - we hosted NEMI last year in New England, a regional interpretability meetup, with 100+…
RT @davidbau: Why is interpretability the key to dominance in AI? Not winning the scaling race, or banning China. Our answer to OSTP/NSF,…
RT @a_karvonen: We're excited to announce the release of SAE Bench 1.0, our suite of Sparse Autoencoder (SAE) evaluations! We have also tra…
RT @saprmarks: New paper with @j_treutlein, @EvanHub, and many other coauthors! We train a model with a hidden misaligned objective and…