Senthooran Rajamanoharan

@sen_r

309 Followers · 12 Following · 6 Media · 28 Statuses

London, UK
Joined April 2010
@sen_r
Senthooran Rajamanoharan
1 month
RT @NeelNanda5: GDM interp work: Do LLMs have self-preservation? Concerning recent work: models may block shutdown if it interferes with t…
@sen_r
Senthooran Rajamanoharan
2 months
RT @emmons_scott: Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When w…
@sen_r
Senthooran Rajamanoharan
2 months
RT @IvanArcus: 🧵 NEW: We updated our research on unfaithful AI reasoning! We have a stronger dataset which yields lower rates of unfaithfu…
@sen_r
Senthooran Rajamanoharan
2 months
RT @EdTurner42: 1/8: The Emergent Misalignment paper showed LLMs trained on insecure code then want to enslave humanity?! We're releasi…
@sen_r
Senthooran Rajamanoharan
6 months
RT @ArthurConmy: In our new paper, we show that Chain-of-Thought reasoning is not always faithful in frontier thinking models! We show this…
@sen_r
Senthooran Rajamanoharan
6 months
RT @JoshAEngels: 1/14: If sparse autoencoders work, they should give us interpretable classifiers that help with probing in difficult regim…
@sen_r
Senthooran Rajamanoharan
7 months
RT @javifer_96: New ICLR 2025 (Oral) paper 🚨 Do LLMs know what they don’t know? We observed internal mechanisms suggesting models recognize…
@sen_r
Senthooran Rajamanoharan
1 year
Check out this fantastic interactive demo by @neuronpedia to see what interesting features you find using Gemma Scope!
Link: Exploring the Inner Workings of Gemma 2 2B (neuronpedia.org)
@sen_r
Senthooran Rajamanoharan
1 year
Today we're releasing Gemma Scope: hundreds of SAEs trained on every layer of Gemma 2 2B and 9B and select layers of 27B! Really excited to see how these SAEs help further research into how language models operate.
@NeelNanda5
Neel Nanda
1 year
Sparse Autoencoders act like a microscope for AI internals. They're a powerful tool for interpretability, but training costs limit research. Announcing Gemma Scope: An open suite of SAEs on every layer & sublayer of Gemma 2 2B & 9B! We hope to enable even more ambitious work
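For context on what these released SAEs compute: a sparse autoencoder reconstructs a model's activation vector from a wide, sparsely firing set of latent features. Below is a minimal PyTorch sketch of the idea; the names, dimensions and loss weighting are illustrative assumptions, not the Gemma Scope training recipe.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    # Minimal SAE sketch: encode an activation vector into many sparsely
    # firing features, then reconstruct the activation from those features.
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        f = F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # sparse feature activations
        x_hat = f @ self.W_dec + self.b_dec                     # reconstructed activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()

In this framing, a release like Gemma Scope supplies the already-trained encoder and decoder weights for each layer, so researchers can study the features without paying the training cost themselves.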
@sen_r
Senthooran Rajamanoharan
1 year
RT @NeelNanda5: Great to see my team's JumpReLU Sparse Autoencoder paper covered in VentureBeat! Fantastic work from @sen_r.
@sen_r
Senthooran Rajamanoharan
1 year
You can apply now for early access!
@sen_r
Senthooran Rajamanoharan
1 year
Happy to see our new paper on JumpReLU SAEs featured on Daily Papers from @huggingface - and looking forward to releasing hundreds of open SAEs trained this way on Gemma 2 soon!
@_akhaliq
AK
1 year
Google presents Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be…
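A hedged sketch of the activation function named in the title: instead of a plain ReLU, each feature's pre-activation passes through unchanged only if it clears a learned per-feature threshold, and is zeroed otherwise. The straight-through training of the (discontinuous) threshold described in the paper is not shown here, and the names are illustrative.

import torch

def jumprelu(pre_acts: torch.Tensor, log_threshold: torch.Tensor) -> torch.Tensor:
    # JumpReLU(z) = z if z > theta else 0, with a learnable per-feature
    # threshold theta > 0 (parameterised via its log to keep it positive).
    theta = log_threshold.exp()
    return pre_acts * (pre_acts > theta)

# Example: only the 0.40 entry clears its 0.10 threshold; the others are zeroed.
z = torch.tensor([[0.05, 0.40, -0.20]])
print(jumprelu(z, torch.log(torch.tensor([0.10, 0.10, 0.10]))))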
@sen_r
Senthooran Rajamanoharan
1 year
RT @NeelNanda5: New GDM mech interp paper led by @sen_r: JumpReLU SAEs, a new SOTA SAE method! We replace standard ReLUs with discontinuous…
@sen_r
Senthooran Rajamanoharan
1 year
RT @AnthropicAI: New Anthropic research paper: Scaling Monosemanticity. The first ever detailed look inside a leading large language model…
@sen_r
Senthooran Rajamanoharan
1 year
RT @NeelNanda5: Announcing the first Mechanistic Interpretability workshop, held at ICML 2024! We have a fantastic speaker line-up @ch402 @…
@sen_r
Senthooran Rajamanoharan
1 year
Because we're tying the weights, it turns out Gated SAEs are equivalent to replacing ReLUs with discontinuous Jump ReLUs. With a suitable loss function we can get this to train well. We provide detailed training pseudo-code and explain why Jump ReLUs may be better in the appendix.
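A sketch of the equivalence claimed here, assuming the row-wise weight tying from the paper (the magnitude encoder equals the gate encoder with each row rescaled by exp(r_mag)); variable names are illustrative, not the reference code. With that tying, gating a feature amounts to thresholding its magnitude pre-activation, which is exactly a JumpReLU whenever the implied threshold is non-negative.

import torch

def tied_gated_encoder(x, W_gate, b_gate, r_mag, b_mag, b_dec):
    # Gated encoder with tied weights: W_mag = exp(r_mag)[:, None] * W_gate.
    centred = x - b_dec
    pi_gate = centred @ W_gate.T + b_gate                         # which features fire?
    pi_mag = centred @ (r_mag.exp()[:, None] * W_gate).T + b_mag  # how strongly?

    gated = (pi_gate > 0) * torch.relu(pi_mag)

    # The same computation written as a threshold on pi_mag: a JumpReLU-style
    # gate with per-feature threshold theta = b_mag - exp(r_mag) * b_gate.
    # The two forms coincide up to floating-point edge cases at the threshold.
    theta = b_mag - r_mag.exp() * b_gate
    via_jump = (pi_mag > theta) * torch.relu(pi_mag)

    return gated, via_jump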
@sen_r
Senthooran Rajamanoharan
1 year
But reconstruction isn't everything: our real goal is for SAEs to be a tool to find *interpretable* features! We do a double-blind human study and find that they are comparably interpretable and possibly better; the CI for the difference is 0-13 percentage points (N=342).
@sen_r
Senthooran Rajamanoharan
1 year
To show this works at scale, we train a range of SAEs (both normal and Gated) up to Gemma 7B, on a range of layers and on attention, MLP and residual activations, in the process showing it's practical to scale SAEs to 7B. Gated SAEs are consistently a Pareto improvement.
@sen_r
Senthooran Rajamanoharan
1 year
Solution: Gated SAEs have two encoders, one to find which features are active, the other to estimate active features' magnitudes. The L1 penalty only applies to the first. This still works if you tie most of the weights of the two encoders, making this cheap to run.
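A minimal sketch of the split described in this tweet, shown with untied weights for readability (the tweet notes that tying most of the weights keeps this cheap). Names and the loss weighting are illustrative assumptions, and the auxiliary reconstruction term used in the paper's full training loss is omitted.

import torch
import torch.nn.functional as F

def gated_sae_features(x, W_gate, b_gate, W_mag, b_mag, b_dec):
    # Encoder 1 (gate): decides which features are active.
    # Encoder 2 (magnitude): estimates how strong each active feature is.
    centred = x - b_dec
    pi_gate = centred @ W_gate.T + b_gate
    pi_mag = centred @ W_mag.T + b_mag
    features = (pi_gate > 0) * torch.relu(pi_mag)
    return features, pi_gate

def gated_sae_loss(x, x_hat, pi_gate, l1_coeff: float = 1e-3):
    # The sparsity penalty touches only the gate path, so magnitude estimates
    # are trained purely to reconstruct x and are not shrunk by the L1 term.
    sparsity = torch.relu(pi_gate).sum(dim=-1).mean()
    return F.mse_loss(x_hat, x) + l1_coeff * sparsity

Decoding works as in a standard SAE (features @ W_dec + b_dec); as I read it, the point of the split is that the L1 penalty no longer biases the reconstruction magnitudes downward, since it never touches the magnitude path.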