
Senthooran Rajamanoharan (@sen_r)
Followers: 309 · Following: 12 · Media: 6 · Statuses: 28
RT @NeelNanda5: GDM interp work: Do LLMs have self-preservation? Concerning recent work: models may block shutdown if it interferes with t…
RT @emmons_scott: Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When w…
RT @IvanArcus: 🧵 NEW: We updated our research on unfaithful AI reasoning! We have a stronger dataset which yields lower rates of unfaithfu…
RT @EdTurner42: 1/8: The Emergent Misalignment paper showed LLMs trained on insecure code then want to enslave humanity. ?! We're releasi…
RT @ArthurConmy: In our new paper, we show that Chain-of-Thought reasoning is not always faithful in frontier thinking models! We show this…
RT @JoshAEngels: 1/14: If sparse autoencoders work, they should give us interpretable classifiers that help with probing in difficult regim…
RT @javifer_96: New ICLR 2025 (Oral) paper 🚨 Do LLMs know what they don’t know? We observed internal mechanisms suggesting models recognize…
Check out this fantastic interactive demo by @neuronpedia to see what interesting features you find using Gemma Scope!
neuronpedia.org
Exploring the Inner Workings of Gemma 2 2B
Today we're releasing Gemma Scope: hundreds of SAEs trained on every layer of Gemma 2 2B and 9B and select layers of 27B! Really excited to see how these SAEs help further research into how language models operate.
Sparse Autoencoders act like a microscope for AI internals. They're a powerful tool for interpretability, but training costs limit research. Announcing Gemma Scope: An open suite of SAEs on every layer & sublayer of Gemma 2 2B & 9B! We hope to enable even more ambitious work
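Roughly, each Gemma Scope SAE decomposes the activation vector at one site into a sparse set of feature activations and reconstructs it from learned feature directions. A minimal sketch in plain PyTorch (illustrative names and shapes only, not the released code or training setup; 2304 is Gemma 2 2B's residual width, and the feature count here is arbitrary):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: encode an activation vector into sparse feature
    coefficients, then reconstruct it as a linear combination of learned
    feature directions (rows of the decoder)."""

    def __init__(self, d_model: int = 2304, d_sae: int = 16384):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps only positively-firing features, giving a sparse code.
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return f @ self.W_dec + self.b_dec

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(x))

# Usage: x stands in for a batch of residual-stream activations from one layer.
sae = SparseAutoencoder()
x = torch.randn(8, 2304)
features = sae.encode(x)      # sparse feature activations, shape (8, 16384)
x_hat = sae.decode(features)  # reconstruction of the original activations
```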
RT @NeelNanda5: Great to see my team's JumpReLU Sparse Autoencoder paper covered in VentureBeat! Fantastic work from @sen_r.
Happy to see our new paper on JumpReLU SAEs featured on Daily Papers from @huggingface - and looking forward to releasing hundreds of open SAEs trained this way on Gemma 2 soon!
Google presents Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be…
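The core change, as described in the announcement, is to the SAE's activation function: instead of a plain ReLU, JumpReLU zeroes any pre-activation at or below a learned per-feature threshold and passes larger ones through unchanged. A minimal sketch of just the activation (illustrative only; how the paper trains the thresholds through the discontinuity is not shown here):

```python
import torch

def jumprelu(z: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """JumpReLU: z * Heaviside(z - theta).
    Pre-activations at or below the threshold theta are zeroed;
    those above it keep their original value."""
    return z * (z > theta).to(z.dtype)

# Compared with ReLU, small positive pre-activations below the threshold
# are suppressed rather than kept, changing the sparsity/fidelity trade-off.
z = torch.tensor([-0.5, 0.02, 0.3, 1.7])
theta = torch.tensor(0.1)
print(torch.relu(z))       # tensor([0.0000, 0.0200, 0.3000, 1.7000])
print(jumprelu(z, theta))  # tensor([0.0000, 0.0000, 0.3000, 1.7000])
```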
RT @NeelNanda5: New GDM mech interp paper led by @sen_r: JumpReLU SAEs, a new SOTA SAE method! We replace standard ReLUs with discontinuous…
RT @AnthropicAI: New Anthropic research paper: Scaling Monosemanticity. The first ever detailed look inside a leading large language model…
RT @NeelNanda5: Announcing the first Mechanistic Interpretability workshop, held at ICML 2024! We have a fantastic speaker line-up @ch402 @…
Check out the paper at the arxiv.org link below. Thanks to my great coauthors! @ArthurConmy Lewis Smith @lieberum_t @VikrantVarma_ @JanosKramar @rohinmshah @NeelNanda5
arxiv.org
Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse,...
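The sparse decomposition the abstract refers to comes from training the SAE to trade off reconstruction accuracy against how many features fire. A generic sketch of that objective (the textbook L1-penalised version, not necessarily the exact loss used in this paper):

```python
import torch

def sae_loss(x: torch.Tensor, x_hat: torch.Tensor, features: torch.Tensor,
             sparsity_coeff: float = 1e-3) -> torch.Tensor:
    """Generic SAE objective: reconstruct the activation (squared error)
    while penalising total feature activation (L1, a differentiable
    stand-in for the number of active features)."""
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return recon + sparsity_coeff * sparsity
```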