Fantastic work from @sen_r and @ArthurConmy - done in an impressive 2 week paper sprint! Gated SAEs are a new sparse autoencoder architecture that seems to be a major Pareto improvement. This is now my team's preferred way to train SAEs, and I hope it'll accelerate the community's work!
New @GoogleDeepMind MechInterp work! We introduce Gated SAEs, a Pareto improvement over existing sparse autoencoders. They find equally good reconstructions with around half as many firing features, while maintaining interpretability (CI 0-13% improvement). Joint w/ @ArthurConmy
In the process, we also scaled both kinds of SAEs to Gemma 7B, showing SAEs work on non-toy models! Gated SAEs remain an interpretable Pareto improvement.
As a side effect, we showed that SAEs work on Gemma's Gated MLPs. That is: Gated SAEs still beat normal SAEs on Gated MLPs!
Two elegant key ideas:
* SAEs use an L1 penalty to incentivise sparsity, but it distorts reconstructions by penalising magnitude (shrinkage). Why not apply it only to choosing *which* features fire, not how much?
* The resulting architecture boils down to a bunch of training tricks + replacing ReLUs w/ Jump ReLUs (minimal sketch below).
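To make the gate/magnitude split concrete, here's a minimal PyTorch sketch of the core idea. This is my own simplification, not the paper's training code: it omits pieces like the paper's auxiliary reconstruction loss for training the gate path, and the names (d_model, d_sae, r_mag) are illustrative.

```python
import torch
import torch.nn as nn

class GatedSAE(nn.Module):
    """Sketch of a Gated SAE (simplified; omits the paper's auxiliary loss etc.)."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_gate = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_gate = nn.Parameter(torch.zeros(d_sae))
        # The magnitude path shares encoder directions with the gate path,
        # up to a learned per-feature rescaling r_mag and its own bias.
        self.r_mag = nn.Parameter(torch.zeros(d_sae))
        self.b_mag = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        x_cent = x - self.b_dec
        pi_gate = x_cent @ self.W_gate.T + self.b_gate      # decides WHICH features fire
        W_mag = self.W_gate * torch.exp(self.r_mag)[:, None]
        pi_mag = x_cent @ W_mag.T + self.b_mag              # decides HOW MUCH they fire
        f = (pi_gate > 0).to(x.dtype) * torch.relu(pi_mag)  # gated feature activations
        x_hat = f @ self.W_dec.T + self.b_dec
        # L1 is applied to the gate pre-activations only, so sparsity pressure
        # picks which features fire without shrinking their magnitudes.
        l_sparsity = torch.relu(pi_gate).sum(-1).mean()
        l_recon = (x - x_hat).pow(2).sum(-1).mean()
        return x_hat, l_recon, l_sparsity
```

The key point is that the L1 term only sees the gate pre-activations, so sparsity pressure never shrinks the magnitudes the decoder actually uses.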
Why are Jump ReLUs better? A toy model:
* The SAE wants to map the blue histogram to zero, and the red to itself
* Geometrically, a ReLU sets a threshold t: below it the output is zero, above it the output is the distance to the threshold (x - t)
* t=1 distorts red by shifting every value down by 1; t=0 keeps too much blue.
Jump ReLUs solve this! Geometrically, they set *two* lines: below the first, x is mapped to zero; above it, we return the distance to the *second* line. So "if above 1, return distance to 0" is a great solution to this problem! This lets SAEs ignore interference without distortion.
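Here's the toy model in code, with a Jump ReLU whose second line is at 0 (the numbers are made up for illustration; real Gated SAEs learn thresholds per feature):

```python
import torch

def jump_relu(x: torch.Tensor, threshold: float = 1.0) -> torch.Tensor:
    # Below the threshold: zero. Above it: return x itself, i.e. the
    # distance to the *second* line at 0 -- no shrinkage.
    return x * (x > threshold).to(x.dtype)

noise = torch.tensor([0.1, 0.4, 0.8])    # "blue": interference, should map to zero
signal = torch.tensor([1.3, 2.0, 3.5])   # "red": real activations, should pass through

print(jump_relu(noise))          # tensor([0., 0., 0.])            -- interference ignored
print(jump_relu(signal))         # tensor([1.3000, 2.0000, 3.5000]) -- undistorted
print(torch.relu(signal - 1.0))  # tensor([0.3000, 1.0000, 2.5000]) -- ReLU w/ t=1 shrinks red
```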
But Jump ReLUs & avoiding shrinkage aren't enough to explain why Gated SAEs are so great! We fine-tune a frozen normal SAE to have Jump ReLUs, which solves its shrinkage, yet it still underperforms Gated SAEs! We speculate that Jump ReLUs need better encoder directions to exploit the new expressivity.
A key concern in dictionary learning + interp work is that your learning method is *too* powerful: it learns a great sparse reconstruction that's too complex for the actual LLM to decode or use, so it isn't faithful to the model's computation and is likely uninterpretable.
I care a lot about avoiding this trap! This is one of the reasons the field has somewhat converged on SAEs, not more powerful methods. And real models don't have discontinuities like a Jump ReLU. So are Gated SAEs BS?
I don't think so! We can't yet prove they (or normal SAEs!) are faithful to the model's computation, but they were between comparable and more interpretable in a double-blind human study, which is evidence. I'm excited for future work on whether SAEs find causally meaningful features.
An upset: just before release, we got the final interp results, and Gated SAEs *just* dropped below statistical significance for being more interpretable. In my heart of hearts, I am a Bayesian, and I'd bet that they're better, but make of our results what you will (n=342).