Seonglae Cho
@SeonglaeC
Followers
44
Following
795
Media
45
Statuses
126
Mechanistic Interpretability | Holistic AI | UCL
London, England
Joined February 2021
New paper drop! We found that inference-time SAE features strongly correlate with correctness, enabling fully automated steering without manual tuning.
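To make the idea concrete, here is a minimal sketch of correlation-based feature selection as the tweet describes it: pool SAE feature activations per sample, correlate each feature with a 0/1 correctness label, and keep the most strongly correlated features as steering candidates. The arrays, sizes, and feature counts below are hypothetical placeholders, not the paper's setup.

```python
# Minimal sketch of correlation-based SAE feature selection (not the authors' code).
import numpy as np

acts = np.random.rand(1000, 4096)        # hypothetical pooled SAE activations, (n_samples, n_features)
correct = np.random.randint(0, 2, 1000)  # 1 if the model answered correctly, else 0

# Pearson correlation between each feature's activation and correctness
acts_c = acts - acts.mean(axis=0)
corr_c = correct - correct.mean()
denom = acts_c.std(axis=0) * corr_c.std() * len(correct)
r = (acts_c * corr_c[:, None]).sum(axis=0) / np.where(denom == 0, 1, denom)

top = np.argsort(-np.abs(r))[:8]  # strongest features; the sign suggests the steering direction
print(top, r[top])
```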
Here is the paper link: https://t.co/HXYoIv5BH5 I'm open to researcher positions in London (in person), feel free to reach out.
arxiv.org
Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision. However, their effectiveness in downstream steering tasks is limited by the...
We believe scalable, interpretable SAE steering can improve both performance & safety.
On BBQ Ambig, CorrSteer flipped 1,532 wrong answers to correct while leaving the already-correct answers untouched, minimizing the Side Effect Ratio (SER) and showing that representation-level steering can be safer than fine-tuning.
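The thread does not spell out how SER is computed, so this is just one plausible reading: among answers that were correct before steering, the fraction that become wrong after steering. The function name and inputs are hypothetical.

```python
# Hypothetical sketch of a Side Effect Ratio (SER); the paper's exact definition may differ.
def side_effect_ratio(before_correct, after_correct):
    # fraction of originally-correct answers that regress to wrong after steering
    flipped_bad = sum(b and not a for b, a in zip(before_correct, after_correct))
    originally_correct = sum(before_correct)
    return flipped_bad / originally_correct if originally_correct else 0.0

print(side_effect_ratio([1, 1, 0, 1], [1, 0, 1, 1]))  # 1 of 3 correct answers regressed -> 0.33
```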
Very interestingly, math and programming features were selected as the top correlated features in almost every task, meaning math matters even in unexpected datasets. This indirectly supports DeepSeekMath, which showed that math-focused corpora can improve performance across diverse tasks.
For the bias benchmark BBQ, contrary to expectation, neutrality-focused features emerged. Interestingly, features that looked too directly related to the task showed negative correlation, suggesting that activating meta-cognitive recognition features may hurt task performance.
Beyond performance, CorrSteer is interpretable AI control: it uncovers the underlying capabilities that drive task performance. For example, on HarmBench with the LLaMA 8B model, safety-related features were extracted in most layers.
We ran extensive ablations:
- Generation-token vs. all-token pooling
- Raw activations vs. SAE activations
- Mean vs. max pooling strategies
- Multi-layer vs. single-layer steering
and more. The key insight: generation-time token correlation drives performance.
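A small sketch of the pooling variants ablated above, under the assumption that SAE activations come as a (seq_len, n_features) array and that the prompt/generation boundary is known; names and shapes are illustrative only.

```python
# Illustrative pooling choices for one sample (not the authors' code).
import numpy as np

feat_acts = np.random.rand(128, 4096)  # hypothetical SAE activations, (seq_len, n_features)
gen_start = 96                         # index where generated tokens begin after the prompt

all_token_mean = feat_acts.mean(axis=0)              # pool over prompt + generation
gen_token_mean = feat_acts[gen_start:].mean(axis=0)  # pool over generated tokens only
gen_token_max = feat_acts[gen_start:].max(axis=0)    # max pooling instead of mean
```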
Existing steering approaches rely on contrastive examples restricted to static contexts. CorrSteer instead leverages generation-time activations directly, extending SAE-based steering and achieving practical gains across QA, safety, and bias benchmarks.
Life is a POMDP (Partially Observable Markov Decision Process): we are always acting under uncertainty.
In 2030, knowing how to code could be as niche as writing assembly is today
Check out the full paper and code! Paper: https://t.co/CUmqSzttKF Code:
github.com
Unlock the internal features of LLMs: FaithfulSAE trains sparse autoencoders only on synthetic data from the model itself, eliminating fake features and external data dependence for reproducible, r...
On top of that, we found that SAEs trained on this faithful synthetic dataset actually achieve better probing performance than those trained on web-based data.
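For context, "probing" here can be illustrated with a simple linear probe on SAE features: a classifier trained on the features to predict a target label, where higher accuracy means the features encode more usable information. The feature and label arrays below are hypothetical, not the paper's evaluation data.

```python
# Sketch of a linear probe on SAE features (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

feats = np.random.rand(2000, 1024)       # hypothetical SAE feature activations per example
labels = np.random.randint(0, 2, 2000)   # hypothetical binary probing labels

X_tr, X_te, y_tr, y_te = train_test_split(feats, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # compare SAEs by this number
```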
What did we find?
- Training with model-generated synthetic data improves feature stability across seeds
- Removes the dependency on external data
- Reduces "fake features" (high-frequency SAE latents)
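A rough sketch of the "faithful" training-data idea as described: sample text from the model itself, record its hidden activations on that self-generated text, and train the SAE on those activations instead of on a web corpus. The model choice and the `sae_train_step` hook are assumptions, not the paper's code.

```python
# Rough sketch of training an SAE on the model's own synthetic data (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; the paper's models may differ
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1) Sample synthetic text from the model itself instead of using external data.
ids = tok("The", return_tensors="pt").input_ids
synthetic = model.generate(ids, do_sample=True, max_new_tokens=64)

# 2) Record hidden activations on that self-generated text for SAE training.
with torch.no_grad():
    hidden = model(synthetic, output_hidden_states=True).hidden_states[6]  # pick a layer
# sae_train_step(hidden)  # hypothetical SAE update on these model-internal activations
```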
And... my upcoming mechanistic interpretability thesis is going to be a big thing! Follow for updates!
Why does this matter? Most previous SAE work just reused web or instruction datasets. But those can make SAE features unstable across different seeds ( https://t.co/IeVerhCV1a). We show that dataset choice is critical if you want reproducible, model-internal features.
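One simple way to quantify the cross-seed instability mentioned here: match the decoder directions of two SAEs trained with different seeds by cosine similarity and look at the best-match scores. The linked work may measure stability differently; this is only an illustration with random stand-in weights.

```python
# Sketch of cross-seed feature matching between two SAEs (illustrative only).
import numpy as np

def max_cosine_match(dec_a, dec_b):
    # dec_a, dec_b: (n_features, d_model) decoder weight matrices from two seeds
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    sims = a @ b.T            # pairwise cosine similarities between decoder directions
    return sims.max(axis=1)   # best match in B for each feature of A

dec_a, dec_b = np.random.randn(512, 768), np.random.randn(512, 768)
print(max_cosine_match(dec_a, dec_b).mean())  # higher mean -> more stable features across seeds
```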