Seonglae Cho
@SeonglaeC
Followers
44
Following
795
Media
45
Statuses
126
Mechanistic Interpretability | Holistic AI | UCL
London, England
Joined February 2021
New paper drop! We found that inference-time SAE features strongly correlate with correctness, enabling fully automated steering without manual tuning.
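To make the idea concrete, here is a minimal sketch of correlation-based feature selection as the tweet describes it: pool SAE feature activations per sample, correlate each feature with a 0/1 correctness label, and keep the most strongly correlated features as steering candidates. The arrays, sizes, and feature counts below are hypothetical placeholders, not the paper's setup.

```python
# Minimal sketch of correlation-based SAE feature selection (not the authors' code).
import numpy as np

acts = np.random.rand(1000, 4096)        # hypothetical pooled SAE activations, (n_samples, n_features)
correct = np.random.randint(0, 2, 1000)  # 1 if the model answered correctly, else 0

# Pearson correlation between each feature's activation and correctness
acts_c = acts - acts.mean(axis=0)
corr_c = correct - correct.mean()
denom = acts_c.std(axis=0) * corr_c.std() * len(correct)
r = (acts_c * corr_c[:, None]).sum(axis=0) / np.where(denom == 0, 1, denom)

top = np.argsort(-np.abs(r))[:8]  # strongest features; the sign suggests the steering direction
print(top, r[top])
```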
Here is the paper link: https://t.co/HXYoIv5BH5 I'm open to researcher positions in London (in person), feel free to reach out.
arxiv.org
Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision. However, their effectiveness in downstream steering tasks is limited by the...
We believe scalable, interpretable SAE steering can improve both performance & safety.
On BBQ Ambig, CorrSteer flipped 1,532 wrong answers to correct while leaving the already-correct answers untouched, minimizing the Side Effect Ratio (SER) and showing that representation-level steering can be safer than fine-tuning.
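The thread does not spell out how SER is computed, so this is just one plausible reading: among answers that were correct before steering, the fraction that become wrong after steering. The function name and inputs are hypothetical.

```python
# Hypothetical sketch of a Side Effect Ratio (SER); the paper's exact definition may differ.
def side_effect_ratio(before_correct, after_correct):
    # fraction of originally-correct answers that regress to wrong after steering
    flipped_bad = sum(b and not a for b, a in zip(before_correct, after_correct))
    originally_correct = sum(before_correct)
    return flipped_bad / originally_correct if originally_correct else 0.0

print(side_effect_ratio([1, 1, 0, 1], [1, 0, 1, 1]))  # 1 of 3 correct answers regressed -> 0.33
```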
Very interestingly, math and programming features were selected as the top correlated features in almost every task, meaning math matters even in unexpected datasets. This indirectly supports DeepSeekMath, which showed that math-focused corpora can improve performance across diverse tasks.
For the bias benchmark BBQ, contrary to expectation, neutrality-focused features emerged. Interestingly, features that looked too directly related to the task showed negative correlation, suggesting that activating meta-cognitive recognition features may hurt task performance.
Beyond performance, CorrSteer is interpretable AI control: it uncovers the underlying capabilities that drive task performance. For example, on HarmBench with the LLaMA 8B model, safety-related features were extracted in most layers.
We ran extensive ablations:
- Generation-token vs. all-token pooling
- Raw activations vs. SAE activations
- Mean vs. max pooling strategies
- Multi-layer vs. single-layer steering
and more. The key insight: generation-time token correlation drives performance.
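A small sketch of the pooling variants ablated above, under the assumption that SAE activations come as a (seq_len, n_features) array and that the prompt/generation boundary is known; names and shapes are illustrative only.

```python
# Illustrative pooling choices for one sample (not the authors' code).
import numpy as np

feat_acts = np.random.rand(128, 4096)  # hypothetical SAE activations, (seq_len, n_features)
gen_start = 96                         # index where generated tokens begin after the prompt

all_token_mean = feat_acts.mean(axis=0)              # pool over prompt + generation
gen_token_mean = feat_acts[gen_start:].mean(axis=0)  # pool over generated tokens only
gen_token_max = feat_acts[gen_start:].max(axis=0)    # max pooling instead of mean
```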
Existing steering approaches rely on contrastive examples restricted to static contexts. CorrSteer instead leverages generation-time activations directly, extending SAE-based steering and achieving practical gains across QA, safety, and bias benchmarks.
Life is a POMDP (Partially Observable Markov Decision Process): we are always acting under uncertainty.
In 2030, knowing how to code could be as niche as writing assembly is today
Check out the full paper and code! Paper: https://t.co/CUmqSzttKF Code:
github.com
Unlock the internal features of LLMs: FaithfulSAE trains sparse autoencoders only on synthetic data from the model itself, eliminating fake features and external data dependence for reproducible, r...
On top of that, we found that SAEs trained on this faithful synthetic dataset actually achieve better probing performance than those trained on web-based data.
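For context, "probing" here can be illustrated with a simple linear probe on SAE features: a classifier trained on the features to predict a target label, where higher accuracy means the features encode more usable information. The feature and label arrays below are hypothetical, not the paper's evaluation data.

```python
# Sketch of a linear probe on SAE features (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

feats = np.random.rand(2000, 1024)       # hypothetical SAE feature activations per example
labels = np.random.randint(0, 2, 2000)   # hypothetical binary probing labels

X_tr, X_te, y_tr, y_te = train_test_split(feats, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # compare SAEs by this number
```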
What did we find?
- Training with model-generated synthetic data improves feature stability across seeds
- Removes the dependency on external data
- Reduces "fake features" (high-frequency SAE latents)
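A rough sketch of the "faithful" training-data idea as described: sample text from the model itself, record its hidden activations on that self-generated text, and train the SAE on those activations instead of on a web corpus. The model choice and the `sae_train_step` hook are assumptions, not the paper's code.

```python
# Rough sketch of training an SAE on the model's own synthetic data (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; the paper's models may differ
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1) Sample synthetic text from the model itself instead of using external data.
ids = tok("The", return_tensors="pt").input_ids
synthetic = model.generate(ids, do_sample=True, max_new_tokens=64)

# 2) Record hidden activations on that self-generated text for SAE training.
with torch.no_grad():
    hidden = model(synthetic, output_hidden_states=True).hidden_states[6]  # pick a layer
# sae_train_step(hidden)  # hypothetical SAE update on these model-internal activations
```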
And... my upcoming mechanistic interpretability thesis is going to be a big thing! Follow for updates!
Why does this matter? Most previous SAE work just reused web or instruction datasets. But those can make SAE features unstable across different seeds ( https://t.co/IeVerhCV1a). We show that dataset choice is critical if you want reproducible, model-internal features.
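One simple way to quantify the cross-seed instability mentioned here: match the decoder directions of two SAEs trained with different seeds by cosine similarity and look at the best-match scores. The linked work may measure stability differently; this is only an illustration with random stand-in weights.

```python
# Sketch of cross-seed feature matching between two SAEs (illustrative only).
import numpy as np

def max_cosine_match(dec_a, dec_b):
    # dec_a, dec_b: (n_features, d_model) decoder weight matrices from two seeds
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    sims = a @ b.T            # pairwise cosine similarities between decoder directions
    return sims.max(axis=1)   # best match in B for each feature of A

dec_a, dec_b = np.random.randn(512, 768), np.random.randn(512, 768)
print(max_cosine_match(dec_a, dec_b).mean())  # higher mean -> more stable features across seeds
```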