Seonglae Cho

@SeonglaeC

Followers
44
Following
795
Media
45
Statuses
126

Mechanistic Interpretability | Holistic AI | UCL

London, England
Joined February 2021
@SeonglaeC
Seonglae Cho
3 months
🚀 New paper drop! We found that inference-time SAE features strongly correlate with correctness, enabling fully automated steering without manual tuning.
1
1
5
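The selection step the tweet describes — ranking SAE features by how strongly their activations correlate with answer correctness — can be sketched roughly as below. All names, shapes, and the use of plain Pearson correlation are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def select_steering_features(feature_acts, correct, top_k=5):
    """Rank SAE features by Pearson correlation with answer correctness.

    feature_acts: (n_samples, n_features) pooled SAE activations per sample
    correct:      (n_samples,) 1 if the model answered correctly, else 0
    Returns the indices of the top_k most positively correlated features
    and their correlation values.
    """
    correct = np.asarray(correct, dtype=float)
    acts = feature_acts - feature_acts.mean(axis=0)   # center each feature
    y = correct - correct.mean()                      # center the labels
    denom = np.sqrt((acts ** 2).sum(axis=0) * (y ** 2).sum())
    corr = acts.T @ y / np.where(denom == 0, 1.0, denom)
    order = np.argsort(-corr)[:top_k]                 # most positive first
    return order, corr[order]

# Toy example: feature 0 tracks correctness, feature 1 is pure noise
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=100)
acts = np.column_stack([correct + 0.1 * rng.normal(size=100),
                        rng.normal(size=100)])
idx, corrs = select_steering_features(acts, correct, top_k=1)
print(idx)  # feature 0 should rank first
```

The selected features' decoder directions would then be added to the residual stream at inference time; that steering step is omitted here.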
@SeonglaeC
Seonglae Cho
3 months
We believe scalable, interpretable SAE steering can improve both performance & safety.
1
0
0
@SeonglaeC
Seonglae Cho
3 months
✅ On BBQ Ambig, CorrSteer flipped 1,532 wrong answers to correct while leaving the correct ones untouched (0 changed), minimizing the Side Effect Ratio (SER) and showing that representation-level steering can be safer than fine-tuning.
1
0
0
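The thread doesn't define the Side Effect Ratio exactly; one plausible reading — among answers the steering changed at all, the fraction that regressed from correct to wrong — can be sketched as follows. The function name and formula are assumptions for illustration.

```python
def side_effect_ratio(before, after):
    """Hypothetical SER: of the answers steering changed, the fraction
    that went from correct to wrong (lower is better).

    before, after: sequences of booleans, True = correct answer.
    """
    changed = [(b, a) for b, a in zip(before, after) if b != a]
    if not changed:
        return 0.0
    broken = sum(1 for b, a in changed if b and not a)
    return broken / len(changed)

before = [False, False, True, True]
after  = [True,  True,  True, True]   # two fixes, no regressions
print(side_effect_ratio(before, after))  # 0.0
```

Under this reading, the BBQ Ambig result in the tweet (1,532 wrong→right, 0 right→wrong) would give an SER of exactly 0.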
@SeonglaeC
Seonglae Cho
3 months
🧮 Very interestingly, math and programming features were selected as top correlated in almost every task, meaning math is important even in unexpected datasets. This indirectly supports DeepSeekMath, which showed that math-focused corpora can improve performance across diverse
1
0
0
@SeonglaeC
Seonglae Cho
3 months
โš–๏ธ For the bias benchmark BBQ, unlike expectation, neutrality-focused features were revealed. Interestingly, features that looked too directly related appeared with negative correlation, suggesting that activation of meta-cognitive recognition features may hurt task performance.
1
0
0
@SeonglaeC
Seonglae Cho
3 months
Beyond performance, CorrSteer is interpretable AI Control: it uncovers underlying capabilities that drive task performance. 🔒 For example, on HarmBench, in the LLaMA 8B model, safety-related features were extracted in most layers.
1
0
0
@SeonglaeC
Seonglae Cho
3 months
We ran extensive ablations:
- Generation-token vs all-token pooling
- Raw activation vs SAE activation
- Mean vs Max strategies
- Multi-layer vs single-layer steering
etc.
👉 The key insight: generation-time token correlation drives performance.
1
0
0
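The pooling choices in the ablation (which tokens to pool over, and mean vs max) can be illustrated with a toy sketch; the function and variable names here are assumptions, not the paper's implementation.

```python
import numpy as np

def pool_activations(token_acts, gen_mask, strategy="mean"):
    """Pool per-token SAE activations into one vector per sequence.

    token_acts: (n_tokens, n_features) SAE activations for one sequence
    gen_mask:   (n_tokens,) True for generated tokens, False for prompt tokens
    strategy:   "mean" or "max" over the selected tokens
    """
    gen = token_acts[np.asarray(gen_mask)]  # generation-token pooling;
    if strategy == "mean":                  # pass an all-True mask for
        return gen.mean(axis=0)             # all-token pooling
    if strategy == "max":
        return gen.max(axis=0)
    raise ValueError(f"unknown strategy: {strategy}")

# 4 tokens, 2 features; the last two tokens were generated
acts = np.array([[1.0, 0.0],
                 [2.0, 0.0],
                 [3.0, 1.0],
                 [5.0, 3.0]])
mask = np.array([False, False, True, True])
print(pool_activations(acts, mask, "mean"))  # [4. 2.]
print(pool_activations(acts, mask, "max"))   # [5. 3.]
```

The pooled vectors are what would feed the correlation step, which is how restricting pooling to generation-time tokens can change which features get selected.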
@SeonglaeC
Seonglae Cho
3 months
Existing steering approaches rely on contrastive examples restricted to static contexts. In contrast, CorrSteer goes beyond by directly leveraging generation-time activations, extending SAE-based steering and achieving practical gains across QA, safety, and bias benchmarks.
1
0
0
@SeonglaeC
Seonglae Cho
4 months
Life is a POMDP (Partially observable Markov decision process), always facing uncertainty
0
0
0
@SeonglaeC
Seonglae Cho
4 months
Have to watch Squid Game, no money for X Premium 😥
0
0
0
@SeonglaeC
Seonglae Cho
4 months
In 2030, knowing how to code could be as niche as writing assembly is today
0
0
1
@SeonglaeC
Seonglae Cho
5 months
On top of that, we found that SAEs trained on this faithful synthetic dataset actually achieve better probing performance than those trained on web-based data. 🚀
1
0
1
@SeonglaeC
Seonglae Cho
5 months
What did we find?
- Training with model-generated synthetic data improves feature stability across seeds
- Removes dependency on external data
- Reduces "fake features" (high-frequency SAE latents)
1
0
1
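Feature stability across seeds is commonly checked by matching decoder directions between two independently trained SAEs via cosine similarity; a minimal sketch of one such metric follows. This particular formulation is an assumption, not necessarily the measure used in the thread's work.

```python
import numpy as np

def feature_stability(dec_a, dec_b):
    """For each decoder direction in SAE A, take the best cosine similarity
    to any direction in SAE B, then average over A's features.
    1.0 means every feature in A has an exact counterpart in B.

    dec_a, dec_b: (n_features, d_model) decoder weight matrices.
    """
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1).mean()

# Sanity check: an SAE compared with itself is perfectly stable
dec = np.random.default_rng(1).normal(size=(8, 16))
print(feature_stability(dec, dec))  # ~1.0
```

Comparing SAEs trained on different datasets (or seeds) with a score like this is one way to quantify the stability claim above.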
@SeonglaeC
Seonglae Cho
5 months
And... my upcoming mechanistic interpretability thesis is going to be a big thing! Follow for updates! 👀✨
0
0
1
@SeonglaeC
Seonglae Cho
5 months
Why does this matter? Most previous SAE work just reused web or instruction datasets. But those can make SAE features unstable across different seeds. ( https://t.co/IeVerhCV1a) We show that dataset choice is critical if you want reproducible, model-internal features.
1
0
1