Lei Yu Profile
Lei Yu

@jade_lei_yu

Followers
53
Following
2
Media
9
Statuses
14

Joined July 2022
@jade_lei_yu
Lei Yu
7 months
[6/6] Using the GSM-Symbolic dataset (derived from GSM8K templates with transformed conditions and results), we verified that additional intervention along LiReF directions boosts generalization on reasoning tasks and lessens dependence on memory recall.
0
0
2
@jade_lei_yu
Lei Yu
7 months
[5/6] Furthermore, we found that by intervening on the model's residual stream along the LiReF directions, we can intentionally modulate the switch between reasoning and memory-recall mechanisms. This helps the model overcome issues such as insufficient memory recall or overthinking.
1
0
0
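The thread doesn't include code, but the intervention described in [5/6] and [4/6] is a residual-stream steering setup. Below is a minimal PyTorch sketch under assumed names: `liref` is a unit-norm LiReF direction and `alpha` is the intervention strength (positive to encourage reasoning, negative to encourage memory recall, matching the activate/suppress framing). The hook placement and shapes are illustrative, not the paper's exact configuration.

```python
import torch

# Toy stand-ins: in practice `liref` would be a LiReF direction
# extracted from a real LLM's residual stream at a chosen layer.
hidden_dim = 64
liref = torch.randn(hidden_dim)
liref = liref / liref.norm()  # unit-norm steering direction

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Add alpha * direction to every residual-stream position.

    Per the thread's framing (assumption, not the paper's exact recipe):
    alpha > 0 pushes representations toward the 'reasoning' end,
    alpha < 0 toward 'memory recall'.
    """
    def hook(module, inputs, output):
        # Assumes the module's output is the residual-stream tensor
        # of shape (batch, seq_len, hidden_dim).
        return output + alpha * direction.to(output.dtype)
    return hook

# Illustrative usage with a stand-in layer; with a real LLM you would
# register this hook on a chosen transformer block instead.
layer = torch.nn.Linear(hidden_dim, hidden_dim)
handle = layer.register_forward_hook(make_steering_hook(liref, alpha=4.0))
out = layer(torch.randn(2, 10, hidden_dim))  # steered activations
handle.remove()
```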
@jade_lei_yu
Lei Yu
7 months
[4/6] Through causal intervention, we find that activating LiReFs during LLM inference significantly enhances model performance on reasoning tasks. Conversely, suppressing LiReFs encourages memory recall, thereby improving performance on memory-intensive tasks.
1
0
1
@jade_lei_yu
Lei Yu
7 months
[3/6] LiReF directions exhibit a gradient structure in representation space: representations at the two extremes indicate pure reasoning or pure memory recall, respectively, while activation values near zero suggest the corresponding problem likely requires both memory and reasoning abilities.
1
0
2
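A hedged sketch of how the gradient structure in [3/6] can be read off: project residual-stream states onto the LiReF direction and interpret the sign and magnitude of the scalar projection. The interpretation is the thread's framing; the shapes and values are illustrative assumptions.

```python
import torch

def liref_score(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Scalar projection of residual-stream states onto a unit LiReF direction.

    Per the thread's reading: large positive -> reasoning-like,
    large negative -> memory-recall-like, near zero -> plausibly
    a problem needing both abilities.
    """
    direction = direction / direction.norm()
    return hidden @ direction  # (batch, seq_len) projection values

# Illustrative: score the last-token representation of a batch of prompts.
hidden = torch.randn(8, 32, 64)  # stand-in residual-stream states
direction = torch.randn(64)      # stand-in LiReF direction
scores = liref_score(hidden, direction)[:, -1]
print(scores)
```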
@jade_lei_yu
Lei Yu
7 months
[2/6] We identify Linear Reasoning Features (LiReFs) in the LLM residual stream that linearly separate reasoning-vs-memory task representations. Experiments (4 LLMs, 6 datasets) confirm that LiReFs broadly explain and influence reasoning across languages, domains, and tasks.
1
0
2
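The tweet doesn't say how the directions are extracted. One generic way to obtain a direction that linearly separates two sets of representations is a difference-of-means (a trained linear probe would also work); the sketch below is that generic recipe under stand-in data, not necessarily the paper's exact procedure.

```python
import torch

def difference_of_means_direction(reason_acts: torch.Tensor,
                                  memory_acts: torch.Tensor) -> torch.Tensor:
    """One generic way to get a linearly separating direction:
    the normalized difference between the two class-mean activations."""
    mu_r = reason_acts.mean(dim=0)
    mu_m = memory_acts.mean(dim=0)
    d = mu_r - mu_m
    return d / d.norm()

# Stand-in activations: rows are residual-stream states collected on
# reasoning-heavy vs memory-heavy prompts at some fixed layer.
reason_acts = torch.randn(100, 64) + 1.0
memory_acts = torch.randn(100, 64) - 1.0
liref = difference_of_means_direction(reason_acts, memory_acts)

# Sanity check: projections of the two classes should separate.
print((reason_acts @ liref).mean().item(), (memory_acts @ liref).mean().item())
```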
@jade_lei_yu
Lei Yu
7 months
[1/6] LLMs excel on reasoning benchmarks. However, research shows they often generalize poorly to unseen problems, possibly due to over-reliance on memorized training data. Further, the mechanisms governing how/when LLMs switch between reasoning and memory recall remain unclear.
1
0
2
@jade_lei_yu
Lei Yu
7 months
📎 https://t.co/M1CdaqMWK9 💻 https://t.co/ovqPJ5kxxA Thanks to lead author Yihuai Hong @YihuaiH91773, incoming PhD student @NYU_Courant, advisors Zhijing Jin @ZhijingJin & Lei Yu @jade_lei_yu (UToronto), and collaborators Meng Cao @Meng_0209 (McGill) & Dian Zhou (UIUC)
1
0
3
@jade_lei_yu
Lei Yu
7 months
🤔How do LLMs perform reasoning, and how do they recall memorized knowledge? How similar are the underlying mechanisms? We reveal their inherent distinction within LLMs' representations and identify linear features that mediate the model's switch between genuine reasoning and memory recall.
3
3
12
@jade_lei_yu
Lei Yu
1 year
@littlefish3625 We believe this is a good example of how the analysis of internal mechanisms and representations of AI models can yield actionable insights and make models better and safer! Paper at: https://t.co/gBbEBCfB8D Work done with @gini_do @mahnerak and @nicola_cancedda at @AIatMeta
0
0
11
@jade_lei_yu
Lei Yu
1 year
@littlefish3625 3) We propose a new adversarial training method, ReFAT (Refusal Feature Adversarial Training), that is both more effective against adversarial attacks and much less computationally expensive than known alternatives.
1
0
7
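The tweet only names the method, so what follows is a rough, assumption-laden sketch of the idea as described in this thread: approximate the worst-case attack by ablating the refusal feature on harmful-prompt activations, then train the model to refuse anyway. The stand-in model, loss, and data are placeholders; consult the paper for the actual ReFAT algorithm.

```python
import torch

def ablate_direction(h: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Remove the component of h along unit direction r (the refusal feature)."""
    r = r / r.norm()
    return h - (h @ r).unsqueeze(-1) * r

# Rough ReFAT-style training step (placeholder head/loss/data):
# simulate a worst-case jailbreak by ablating the refusal feature
# on harmful-prompt activations, then train to still refuse.
hidden_dim = 64
refusal_head = torch.nn.Linear(hidden_dim, 2)  # stand-in: refuse/comply logits
opt = torch.optim.Adam(refusal_head.parameters(), lr=1e-3)
refusal_dir = torch.randn(hidden_dim)          # stand-in refusal direction

harmful_hidden = torch.randn(16, hidden_dim)        # stand-in activations
refuse_labels = torch.zeros(16, dtype=torch.long)   # class 0 = refuse

attacked = ablate_direction(harmful_hidden, refusal_dir)  # simulated attack
loss = torch.nn.functional.cross_entropy(refusal_head(attacked), refuse_labels)
opt.zero_grad()
loss.backward()
opt.step()
```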
@jade_lei_yu
Lei Yu
1 year
@littlefish3625 2) The ablation of the refusal feature is a strong approximation to the worst-case adversarial degradation used in state-of-the-art adversarial training methods aimed at protecting against jailbreaking.
1
0
4
@jade_lei_yu
Lei Yu
1 year
@littlefish3625 1) The most effective jailbreaking methods all work by suppressing the representation along the refusal feature, to the point that they become ineffective when that feature is restored at inference time after the attack.
1
0
6
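A minimal sketch of the two operations these replies describe: suppressing the component of a hidden state along the refusal direction (what successful jailbreaks effectively do) and restoring it at inference time. Treating restoration as resetting the projection to a target value (e.g. the mean projection observed on plain harmful prompts) is an illustrative assumption, not the paper's stated procedure.

```python
import torch

def suppress(h: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Zero out the component of h along unit refusal direction r."""
    r = r / r.norm()
    return h - (h @ r).unsqueeze(-1) * r

def restore(h: torch.Tensor, r: torch.Tensor, target_proj: float) -> torch.Tensor:
    """Set the component along r to a target value -- an illustrative way
    to 'put the refusal feature back' at inference time."""
    r = r / r.norm()
    return suppress(h, r) + target_proj * r

h = torch.randn(4, 64)      # stand-in hidden states
r = torch.randn(64)         # stand-in refusal direction
target = 3.0                # assumed mean refusal projection on harmful prompts

attacked = suppress(h, r)            # projection along r is now ~0
defended = restore(attacked, r, target)
print(defended @ (r / r.norm()))     # all entries ≈ target
```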
@jade_lei_yu
Lei Yu
1 year
@littlefish3625 observed that the decision to refuse a request is mediated by a single direction, and highlighted a connection between adversarial attacks and refusal direction alterations: https://t.co/0mDRqaCoy0
@NeelNanda5
Neel Nanda
1 year
Great to see this on Arxiv! We show there's a *single* direction in LLMs that mediates whether they refuse, which allows a simple, interpretable jailbreak of open-weight LLMs. It's easy, surgical, and competitive with finetuning - a contender for a real-world app of interp!
1
0
5
@jade_lei_yu
Lei Yu
1 year
New paper! 🎊 We are delighted to announce our new paper "Robust LLM Safeguarding via Refusal Feature Adversarial Training"! There is a common mechanism behind LLM jailbreaking, and it can be leveraged to make models safer!
4
5
54