Lei Yu (@jade_lei_yu) · 53 Followers · 2 Following · 9 Media · 14 Statuses · Joined July 2022
[6/6] Using the GSM-Symbolic dataset (derived from GSM8K templates with transformed conditions/results), we verified that additional intervention along LiReF directions boosts generalization on reasoning tasks and lessens dependence on memory recall.
[5/6] Furthermore, we found that by intervening on the model's residual stream along the LiReF directions, we can intentionally modulate the switch between reasoning and memory-recall mechanisms. This helps the model overcome issues such as insufficient memory recall or overthinking.
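For concreteness, here is a minimal sketch of what such a residual-stream intervention can look like, assuming a HuggingFace Llama-style model; `layer_idx`, `alpha`, and the precomputed `liref_direction` are illustrative placeholders, not values or code from the paper.

```python
import torch

def add_liref_steering_hook(model, layer_idx: int,
                            liref_direction: torch.Tensor, alpha: float):
    """Shift one layer's residual stream by alpha * direction.
    alpha > 0 pushes toward reasoning; alpha < 0 toward memory recall."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * liref_direction  # broadcasts over positions
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden
    # Llama-style module layout assumed; other architectures name blocks differently.
    return model.model.layers[layer_idx].register_forward_hook(hook)
```

The returned handle's `.remove()` undoes the intervention after generation.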
[4/6] Through causal intervention, we find that activating LiReFs during inference significantly enhances model performance on reasoning tasks. Conversely, suppressing LiReFs encourages memory recall, thereby improving performance on memory-intensive tasks.
[3/6] LiReF directions exhibit a gradient structure in representation space: representations at the two extremes indicate pure reasoning or pure memory recall, respectively, while activation values near zero suggest the corresponding problem likely requires both memory and reasoning abilities.
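As a toy illustration of this gradient structure: the scalar projection of a hidden state onto a unit-norm LiReF direction locates an input on the reasoning-memory spectrum (names here are hypothetical, not from the paper).

```python
import torch

def liref_score(hidden_state: torch.Tensor,
                liref_direction: torch.Tensor) -> torch.Tensor:
    """Projection onto the unit-norm LiReF direction: large positive ~ pure
    reasoning, large negative ~ pure recall, near zero ~ likely both."""
    return hidden_state @ liref_direction
```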
[2/6] We identify Linear Reasoning Features (LiReFs) in the LLM residual stream that linearly separate reasoning-task from memory-task representations. Experiments (4 LLMs, 6 datasets) confirm that LiReFs broadly explain and influence reasoning across languages, domains, and tasks.
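One standard way to extract such a linear feature (a sketch under assumptions, not necessarily the paper's exact method) is the difference of class means over residual-stream activations collected at a fixed layer:

```python
import torch

def diff_of_means_direction(reasoning_acts: torch.Tensor,
                            memory_acts: torch.Tensor) -> torch.Tensor:
    """Inputs: [n_examples, d_model] activations for reasoning-task and
    memory-task prompts at one layer. Returns a unit-norm direction."""
    direction = reasoning_acts.mean(dim=0) - memory_acts.mean(dim=0)
    return direction / direction.norm()
```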
[1/6] LLMs excel on reasoning benchmarks. However, research shows they often generalize poorly to unseen problems, possibly due to over-reliance on memorized training data. Further, the mechanisms governing how/when LLMs switch between reasoning and memory recall remain unclear.
📎 https://t.co/M1CdaqMWK9 💻 https://t.co/ovqPJ5kxxA Thanks to lead author Yihuai Hong @YihuaiH91773, an incoming PhD student at @NYU_Courant, advisors Zhijing Jin @ZhijingJin & Lei Yu @jade_lei_yu (UToronto), and collaborators Meng Cao @Meng_0209 (McGill) & Dian Zhou (UIUC).
🤔 How do LLMs perform reasoning, and how do they recall memorized knowledge? How similar are the underlying mechanisms? We reveal their inherent distinction within LLMs' representations and identify linear features that mediate the model's switch between genuine reasoning and memory recall.
We believe this is a good example of how analyzing the internal mechanisms and representations of AI models can yield actionable insights and make models better and safer! Paper at: https://t.co/gBbEBCfB8D Work done with @gini_do @mahnerak and @nicola_cancedda at @AIatMeta
3) We propose a new adversarial training method, ReFAT (Refusal Feature Adversarial Training), that is both more effective against adversarial attacks and much less computationally expensive than known alternatives.
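A rough conceptual sketch of that idea, not the paper's implementation: run the forward pass with the refusal feature ablated at every layer (a cheap stand-in for a worst-case attack) and train the model to refuse anyway. A Llama-style model layout and a batch whose labels are refusal responses are assumptions here.

```python
import torch

def ablate(hidden: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Remove the component of hidden along unit-norm direction r."""
    return hidden - (hidden @ r).unsqueeze(-1) * r

def refat_step(model, batch, refusal_dir, optimizer):
    """One training update under simulated refusal-feature ablation."""
    handles = [
        layer.register_forward_hook(
            lambda mod, inp, out: (ablate(out[0], refusal_dir),) + tuple(out[1:])
            if isinstance(out, tuple) else ablate(out, refusal_dir))
        for layer in model.model.layers  # Llama-style layout assumed
    ]
    loss = model(**batch).loss  # standard LM loss on refusal targets
    for h in handles:
        h.remove()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```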
2) Ablating the refusal feature is a strong approximation of the worst-case adversarial degradation used in state-of-the-art adversarial training methods aimed at protecting against jailbreaking.
1) The most effective jailbreaking methods all work by suppressing the representation along the refusal feature, to the point that they become ineffective when the feature is restored at inference time after the attack.
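In code terms (hypothetical names, a sketch rather than any paper's exact implementation): attacks effectively remove the hidden state's component along the refusal direction, and adding a projection coefficient measured on clean runs back at inference time neutralizes them.

```python
import torch

def ablate_refusal(hidden: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """h <- h - (h . r) r, with r a unit-norm refusal direction."""
    return hidden - (hidden @ r).unsqueeze(-1) * r

def restore_refusal(hidden: torch.Tensor, r: torch.Tensor,
                    coeff: torch.Tensor) -> torch.Tensor:
    """Add back a refusal component measured on clean (unattacked) runs."""
    return hidden + coeff.unsqueeze(-1) * r
```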
@littlefish3625 observed that the decision to refuse a request is mediated by a single direction, and highlighted a connection between adversarial attacks and refusal direction alterations: https://t.co/0mDRqaCoy0
Great to see this on Arxiv! We show there's a *single* direction in LLMs that mediates whether they refuse, which allows a simple, interpretable jailbreak of open-weight LLMs. It's easy, surgical, and competitive with finetuning - a contender for a real-world app of interp!
New paper! 🎊 We are delighted to announce "Robust LLM Safeguarding via Refusal Feature Adversarial Training"! There is a common mechanism behind LLM jailbreaking, and it can be leveraged to make models safer!