Lei Yu (@jade_lei_yu) · 53 Followers · 2 Following · 9 Media · 14 Statuses · Joined July 2022
[6/6] Using the GSM-Symbolic dataset (derived from GSM8K templates with transformed conditions/results), we verified that additional intervention along LiReF directions boosts generalization on reasoning tasks and lessens dependence on memory recall.
[5/6] Furthermore, we found that by intervening on the model's residual stream along the LiReF directions, we can intentionally modulate the switch between reasoning and memory-recall mechanisms. This helps the model overcome issues such as insufficient memory recall or overthinking.
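For concreteness, here is a minimal sketch of what such a residual-stream intervention can look like, assuming a HuggingFace Llama-style model; `layer_idx`, `alpha`, and the precomputed `liref_direction` are illustrative placeholders, not values or code from the paper.

```python
import torch

def add_liref_steering_hook(model, layer_idx: int,
                            liref_direction: torch.Tensor, alpha: float):
    """Shift one layer's residual stream by alpha * direction.
    alpha > 0 pushes toward reasoning; alpha < 0 toward memory recall."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * liref_direction  # broadcasts over positions
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden
    # Llama-style module layout assumed; other architectures name blocks differently.
    return model.model.layers[layer_idx].register_forward_hook(hook)
```

The returned handle's `.remove()` undoes the intervention after generation.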
[4/6] Through causal intervention, we find that activating LiReFs during inference significantly enhances model performance on reasoning tasks. Conversely, suppressing LiReFs encourages memory recall, thereby improving performance on memory-intensive tasks.
[3/6] LiReF directions exhibit a gradient structure in representation space: representations at the two extremes indicate pure reasoning or pure memory recall, respectively, while activation values near zero suggest the corresponding problem likely requires both memory and reasoning abilities.
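As a toy illustration of this gradient structure: the scalar projection of a hidden state onto a unit-norm LiReF direction locates an input on the reasoning-memory spectrum (names here are hypothetical, not from the paper).

```python
import torch

def liref_score(hidden_state: torch.Tensor,
                liref_direction: torch.Tensor) -> torch.Tensor:
    """Projection onto the unit-norm LiReF direction: large positive ~ pure
    reasoning, large negative ~ pure recall, near zero ~ likely both."""
    return hidden_state @ liref_direction
```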
[2/6] We identify Linear Reasoning Features (LiReFs) in the LLM residual stream that linearly separate reasoning-task from memory-task representations. Experiments (4 LLMs, 6 datasets) confirm that LiReFs broadly explain and influence reasoning across languages, domains, and tasks.
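One standard way to extract such a linear feature (a sketch under assumptions, not necessarily the paper's exact method) is the difference of class means over residual-stream activations collected at a fixed layer:

```python
import torch

def diff_of_means_direction(reasoning_acts: torch.Tensor,
                            memory_acts: torch.Tensor) -> torch.Tensor:
    """Inputs: [n_examples, d_model] activations for reasoning-task and
    memory-task prompts at one layer. Returns a unit-norm direction."""
    direction = reasoning_acts.mean(dim=0) - memory_acts.mean(dim=0)
    return direction / direction.norm()
```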
[1/6] LLMs excel on reasoning benchmarks. However, research shows they often generalize poorly to unseen problems, possibly due to over-reliance on memorized training data. Further, the mechanisms governing how/when LLMs switch between reasoning and memory recall remain unclear.
📎 https://t.co/M1CdaqMWK9 💻 https://t.co/ovqPJ5kxxA Thanks to lead author Yihuai Hong @YihuaiH91773, an incoming PhD student at @NYU_Courant, advisors Zhijing Jin @ZhijingJin & Lei Yu @jade_lei_yu (UToronto), and collaborators Meng Cao @Meng_0209 (McGill) & Dian Zhou (UIUC).
🤔 How do LLMs perform reasoning, and how do they recall memorized knowledge? How similar are the underlying mechanisms? We reveal their inherent distinction within LLMs' representations and identify linear features that mediate the model's switch between genuine reasoning and memory recall.
We believe this is a good example of how analyzing the internal mechanisms and representations of AI models can yield actionable insights and make models better and safer! Paper at: https://t.co/gBbEBCfB8D Work done with @gini_do @mahnerak and @nicola_cancedda at @AIatMeta
3) We propose a new adversarial training method, ReFAT (Refusal Feature Adversarial Training), that is both more effective against adversarial attacks and much less computationally expensive than known alternatives.
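A rough conceptual sketch of that idea, not the paper's implementation: run the forward pass with the refusal feature ablated at every layer (a cheap stand-in for a worst-case attack) and train the model to refuse anyway. A Llama-style model layout and a batch whose labels are refusal responses are assumptions here.

```python
import torch

def ablate(hidden: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Remove the component of hidden along unit-norm direction r."""
    return hidden - (hidden @ r).unsqueeze(-1) * r

def refat_step(model, batch, refusal_dir, optimizer):
    """One training update under simulated refusal-feature ablation."""
    handles = [
        layer.register_forward_hook(
            lambda mod, inp, out: (ablate(out[0], refusal_dir),) + tuple(out[1:])
            if isinstance(out, tuple) else ablate(out, refusal_dir))
        for layer in model.model.layers  # Llama-style layout assumed
    ]
    loss = model(**batch).loss  # standard LM loss on refusal targets
    for h in handles:
        h.remove()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```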
2) Ablating the refusal feature is a strong approximation of the worst-case adversarial degradation used in state-of-the-art adversarial training methods aimed at protecting against jailbreaking.
1) The most effective jailbreaking methods all work by suppressing the representation along the refusal feature, to the point that they become ineffective when the feature is restored at inference time after the attack.
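In code terms (hypothetical names, a sketch rather than any paper's exact implementation): attacks effectively remove the hidden state's component along the refusal direction, and adding a projection coefficient measured on clean runs back at inference time neutralizes them.

```python
import torch

def ablate_refusal(hidden: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """h <- h - (h . r) r, with r a unit-norm refusal direction."""
    return hidden - (hidden @ r).unsqueeze(-1) * r

def restore_refusal(hidden: torch.Tensor, r: torch.Tensor,
                    coeff: torch.Tensor) -> torch.Tensor:
    """Add back a refusal component measured on clean (unattacked) runs."""
    return hidden + coeff.unsqueeze(-1) * r
```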
@littlefish3625 observed that the decision to refuse a request is mediated by a single direction, and highlighted a connection between adversarial attacks and refusal direction alterations: https://t.co/0mDRqaCoy0
Great to see this on Arxiv! We show there's a *single* direction in LLMs that mediates whether they refuse, which allows a simple, interpretable jailbreak of open-weight LLMs. It's easy, surgical, and competitive with finetuning - a contender for a real-world app of interp!
New paper! 🎊 We are delighted to announce "Robust LLM Safeguarding via Refusal Feature Adversarial Training"! There is a common mechanism behind LLM jailbreaking, and it can be leveraged to make models safer!