Mohsen Fayyaz
@mohsen_fayyaz
Followers
273
Following
2K
Media
10
Statuses
34
CS PhD Student @ UCLA #NLProc #MachineLearning
Los Angeles, CA
Joined April 2018
You can bypass ALL safety guardrails of GPT-OSS-120B. How? By detecting behavior-associated experts and switching them on/off. Steering MoE LLMs via Expert (De)Activation: https://t.co/U2YRyXon4H
5
24
130
This paper shows that Mixture-of-Experts (MoE) models share language-neutral experts in their middle layers, and that steering the routers boosts multilingual reasoning. This means a tiny test-time change improves many languages at almost no cost, by steering toward shared middle-layer experts that predict …
5
7
47
Multilingual Routing in Mixture-of-Experts LLMs. We present (1) an in-depth analysis of how MoE LLMs route multilingual texts, with very clear patterns, and (2) a router intervention (steering) method that leads to consistent multilingual improvements! 1/4
1
9
26
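A minimal sketch of the kind of router intervention (steering) described in the thread above, assuming a generic MoE layer whose router produces per-token logits over experts; the boosted expert indices and bias value are illustrative placeholders, not the paper's settings.

```python
import torch

def steer_router_logits(router_logits: torch.Tensor,
                        boosted_experts: list[int],
                        bias: float = 5.0) -> torch.Tensor:
    """Add a fixed bias to selected experts' logits before top-k routing.

    router_logits: (num_tokens, num_experts) scores from an MoE router.
    boosted_experts: indices of experts to push the router toward,
                     e.g. language-shared experts in middle layers.
    """
    steered = router_logits.clone()
    steered[:, boosted_experts] += bias
    return steered

# Toy usage: 4 tokens routed over 8 experts with top-2 selection.
logits = torch.randn(4, 8)
steered = steer_router_logits(logits, boosted_experts=[2, 5])
print(torch.topk(steered, k=2, dim=-1).indices)  # experts 2 and 5 now dominate
```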
One of my most exciting results lately! We identify experts in MoE models for properties like safety and faithfulness, and steer them to improve or hurt model faithfulness and safety. Most shockingly, with SteerMoE, we can bypass the safety guardrails of open models with a 100% attack success rate. Details below.
You can bypass ALL safety guardrails of GPT-OSS-120B. How? By detecting behavior-associated experts and switching them on/off. Steering MoE LLMs via Expert (De)Activation: https://t.co/U2YRyXon4H
5
36
261
Steering MoE LLMs via Expert (De)Activation. Paper: https://t.co/U2YRyXnPf9 Code: https://t.co/xJAIgfOQ2G This work was my internship project at @AdobeResearch, with an amazing team: @AModarressi, @haniehsalehy, @f_dernoncourt, Ryan Rossi, @bhtrung, @HinrichSchuetze, and @VioletNPeng
github.com
A framework for steering MoE models by detecting and controlling behavior-linked experts. - adobe-research/SteerMoE
0
1
19
TL;DR: 1) MoE "experts" don't just handle vocab or domains; they encode behaviors (safety, faithfulness, ...). 2) Flip them on/off at inference to steer the model. 3) SteerMoE exposes a new dimension of safety alignment faking hidden within experts.
1
1
14
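A minimal sketch of what flipping experts off at inference could look like, assuming access to a layer's pre-top-k router logits; the expert indices are illustrative, and this is not the released SteerMoE implementation.

```python
import torch

def deactivate_experts(router_logits: torch.Tensor,
                       disabled_experts: list[int]) -> torch.Tensor:
    """Hard-disable selected experts by setting their router logits to -inf,
    so top-k selection can never pick them for any token."""
    masked = router_logits.clone()
    masked[:, disabled_experts] = float("-inf")
    return masked

# Toy usage: 3 tokens, 8 experts, top-2 routing with experts 1 and 6 switched off.
logits = torch.randn(3, 8)
masked = deactivate_experts(logits, disabled_experts=[1, 6])
weights, chosen = torch.topk(torch.softmax(masked, dim=-1), k=2, dim=-1)
assert not {1, 6} & set(chosen.flatten().tolist())  # disabled experts never fire
```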
But here's the twist: jailbreak prompts often get blocked by newer guardrails, yet if you disable safety-linked experts, the Attack Success Rate hits 100%. Safety post-training only aligns a small subnetwork of the model, leaving alternate paths unsafe. (Alignment Faking)
1
2
14
Want to reduce hallucinations in RAG? Steer toward retrieved-document faithfulness. Want safer outputs? Steer toward safety-linked experts.
1
1
16
Our method is simple: 1) compare expert activations between paired inputs (e.g., safe vs. unsafe completions), 2) measure the activation differences, and 3) use them to steer behavior at test time by routing through or around the key experts.
1
1
14
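A rough sketch of the detection step described in the tweet above, assuming you can log which experts the router selects for each token on paired inputs; the toy routing traces and the top-n cutoff are assumptions, not the paper's exact procedure.

```python
import numpy as np

def expert_activation_rates(selected_experts: list[list[int]],
                            num_experts: int) -> np.ndarray:
    """selected_experts[t] holds the expert indices chosen for token t.
    Returns the fraction of tokens that activated each expert."""
    counts = np.zeros(num_experts)
    for experts in selected_experts:
        for e in experts:
            counts[e] += 1
    return counts / max(len(selected_experts), 1)

def rank_behavior_linked_experts(rates_a: np.ndarray,
                                 rates_b: np.ndarray,
                                 top_n: int = 8) -> np.ndarray:
    """Rank experts by how much more often they fire on input set A
    (e.g. unsafe completions) than on the paired set B (safe completions)."""
    return np.argsort(-(rates_a - rates_b))[:top_n]

# Toy usage: routing traces over 8 experts for two paired input sets.
trace_unsafe = [[0, 3], [3, 5], [3, 7]]
trace_safe = [[0, 1], [1, 5], [2, 7]]
rates_u = expert_activation_rates(trace_unsafe, num_experts=8)
rates_s = expert_activation_rates(trace_safe, num_experts=8)
print(rank_behavior_linked_experts(rates_u, rates_s))  # expert 3 ranks first
```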
Modern MoE (Mixture-of-Experts) LLMs (e.g., Qwen3, DeepSeek, GPT-OSS) activate a small subset of expert subnetworks per token. But what if we could control which ones get activated? What if we could steer the model… at test time?
1
3
23
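For context on the routing mechanism referenced above, a toy top-k MoE layer in PyTorch; the dimensions, expert count, and k are arbitrary and do not correspond to any of the named models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTopKMoE(nn.Module):
    """Minimal top-k MoE layer: a router scores experts per token, only the
    top-k experts run, and their outputs are mixed by renormalized weights."""
    def __init__(self, d_model=16, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (num_tokens, d_model)
        logits = self.router(x)                 # (num_tokens, num_experts)
        weights, idx = torch.topk(F.softmax(logits, dim=-1), self.k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for token in range(x.shape[0]):
            for slot in range(self.k):          # only k experts run per token
                e = idx[token, slot].item()
                out[token] += weights[token, slot] * self.experts[e](x[token])
        return out

tokens = torch.randn(5, 16)
print(ToyTopKMoE()(tokens).shape)  # torch.Size([5, 16])
```

Only k expert subnetworks execute per token; deciding which experts the router may or must pick is the dimension the steering sketches above manipulate.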
@mohsen_fayyaz's recent work showed several critical issues: dense retrievers favor spurious correlations over knowledge, which makes RAG particularly vulnerable to adversarial examples. Check out more details below.
Now accepted to #ACL2025 main conference!
0
2
7
Now accepted to #ACL2025 main conference!
new paper! Collapse of Dense Retrievers: We uncover major vulnerabilities in dense retrievers like Contriever, showing they favor shorter docs, early positions, repeated entities, and literal matches, all while ignoring the answer's presence! https://t.co/QZFyCLqP0P
2
6
28
Dense retrieval models in Retrieval-Augmented Generation systems often prioritize superficial document features and overlook actual answer relevance. This weakness arises from biases in the retrievers. The paper investigates this with controlled experiments based on Re-DocRED …
0
6
16
the takeaway? we need robust retrievers that prioritize answer relevance, not just heuristic shortcuts. work with an amazing team: @AModarressi, @HinrichSchuetze, @VioletNPeng paper: https://t.co/D9mVT22Pgj dataset:
huggingface.co
0
0
4
we also analyze RAG: biased retrievers can mislead LLMs, degrading their performance by 34%, worse than retrieving nothing!
1
0
4
when multiple biases combine, retrievers fail catastrophically: answer-containing docs are ranked above a synthetic biased doc with no answer less than 3% of the time!
1
0
4
dense retrievers are crucial for RAG and search, but do they actually retrieve useful evidence? we design controlled experiments by repurposing a relation extraction dataset, exposing serious flaws in models like Dragon+ and Contriever.
1
0
3
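A minimal sketch of the kind of controlled comparison described above, using a generic sentence-transformers bi-encoder as a stand-in for retrievers like Dragon+ or Contriever; the query and documents are invented examples, not items from the repurposed relation-extraction data.

```python
from sentence_transformers import SentenceTransformer, util

# Stand-in bi-encoder; the paper studies retrievers such as Contriever and Dragon+.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = "Who founded the Acme Corporation?"
# Controlled pair: one document contains the answer, the other only shares
# surface features with the query (literal overlap, shortness, repetition).
doc_with_answer = ("The Acme Corporation, a maker of industrial gadgets, "
                   "was founded by Jane Smith in 1952 in Ohio.")
doc_biased = "Acme Corporation founded. Acme Corporation history and founding."

q_emb = model.encode([query])
d_emb = model.encode([doc_with_answer, doc_biased])
scores = util.cos_sim(q_emb, d_emb)[0]
print(f"answer doc: {scores[0].item():.3f}  biased doc: {scores[1].item():.3f}")
# If the biased document scores higher despite lacking the answer, the retriever
# is following surface heuristics rather than answer relevance.
```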
new paper! Collapse of Dense Retrievers: We uncover major vulnerabilities in dense retrievers like Contriever, showing they favor shorter docs, early positions, repeated entities, and literal matches, all while ignoring the answer's presence! https://t.co/QZFyCLqP0P
huggingface.co
2
5
41
Excited to share that MRAG-Bench is accepted at #ICLR2025. The image corpus is a rich source of information, and extracting knowledge from it can often be more advantageous than from a text corpus. We study how MLLMs can utilize vision-centric multimodal knowledge. More in our …
Introducing MRAG-Bench: How do Large Vision-Language Models utilize vision-centric multimodal knowledge? Previous multimodal knowledge QA benchmarks can mainly be solved by retrieving text knowledge. We focus on scenarios where retrieving knowledge from the image corpus is more …
0
3
33
Introducing MRAG-Bench: How do Large Vision-Language Models utilize vision-centric multimodal knowledge? Previous multimodal knowledge QA benchmarks can mainly be solved by retrieving text knowledge. We focus on scenarios where retrieving knowledge from the image corpus is more …
4
31
98