Mohsen Fayyaz
@mohsen_fayyaz
Followers
273
Following
2K
Media
10
Statuses
34
CS PhD Student @ UCLA #NLProc #MachineLearning
Los Angeles, CA
Joined April 2018
You can bypass ALL safety guardrails of GPT-OSS-120B. How? By detecting behavior-associated experts and switching them on/off. Steering MoE LLMs via Expert (De)Activation: https://t.co/U2YRyXon4H
5
24
130
This paper shows that Mixture-of-Experts (MoE) models share language-neutral experts in their middle layers, and that steering the routers boosts multilingual reasoning. This means a tiny test-time change improves many languages at almost no cost, by steering toward shared middle-layer experts that predict …
5
7
47
Multilingual Routing in Mixture-of-Experts LLMs. We present (1) an in-depth analysis of how MoE LLMs route multilingual texts, with very clear patterns, and (2) a router intervention (steering) method that leads to consistent multilingual improvements! 1/4
1
9
26
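A minimal sketch of the kind of router intervention (steering) described in the thread above, assuming a generic MoE layer whose router produces per-token logits over experts; the boosted expert indices and bias value are illustrative placeholders, not the paper's settings.

```python
import torch

def steer_router_logits(router_logits: torch.Tensor,
                        boosted_experts: list[int],
                        bias: float = 5.0) -> torch.Tensor:
    """Add a fixed bias to selected experts' logits before top-k routing.

    router_logits: (num_tokens, num_experts) scores from an MoE router.
    boosted_experts: indices of experts to push the router toward,
                     e.g. language-shared experts in middle layers.
    """
    steered = router_logits.clone()
    steered[:, boosted_experts] += bias
    return steered

# Toy usage: 4 tokens routed over 8 experts with top-2 selection.
logits = torch.randn(4, 8)
steered = steer_router_logits(logits, boosted_experts=[2, 5])
print(torch.topk(steered, k=2, dim=-1).indices)  # experts 2 and 5 now dominate
```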
One of my most exciting results lately! We identify experts in MoE models for properties like safety and faithfulness, and steer them to improve or hurt model faithfulness and safety. Most shockingly, with SteerMoE, we can bypass the safety guardrails of open models with a 100% attack success rate. Details below.
You can bypass ALL safety guardrails of GPT-OSS-120B. How? By detecting behavior-associated experts and switching them on/off. Steering MoE LLMs via Expert (De)Activation: https://t.co/U2YRyXon4H
5
36
261
Steering MoE LLMs via Expert (De)Activation. Paper: https://t.co/U2YRyXnPf9 Code: https://t.co/xJAIgfOQ2G This work was my internship project at @AdobeResearch, with an amazing team: @AModarressi, @haniehsalehy, @f_dernoncourt, Ryan Rossi, @bhtrung, @HinrichSchuetze, and @VioletNPeng
github.com
A framework for steering MoE models by detecting and controlling behavior-linked experts. - adobe-research/SteerMoE
0
1
19
TL;DR: 1) MoE "experts" don't just handle vocab or domains; they encode behaviors (safety, faithfulness, ...). 2) Flip them on/off at inference to steer the model. 3) SteerMoE exposes a new dimension of safety alignment faking hidden within experts.
1
1
14
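A minimal sketch of what flipping experts off at inference could look like, assuming access to a layer's pre-top-k router logits; the expert indices are illustrative, and this is not the released SteerMoE implementation.

```python
import torch

def deactivate_experts(router_logits: torch.Tensor,
                       disabled_experts: list[int]) -> torch.Tensor:
    """Hard-disable selected experts by setting their router logits to -inf,
    so top-k selection can never pick them for any token."""
    masked = router_logits.clone()
    masked[:, disabled_experts] = float("-inf")
    return masked

# Toy usage: 3 tokens, 8 experts, top-2 routing with experts 1 and 6 switched off.
logits = torch.randn(3, 8)
masked = deactivate_experts(logits, disabled_experts=[1, 6])
weights, chosen = torch.topk(torch.softmax(masked, dim=-1), k=2, dim=-1)
assert not {1, 6} & set(chosen.flatten().tolist())  # disabled experts never fire
```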
But here's the twist: jailbreak prompts often get blocked by newer guardrails, yet if you disable safety-linked experts, the Attack Success Rate hits 100%. Safety post-training only aligns a small subnetwork of the model, leaving alternate paths unsafe. (Alignment Faking)
1
2
14
Want to reduce hallucinations in RAG? Steer toward retrieved-document faithfulness. Want safer outputs? Steer toward safety-linked experts.
1
1
16
Our method is simple: 1) compare expert activations between paired inputs (e.g., safe vs. unsafe completions), 2) measure the activation differences, and 3) use them to steer behavior at test time by routing through or around the key experts.
1
1
14
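A rough sketch of the detection step described in the tweet above, assuming you can log which experts the router selects for each token on paired inputs; the toy routing traces and the top-n cutoff are assumptions, not the paper's exact procedure.

```python
import numpy as np

def expert_activation_rates(selected_experts: list[list[int]],
                            num_experts: int) -> np.ndarray:
    """selected_experts[t] holds the expert indices chosen for token t.
    Returns the fraction of tokens that activated each expert."""
    counts = np.zeros(num_experts)
    for experts in selected_experts:
        for e in experts:
            counts[e] += 1
    return counts / max(len(selected_experts), 1)

def rank_behavior_linked_experts(rates_a: np.ndarray,
                                 rates_b: np.ndarray,
                                 top_n: int = 8) -> np.ndarray:
    """Rank experts by how much more often they fire on input set A
    (e.g. unsafe completions) than on the paired set B (safe completions)."""
    return np.argsort(-(rates_a - rates_b))[:top_n]

# Toy usage: routing traces over 8 experts for two paired input sets.
trace_unsafe = [[0, 3], [3, 5], [3, 7]]
trace_safe = [[0, 1], [1, 5], [2, 7]]
rates_u = expert_activation_rates(trace_unsafe, num_experts=8)
rates_s = expert_activation_rates(trace_safe, num_experts=8)
print(rank_behavior_linked_experts(rates_u, rates_s))  # expert 3 ranks first
```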
Modern MoE (Mixture-of-Experts) LLMs (e.g., Qwen3, DeepSeek, GPT-OSS) activate a small subset of expert subnetworks per token. But what if we could control which ones get activated? What if we could steer the model… at test time?
1
3
23
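For context on the routing mechanism referenced above, a toy top-k MoE layer in PyTorch; the dimensions, expert count, and k are arbitrary and do not correspond to any of the named models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTopKMoE(nn.Module):
    """Minimal top-k MoE layer: a router scores experts per token, only the
    top-k experts run, and their outputs are mixed by renormalized weights."""
    def __init__(self, d_model=16, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (num_tokens, d_model)
        logits = self.router(x)                 # (num_tokens, num_experts)
        weights, idx = torch.topk(F.softmax(logits, dim=-1), self.k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for token in range(x.shape[0]):
            for slot in range(self.k):          # only k experts run per token
                e = idx[token, slot].item()
                out[token] += weights[token, slot] * self.experts[e](x[token])
        return out

tokens = torch.randn(5, 16)
print(ToyTopKMoE()(tokens).shape)  # torch.Size([5, 16])
```

Only k expert subnetworks execute per token; deciding which experts the router may or must pick is the dimension the steering sketches above manipulate.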
@mohsen_fayyaz's recent work showed several critical issues: dense retrievers favor spurious correlations over knowledge, which makes RAG particularly vulnerable to adversarial examples. Check out more details below.
Now accepted to #ACL2025 main conference!
0
2
7
Now accepted to #ACL2025 main conference!
new paper! Collapse of Dense Retrievers: We uncover major vulnerabilities in dense retrievers like Contriever, showing they favor shorter docs, early positions, repeated entities, and literal matches, all while ignoring the answer's presence! https://t.co/QZFyCLqP0P
2
6
28
Dense retrieval models in Retrieval-Augmented Generation systems often prioritize superficial document features and overlook actual answer relevance. This weakness arises from biases in the retrievers. The paper investigates this with controlled experiments based on Re-DocRED …
0
6
16
the takeaway? we need robust retrievers that prioritize answer relevance, not just heuristic shortcuts. work with an amazing team: @AModarressi, @HinrichSchuetze, @VioletNPeng paper: https://t.co/D9mVT22Pgj dataset:
huggingface.co
0
0
4
we also analyze RAG: biased retrievers can mislead LLMs, degrading their performance by 34%, worse than retrieving nothing!
1
0
4
when multiple biases combine, retrievers fail catastrophically: answer-containing docs are ranked above a synthetic biased doc with no answer less than 3% of the time!
1
0
4
dense retrievers are crucial for RAG and search, but do they actually retrieve useful evidence? we design controlled experiments by repurposing a relation extraction dataset, exposing serious flaws in models like Dragon+ and Contriever.
1
0
3
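A minimal sketch of the kind of controlled comparison described above, using a generic sentence-transformers bi-encoder as a stand-in for retrievers like Dragon+ or Contriever; the query and documents are invented examples, not items from the repurposed relation-extraction data.

```python
from sentence_transformers import SentenceTransformer, util

# Stand-in bi-encoder; the paper studies retrievers such as Contriever and Dragon+.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = "Who founded the Acme Corporation?"
# Controlled pair: one document contains the answer, the other only shares
# surface features with the query (literal overlap, shortness, repetition).
doc_with_answer = ("The Acme Corporation, a maker of industrial gadgets, "
                   "was founded by Jane Smith in 1952 in Ohio.")
doc_biased = "Acme Corporation founded. Acme Corporation history and founding."

q_emb = model.encode([query])
d_emb = model.encode([doc_with_answer, doc_biased])
scores = util.cos_sim(q_emb, d_emb)[0]
print(f"answer doc: {scores[0].item():.3f}  biased doc: {scores[1].item():.3f}")
# If the biased document scores higher despite lacking the answer, the retriever
# is following surface heuristics rather than answer relevance.
```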
new paper! Collapse of Dense Retrievers: We uncover major vulnerabilities in dense retrievers like Contriever, showing they favor shorter docs, early positions, repeated entities, and literal matches, all while ignoring the answer's presence! https://t.co/QZFyCLqP0P
huggingface.co
2
5
41
Excited to share that MRAG-Bench is accepted at #ICLR2025. The image corpus is a rich source of information, and extracting knowledge from it can often be more advantageous than from a text corpus. We study how MLLMs can utilize vision-centric multimodal knowledge. More in our …
Introducing MRAG-Bench: How do Large Vision-Language Models utilize vision-centric multimodal knowledge? Previous multimodal knowledge QA benchmarks can mainly be solved by retrieving text knowledge. We focus on scenarios where retrieving knowledge from the image corpus is more …
0
3
33
Introducing MRAG-Bench: How do Large Vision-Language Models utilize vision-centric multimodal knowledge? Previous multimodal knowledge QA benchmarks can mainly be solved by retrieving text knowledge. We focus on scenarios where retrieving knowledge from the image corpus is more …
4
31
98