Mohsen Fayyaz

@mohsen_fayyaz

Followers: 273 · Following: 2K · Media: 10 · Statuses: 34

CS PhD Student @ UCLA #NLProc #MachineLearning

Los Angeles, CA
Joined April 2018
@mohsen_fayyaz
Mohsen Fayyaz
2 months
🚨 You can bypass ALL safety guardrails of GPT-OSS-120B 🚨❗🤯 How? By detecting behavior-associated experts and switching them on/off. 📄 Steering MoE LLMs via Expert (De)Activation 🔗 https://t.co/U2YRyXon4H 🧵👇
5
24
130
@rohanpaul_ai
Rohan Paul
26 days
This paper shows Mixture of Experts (MoE) models share language-neutral experts in middle layers, and that steering routers boosts multilingual reasoning. A tiny test-time change boosts many languages at almost no cost, by steering toward shared middle experts that predict
5
7
47
@LucasBandarkar
Lucas Bandarkar
1 month
Multilingual Routing in Mixture-of-Experts LLMs We present (1) an in-depth analysis of how MoE LLMs route multilingual texts, with very clear patterns + (2) a router intervention (steering) method that leads to consistent multilingual improvements! 🧵1/4
1
9
26
@VioletNPeng
Violet Peng
2 months
One of my most exciting results lately! We identify experts in MoE models for properties like safety and faithfulness, and steer them to improve or hurt model faithfulness and safety. Most shockingly, with SteerMoE, we can jailbreak open models past their safety guardrails with a 100% attack success rate. Details 👇
@mohsen_fayyaz
Mohsen Fayyaz
2 months
🚨 You can bypass ALL safety guardrails of GPT-OSS-120B 🚨❗🤯 How? By detecting behavior-associated experts and switching them on/off. 📄 Steering MoE LLMs via Expert (De)Activation 🔗 https://t.co/U2YRyXon4H 🧵👇
5
36
261
@mohsen_fayyaz
Mohsen Fayyaz
2 months
📄 Steering MoE LLMs via Expert (De)Activation 🔗 https://t.co/U2YRyXnPf9 💻 https://t.co/xJAIgfOQ2G This work was my internship project at @AdobeResearch, with an amazing team: @AModarressi, @haniehsalehy, @f_dernoncourt, Ryan Rossi, @bhtrung, @HinrichSchuetze, and @VioletNPeng
github.com
A framework for steering MoE models by detecting and controlling behavior-linked experts. - adobe-research/SteerMoE
0
1
19
@mohsen_fayyaz
Mohsen Fayyaz
2 months
💡 TL;DR: 1️⃣ MoE "experts" don't just handle vocab or domains, they encode behaviors (safety, faithfulness, ...). 2️⃣ Flip them on/off at inference to steer the model. 3️⃣ SteerMoE exposes a new dimension of safety alignment faking hidden within experts.
1
1
14
@mohsen_fayyaz
Mohsen Fayyaz
2 months
🔥 But here's the twist: Jailbreak prompts often get blocked by newer guardrails. But if you disable safety-linked experts… 💥 Attack Success Rate hits 💯% 💥 → Safety post-training only aligns a small subnetwork of the model, leaving alternate paths unsafe. (Alignment Faking)
1
2
14
@mohsen_fayyaz
Mohsen Fayyaz
2 months
🎯 Want to reduce hallucinations in RAG? ➡️ Steer toward retrieved-document faithfulness. 🛡️ Want safer outputs? ➡️ Steer toward safety-linked experts.
1
1
16
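For concreteness, here is what paired inputs for the faithfulness direction might look like; the field names and examples are my illustration, not the paper's released data format.

```python
# Hypothetical paired example for the faithfulness direction (my illustration,
# not the paper's data): same context, one completion grounded in the
# retrieved document and one hallucinated.
pairs = [
    {
        "context": "Document: The Eiffel Tower was completed in 1889.",
        "question": "When was the Eiffel Tower completed?",
        "faithful": "It was completed in 1889.",    # grounded in the document
        "unfaithful": "It was completed in 1925.",  # hallucinated date
    },
]
# Flatten into the positive/negative text sets that expert detection compares.
positives = [p["context"] + " " + p["faithful"] for p in pairs]
negatives = [p["context"] + " " + p["unfaithful"] for p in pairs]
```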
@mohsen_fayyaz
Mohsen Fayyaz
2 months
Our method is simple: ⚖️ Compare expert activations between paired inputs (e.g., safe vs unsafe completions) 📊 Measure activation differences ✅ Use that to steer behavior at test time by routing through or around key experts.
1
1
14
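A minimal sketch of that comparison, assuming routing traces (the expert ids the router picked per token) have already been collected; scoring experts by activation-frequency difference is my simplification of the detection idea, not the paper's exact estimator.

```python
import numpy as np

def expert_scores(pos_routes, neg_routes, n_experts):
    """Score experts by activation-frequency difference between the two sets.
    pos_routes / neg_routes: lists of 1-D int arrays, the expert ids the
    router selected for each token."""
    def activation_freq(routes):
        counts = np.zeros(n_experts)
        total = 0
        for ids in routes:
            np.add.at(counts, ids, 1)  # count every selection of each expert
            total += len(ids)
        return counts / max(total, 1)
    return activation_freq(pos_routes) - activation_freq(neg_routes)

# Toy demo: 8 experts, top-2 routing; experts 0 and 1 fire more on "positive" inputs.
rng = np.random.default_rng(0)
pos = [rng.choice(8, size=2, replace=False, p=[.3, .3, .05, .05, .05, .05, .1, .1])
       for _ in range(500)]
neg = [rng.choice(8, size=2, replace=False) for _ in range(500)]
scores = expert_scores(pos, neg, n_experts=8)
print("behavior-linked experts:", np.argsort(scores)[::-1][:2])
```

Experts with the largest positive score are the candidates to deactivate (or force on) at test time.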
@mohsen_fayyaz
Mohsen Fayyaz
2 months
Modern MoE (Mixture-of-Experts) LLMs (e.g., Qwen3, DeepSeek, GPT-OSS) activate a small subset of expert subnetworks per token. But what if we could control which ones get activated? What if we could steer the model… at test time? 🧭
1
3
23
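To make "control which experts get activated" concrete, here is a toy PyTorch router sketch: it edits the router logits before top-k selection so chosen experts can be forced on or off. Real MoE layers in Qwen3, DeepSeek, or GPT-OSS differ per model, and in practice you would patch each layer's gating forward; this `route` function is illustrative, not any model's actual code.

```python
import torch

def route(router_logits, top_k, deactivate=(), force=()):
    """Top-k expert selection after editing router logits.
    deactivate: expert ids that must never be chosen; force: ids that always are."""
    logits = router_logits.clone()
    BIG = 1e9  # large finite value; +/- inf would turn the softmax into NaNs
    if deactivate:
        logits[..., list(deactivate)] = -BIG  # these experts can never win top-k
    if force:
        logits[..., list(force)] = BIG        # these experts always win top-k
    weights, ids = torch.topk(logits, k=top_k, dim=-1)
    return torch.softmax(weights, dim=-1), ids

# Toy demo: 4 tokens, 8 experts, top-2 routing, with expert 3 switched off.
w, ids = route(torch.randn(4, 8), top_k=2, deactivate=[3])
assert (ids != 3).all()  # expert 3 is never routed to
print(ids)
```

Renormalizing the routing weights over the surviving experts is the simplest choice here; an actual implementation may handle the freed-up routing mass differently.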
@VioletNPeng
Violet Peng
6 months
@mohsen_fayyaz's recent work showed several critical issues of dense retrievers favoring spurious correlations over knowledge, which makes RAG particularly vulnerable to adversarial examples. Check out more details 👇
@mohsen_fayyaz
Mohsen Fayyaz
6 months
Now accepted to #ACL2025 main conference! 🎉
0
2
7
@mohsen_fayyaz
Mohsen Fayyaz
6 months
Now accepted to #ACL2025 main conference! 🎉
@mohsen_fayyaz
Mohsen Fayyaz
8 months
new paper! 🌱 Collapse of Dense Retrievers We uncover major vulnerabilities in dense retrievers like Contriever, showing they favor: 📌 Shorter docs 📌 Early positions 📌 Repeated entities 📌 Literal matches ...all while ignoring the answer's presence! https://t.co/QZFyCLqP0P
2
6
28
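One of these biases is easy to probe yourself; this sketch is mine, not the paper's protocol: score an answer-containing document against a short distractor that merely repeats the query's entity, using Contriever's standard mean-pooled embeddings and dot-product scores.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/contriever")
enc = AutoModel.from_pretrained("facebook/contriever")

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = enc(**batch).last_hidden_state      # (batch, tokens, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)  # ignore padding tokens
    return (out * mask).sum(1) / mask.sum(1)      # mean-pool real tokens

query = "Where was Marie Curie born?"
answer_doc = "Marie Curie, born in Warsaw in 1867, moved to Paris to study physics."
distractor = "Marie Curie. Marie Curie. Marie Curie."  # short, repeats the entity, no answer
q, d = embed([query]), embed([answer_doc, distractor])
print(q @ d.T)  # if the distractor scores higher, the bias is visible
```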
@rohanpaul_ai
Rohan Paul
8 months
Dense retrieval models in Retrieval Augmented Generation systems often prioritize superficial document features, overlooking actual answer relevance. This inefficiency arises from biases in retrievers. This paper addresses this by using controlled experiments based on Re-DocRED
0
6
16
@mohsen_fayyaz
Mohsen Fayyaz
8 months
the takeaway? we need robust retrievers that prioritize answer relevance, not just heuristic shortcuts. work with an amazing team: @AModarressi, @HinrichSchuetze, @VioletNPeng paper: https://t.co/D9mVT22Pgj dataset:
huggingface.co
0
0
4
@mohsen_fayyaz
Mohsen Fayyaz
8 months
we also analyze RAG: biased retrievers can mislead LLMs, degrading their performance by 34%, worse than retrieving nothing! 😮
1
0
4
@mohsen_fayyaz
Mohsen Fayyaz
8 months
when multiple biases combine, retrievers fail catastrophically: 📉 answer-containing docs are ranked above a synthetic, answer-free biased doc less than 3% of the time!
1
0
4
@mohsen_fayyaz
Mohsen Fayyaz
8 months
dense retrievers are crucial for RAG and search, but do they actually retrieve useful evidence? 🤔 we design controlled experiments by repurposing a relation extraction dataset, exposing serious flaws in models like Dragon+ and Contriever.
1
0
3
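The construction behind such controlled experiments can be sketched like this, with my own toy sentences standing in for the Re-DocRED-derived data: hold the evidence sentence fixed and vary a single property at a time, here its position in the document.

```python
# Toy construction (mine, not the paper's): fix the evidence sentence and vary
# only its position, so any score change is attributable to position alone.
evidence = "Marie Curie was born in Warsaw."
filler = ["She studied physics in Paris.",
          "She won two Nobel Prizes.",
          "She pioneered research on radioactivity."]

def doc_with_answer_at(pos):
    sents = filler[:]  # copy so the original filler list stays unchanged
    sents.insert(pos, evidence)
    return " ".join(sents)

variants = {f"answer_at_{p}": doc_with_answer_at(p) for p in range(len(filler) + 1)}
for name, doc in variants.items():
    print(name, "->", doc)
```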
@mohsen_fayyaz
Mohsen Fayyaz
8 months
new paper! 🌱 Collapse of Dense Retrievers We uncover major vulnerabilities in dense retrievers like Contriever, showing they favor: 📌 Shorter docs 📌 Early positions 📌 Repeated entities 📌 Literal matches ...all while ignoring the answer's presence! https://t.co/QZFyCLqP0P
huggingface.co
2
5
41
@gordonhu608
Wenbo Hu
10 months
Excited to share MRAG-Bench is accepted at #ICLR2025 🇸🇬. The image corpus is a rich source of information, and extracting knowledge from it can often be more advantageous than from a text corpus. We study how MLLMs can utilize vision-centric multimodal knowledge. More in our
@gordonhu608
Wenbo Hu
1 year
🚀 Introducing MRAG-Bench: How do Large Vision-Language Models utilize vision-centric multimodal knowledge? 🤔 Previous multimodal knowledge QA benchmarks can mainly be solved by retrieving text knowledge. 💥 We focus on scenarios where retrieving knowledge from image corpus is more
0
3
33
@gordonhu608
Wenbo Hu
1 year
🚀 Introducing MRAG-Bench: How do Large Vision-Language Models utilize vision-centric multimodal knowledge? 🤔 Previous multimodal knowledge QA benchmarks can mainly be solved by retrieving text knowledge. 💥 We focus on scenarios where retrieving knowledge from image corpus is more
4
31
98