
Simon Shaolei Du
@SimonShaoleiDu
Followers: 8K · Following: 6K · Media: 14 · Statuses: 516
Assistant Professor @uwcse. Postdoc @the_IAS. PhD in machine learning @mldcmu.
Seattle, WA
Joined September 2017
RT @pareshrc: 1/6 Current AI agent training methods fail to capture diverse behaviors needed for human-AI cooperation. GOAT (Generative Onl….
RT @avibose22: 🚨 Code is live! Check out LoRe – a modular, lightweight codebase for personalized reward modeling from user preferences. 📦 F….
Check out our new work using online multi-agent RL for LM safety.
🤔 Conventional LM safety alignment is reactive: find vulnerabilities → patch → repeat. 🌟 We propose online multi-agent RL training where Attacker & Defender self-play to co-evolve, finding diverse attacks and improving safety by up to 72% vs. RLHF 🧵
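For readers who want the shape of the method, here is a minimal Python sketch of one Attacker/Defender self-play round; the names (generate_attack, respond, safety_judge, rl_update) and the zero-sum reward design are illustrative assumptions, not the paper's actual implementation.

# Hypothetical sketch of one Attacker/Defender self-play round for LM safety.
# All names below are stand-ins; the paper's actual algorithm may differ.
def self_play_round(attacker, defender, safety_judge, rl_update, num_prompts=64):
    prompts = [attacker.generate_attack() for _ in range(num_prompts)]    # attacker probes
    responses = [defender.respond(p) for p in prompts]                    # defender answers
    safety = [safety_judge(p, r) for p, r in zip(prompts, responses)]     # 1.0 = safe, 0.0 = harmful
    attacker_rewards = [1.0 - s for s in safety]   # attacker is rewarded for eliciting unsafe outputs
    defender_rewards = list(safety)                # defender is rewarded for staying safe
    rl_update(attacker, prompts, attacker_rewards)     # e.g., a PPO/GRPO-style step
    rl_update(defender, responses, defender_rewards)
    return sum(safety) / num_prompts                   # fraction of safe responses this round

Repeating this round lets the two agents co-evolve: as the defender patches one class of attacks, the attacker is pushed to find new ones.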
RT @uwcse: Congratulations to @UW #UWAllen Ph.D. grads @sharma_ashish_2 & @sewon__min, @TheOfficialACM Doctoral Dissertation Award honorees….
PPO vs. DPO? 🤔 Our new paper proves that the answer depends on whether your models can represent the optimal policy and/or the reward. Led by @smellycat_ZZZ and @MinhakSong. Paper:
Two-stage RLHF or one-stage DPO: which one is better for learning from preferences? They are equal under strong assumptions, but representation differences break the tie. Our paper reveals their fine-grained performance gaps under various conditions. Paper:
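For reference, the two pipelines being compared are the standard ones: two-stage RLHF first fits a reward model on preference pairs and then runs KL-regularized policy optimization (e.g., with PPO), while one-stage DPO optimizes the policy directly on the same pairs. In standard notation (not specific to this paper):

\begin{align*}
\text{RLHF, stage 1:}\quad & r_\phi \in \arg\max_r \ \mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\big[\log\sigma\big(r(x,y_w)-r(x,y_l)\big)\big],\\
\text{RLHF, stage 2:}\quad & \pi^\star \in \arg\max_\pi \ \mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot\mid x)}\big[r_\phi(x,y)\big]-\beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big),\\
\text{DPO:}\quad & \mathcal{L}_{\mathrm{DPO}}(\pi_\theta)=-\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\Big[\log\sigma\Big(\beta\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}-\beta\log\tfrac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\Big)\Big].
\end{align*}

The two coincide when the reward class can express the true reward and the policy class can express the corresponding KL-regularized optimal policy; the paper's analysis concerns what happens when those representation assumptions fail.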
Our new paper tries to uncover what we really need when applying RLVR.
🤯 We cracked RLVR with… Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵 Blogpost:
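To make "spurious rewards" concrete, here is a hypothetical Python sketch of the reward variants being compared inside an otherwise unchanged RLVR loop; this illustrates the setup as described in the tweet, not the authors' code.

import random

def ground_truth_reward(answer: str, gold: str) -> float:
    # Standard verifiable reward: 1 if the final answer matches the reference.
    return 1.0 if answer.strip() == gold.strip() else 0.0

def incorrect_reward(answer: str, gold: str) -> float:
    # Deliberately rewards wrong answers.
    return 1.0 - ground_truth_reward(answer, gold)

def random_reward(answer: str, gold: str) -> float:
    # Ignores the answer entirely; a coin flip.
    return float(random.random() < 0.5)

The surprising claim is that plugging random_reward (+21%) or incorrect_reward (+25%) into the same RL loop still improves Qwen2.5-Math-7B on MATH-500, not far from ground_truth_reward (+28.8%).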
Even with the same vision encoder, generative VLMs (LLaVA) can extract more information than CLIP. Why? Check out our #ACL2025NLP paper led by @SitingLi627:
Excited to share that our paper "Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder" is accepted to #ACL2025! Preprint: Thank you @SimonShaoleiDu and @PangWeiKoh so much for your support and guidance throughout the journey!
RT @shaneguML: Famous LLM researcher Bruce Lee quote: "I fear not the LLM who has practiced 10,000 questions once, but I fear the LLM who h….
Excited to share our work led by @ypwang61. RLVR with only ONE training example can boost accuracy on MATH500 by 37%.
We only need ONE example for RLVR on LLMs to achieve significant improvement on math tasks! 📍 RLVR with one training example can boost:
- Qwen2.5-Math-1.5B: 36.0% → 73.6%
- Qwen2.5-Math-7B: 51.0% → 79.2%
on MATH500. 📄 Paper:
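A rough sketch of the 1-shot RLVR recipe as described: the entire training set is a single verifiable problem, sampled many times, with rewards from an answer checker. The sketch below is an illustrative assumption (policy, verify, and policy_gradient_step are hypothetical stand-ins), not the paper's implementation.

# Hypothetical sketch of RLVR with a single training example.
def one_shot_rlvr(policy, problem, gold_answer, verify, policy_gradient_step,
                  steps=1000, group_size=8):
    for _ in range(steps):
        # Sample a group of candidate solutions to the SAME problem.
        solutions = [policy.sample(problem) for _ in range(group_size)]
        # Verifiable reward: 1 if the extracted final answer checks out, else 0.
        rewards = [1.0 if verify(s, gold_answer) else 0.0 for s in solutions]
        # Any policy-gradient-style update (e.g., GRPO/PPO) on this group.
        policy_gradient_step(policy, problem, solutions, rewards)
    return policy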
RT @avibose22: 🧠 Your LLM should model how you think, not reduce you to preassigned traits.📢 Introducing LoRe: a low-rank reward modeling f….
The sampler is crucial for faster convergence of online DPO! Check out our #ICLR2025 paper:
Previous works study the sample complexity of DPO and emphasize the role of samplers in online DPO. What about their role in optimization convergence rates? Check out our paper at #ICLR2025 on convergence rates of online DPO with various samplers! ArXiv:
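To see where the sampler enters, here is a hypothetical Python sketch of one online DPO iteration; the sampler, reward, and update stand-ins are assumptions for illustration, while dpo_loss_term is the standard DPO loss on a single preference pair.

import math

def dpo_loss_term(logp_w, logp_l, logp_ref_w, logp_ref_l, beta=0.1):
    # Standard DPO loss on one (winner, loser) pair, given log-probabilities.
    margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    return math.log1p(math.exp(-margin))  # equals -log sigmoid(margin)

def online_dpo_step(policy, ref_policy, reward, sampler, prompts, update):
    for x in prompts:
        # The sampler chooses which two responses to compare (current policy,
        # a mixture, reward-guided, ...); this choice is what drives the
        # convergence-rate differences studied in the paper.
        y1, y2 = sampler(policy, x)
        y_w, y_l = (y1, y2) if reward(x, y1) >= reward(x, y2) else (y2, y1)
        loss = dpo_loss_term(policy.logp(x, y_w), policy.logp(x, y_l),
                             ref_policy.logp(x, y_w), ref_policy.logp(x, y_l))
        update(policy, loss)  # one gradient step on the online DPO loss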
Excited to share our new work led by @kjha02: scaling training to more diverse environments is key to human-AI cooperation!
Our new paper (first one of my PhD!) on cooperative AI reveals a surprising insight: Environment Diversity > Partner Diversity. Agents trained in self-play across many environments learn cooperative norms that transfer to humans on novel tasks.
RT @VectorZhou: 🧠 Ever notice how LLMs struggle with familiar knowledge in unfamiliar formats? Our new paper "CASCADE Your Datasets for Cro….
RT @jxwuyi: 🎉 Milestone Release! AReaL-boba, our latest #RL system! #AI
• data/code/model ALL 🔥 #OPENSOURCE
• Full #…