Wei Xiong
@weixiong_1
Followers: 1K · Following: 661 · Media: 16 · Statuses: 179
Statistical learning theory, post-training of LLMs. PhD student @IllinoisCS, prev @Meta @GoogleDeepMind @MSFTResearch @USTC
Joined February 2024
Keep sampling the prompts until you get enough learning signals! A robust adaptive sampling framework for GRPO-like training.
💥Thrilled to share our new work Reinforce-Ada, which fixes signal collapse in GRPO 🥳No more blind oversampling or dead updates. Just sharper gradients, faster convergence, and stronger models. ⚙️ One-line drop-in. Real gains. https://t.co/kJTeVek1S3
https://t.co/7qLywG2KWR
0
0
14
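A minimal sketch of the adaptive-sampling idea in the Reinforce-Ada post above, assuming the loop simply keeps drawing extra responses for a prompt until its group contains both correct and incorrect answers (i.e., a non-zero group-relative advantage) or a budget runs out. `generate` and `is_correct` are hypothetical stand-ins for the real rollout and reward code, not the paper's implementation.

```python
import random

def generate(prompt):
    """Hypothetical rollout: returns one sampled response string."""
    return f"response to {prompt} #{random.randint(0, 9)}"

def is_correct(prompt, response):
    """Hypothetical verifier: returns True if the response is judged correct."""
    return random.random() < 0.3

def adaptive_group(prompt, group_size=8, max_budget=64):
    """Keep sampling until the group has both correct and incorrect
    responses (so the group-relative advantage is non-zero), or the
    budget is exhausted. Returns the responses and their rewards."""
    responses, rewards = [], []
    while len(responses) < max_budget:
        batch = [generate(prompt) for _ in range(group_size)]
        responses += batch
        rewards += [float(is_correct(prompt, r)) for r in batch]
        if 0.0 < sum(rewards) < len(rewards):  # mixed signal: keep this prompt
            return responses, rewards
    return None  # all-correct or all-wrong: no learning signal, drop the prompt

if __name__ == "__main__":
    out = adaptive_group("2 + 2 = ?")
    print("kept" if out else "dropped")
```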
😃 thanks, google!
🎉 We're excited to announce the 2025 Google PhD Fellows! @GoogleOrg is providing over $10 million to support 255 PhD students across 35 countries, fostering the next generation of research talent to strengthen the global scientific landscape. Read more: https://t.co/0Pvuv6hsgP
12
6
101
🌀Agent Learning via Early Experience🌀 📝: https://t.co/VsqQHTTrBN - SFT for agents is sparse; RL on long horizons is hard. We provide new mid-training signals that work: 1) implicit next-state world modeling task, 2) self-reflection on alternate states - Strong improvements over
3
43
190
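A hedged sketch of how the two mid-training signals above could be constructed from logged agent trajectories. The field names, prompt templates, and example data are illustrative assumptions, not the paper's format.

```python
# Illustrative construction of the two mid-training signals from a logged
# agent trajectory. Field names and templates are assumptions.

trajectory = [
    {"state": "search page", "action": "type 'cheap flights'", "next_state": "results list"},
    {"state": "results list", "action": "click first result", "next_state": "booking page"},
]

def world_modeling_examples(traj):
    """(1) Implicit world modeling: predict the next state from state + action."""
    return [
        {"prompt": f"State: {s['state']}\nAction: {s['action']}\nPredict the next state:",
         "target": s["next_state"]}
        for s in traj
    ]

def self_reflection_examples(traj, alternate_actions):
    """(2) Self-reflection: reason about what an alternate action would have led to."""
    return [
        {"prompt": f"State: {s['state']}\nTaken action: {s['action']}\n"
                   f"Alternate action: {alt}\nReflect on how the outcomes would differ:",
         "target": "<model-generated reflection, filtered before training>"}
        for s, alt in zip(traj, alternate_actions)
    ]

print(world_modeling_examples(trajectory)[0])
print(self_reflection_examples(trajectory, ["scroll down", "go back"])[0])
```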
Where is exploration most impactful in LLM reasoning? The initial tokens! They shape a sequence's entire semantic direction, making early exploration crucial. Our new work, Exploratory Annealed Decoding (EAD), is built on this insight. By starting with high temperature and
4
19
93
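A toy illustration of the annealed-decoding idea above: sample the first tokens at high temperature to explore, then anneal the temperature down so later tokens commit to the chosen direction. The linear schedule and the stand-in "model" are assumptions for illustration, not the EAD recipe itself.

```python
import math, random

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def toy_logits(prefix):
    """Stand-in for a language model: fixed pseudo-random logits per step."""
    rng = random.Random(len(prefix))
    return [rng.uniform(-1, 1) for _ in VOCAB]

def sample(logits, temperature):
    weights = [math.exp(l / temperature) for l in logits]
    total = sum(weights)
    return random.choices(VOCAB, weights=[w / total for w in weights])[0]

def annealed_decode(max_tokens=12, t_start=1.5, t_end=0.3):
    """Explore aggressively on the first tokens (high T), then anneal the
    temperature down so later tokens stay on the chosen semantic path."""
    tokens = []
    for i in range(max_tokens):
        frac = i / max(1, max_tokens - 1)
        temperature = t_start + (t_end - t_start) * frac  # linear anneal
        tokens.append(sample(toy_logits(tokens), temperature))
    return " ".join(tokens)

print(annealed_decode())
```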
Why does Muon outperform Adam—and how? 🚀Answer: Muon Outperforms Adam in Tail-End Associative Memory Learning. Three key findings: > Associative memory parameters are the main beneficiaries of Muon, compared to Adam. > Muon yields more isotropic weights than Adam. > In
1
32
50
💥Thrilled to share our new work Reinforce-Ada, which fixes signal collapse in GRPO 🥳No more blind oversampling or dead updates. Just sharper gradients, faster convergence, and stronger models. ⚙️ One-line drop-in. Real gains. https://t.co/kJTeVek1S3
https://t.co/7qLywG2KWR
9
24
181
PROF🌀Right answer, flawed reason?🤔🌀 📄 https://t.co/8kFrxKQbVW Excited to share our work: PROF (PRocess cOnsistency Filter)! 🚀 Challenge: ORM is blind to flawed logic, and PRM suffers from reward hacking. Our method harmonizes the strengths of PRM & ORM. #LLM #ReinforcementLearning
2
11
37
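A heavily hedged guess at what a process-consistency filter could look like in code: keep a sampled response for training only when its outcome reward (final-answer correctness) and its process reward (step-level score) agree, so right-answer/flawed-reasoning and reward-hacked samples get dropped. The scores, threshold, and keep/drop rule are illustrative assumptions, not the paper's criterion.

```python
# Hedged sketch: keep a sample only when outcome reward and process reward agree.

samples = [
    {"answer_correct": True,  "process_score": 0.92},  # right answer, sound steps  -> keep
    {"answer_correct": True,  "process_score": 0.31},  # right answer, flawed steps -> drop
    {"answer_correct": False, "process_score": 0.88},  # wrong answer, "good" steps -> drop (hacked PRM)
    {"answer_correct": False, "process_score": 0.20},  # wrong answer, bad steps    -> keep as negative
]

def consistent(sample, threshold=0.5):
    looks_sound = sample["process_score"] >= threshold
    return sample["answer_correct"] == looks_sound

filtered = [s for s in samples if consistent(s)]
print(f"kept {len(filtered)} of {len(samples)} samples")
```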
🪜Introducing: StepWiser🦉 📝: https://t.co/RXOjaMjHI1 - Reframes stepwise reward modeling as a reasoning task: outputs CoT + judgment. - Trained by RL using relative outcomes of rollouts. Results: (1) SOTA performance on ProcessBench! (2) Improves policy at train time. (3)
11
96
482
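One common way to turn "relative outcomes of rollouts" into step-level supervision is to compare the Monte Carlo success rate of continuations launched before vs. after each step. The sketch below illustrates that idea with a stubbed rollout function; it is not necessarily StepWiser's exact scheme.

```python
import random

def rollout_success_rate(prefix_steps, n=16):
    """Stub: fraction of n rollouts from this partial solution that reach a
    correct final answer. In practice this would call the policy model."""
    random.seed(len(prefix_steps))
    return random.random()

def label_steps(solution_steps):
    """Label each step by whether continuing *after* it succeeds at least as
    often as continuing from *before* it (relative rollout outcomes)."""
    labels = []
    prev = rollout_success_rate([])
    for i in range(1, len(solution_steps) + 1):
        cur = rollout_success_rate(solution_steps[:i])
        labels.append("good" if cur >= prev else "bad")
        prev = cur
    return labels

steps = ["Let x be the unknown.", "Then 2x = 10.", "So x = 4."]
print(list(zip(steps, label_steps(steps))))
```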
🚀 Excited to share that the Workshop on Mathematical Reasoning and AI (MATH‑AI) will be at NeurIPS 2025! 📅 Dec 6 or 7 (TBD), 2025 🌴 San Diego, California
8
57
239
Reinforcement learning enables LLMs to beat humans on programming/math competitions and has driven recent advances (OpenAI's o-series, Anthropic's Claude 4) Will RL enable broad generalization in the same way that pretraining does? Not with current techniques 🧵 1/7
1
8
26
We introduce Gradient Variance Minimization (GVM)-RAFT, a principled dynamic sampling strategy that minimizes gradient variance to improve the efficiency of chain-of-thought (CoT) training in LLMs. – Achieves 2–4× faster convergence than RAFT – Improves accuracy on math
0
28
89
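A minimal sketch of the dynamic-sampling intuition behind GVM-RAFT: spend more rollouts on prompts whose gradient estimate is noisiest. Here per-prompt variance is proxied by sqrt(p(1-p)) from a pilot estimate of each prompt's success rate, and the budget is split proportionally; both the proxy and the allocation rule are assumptions for illustration, not the paper's exact estimator.

```python
import math

def allocate_budget(success_rates, total_budget=256, min_per_prompt=2):
    """Give each prompt a share of the sampling budget proportional to a
    simple variance proxy sqrt(p * (1 - p)) estimated from a pilot pass.
    Prompts the model always solves or always fails get only the minimum.
    Note: shares may not sum exactly to total_budget after rounding."""
    proxies = [math.sqrt(p * (1.0 - p)) for p in success_rates]
    total = sum(proxies) or 1.0
    return [
        max(min_per_prompt, round(total_budget * w / total))
        for w in proxies
    ]

pilot_success = [0.05, 0.5, 0.9, 1.0]   # estimated per-prompt accuracy
print(allocate_budget(pilot_success))    # noisiest prompts get the most samples
```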
Surprised by the small performance gap between RAFT and Reinforce/GRPO. We may need more fine-grained negative signals to better guide learning.🧐
🤖What makes GRPO work? Rejection Sampling→Reinforce→GRPO - RS is underrated - Key of GRPO: it implicitly removes prompts without a correct answer - Reinforce+Filtering > GRPO (better KL) 💻 https://t.co/PtWHNmLkPS 📄 https://t.co/pYNnJQkJEU 👀RAFT was invited to ICLR25! Come & Chat☕️
2
8
93
🤖What makes GRPO work? Rejection Sampling→Reinforce→GRPO - RS is underrated - Key of GRPO: it implicitly removes prompts without a correct answer - Reinforce+Filtering > GRPO (better KL) 💻 https://t.co/PtWHNmLkPS 📄 https://t.co/pYNnJQkJEU 👀RAFT was invited to ICLR25! Come & Chat☕️
arxiv.org
Reinforcement learning (RL) has become a prevailing approach for fine-tuning large language models (LLMs) on complex reasoning tasks. Among recent methods, GRPO stands out for its empirical...
9
101
466
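The "remove prompts without a correct answer" observation above can be implemented as a simple filter in front of a Reinforce-style update: drop prompt groups whose sampled responses are all wrong (and, optionally, all correct), since they carry no contrastive signal. A minimal sketch with 0/1 rewards; the group format is an assumption.

```python
def filter_prompt_groups(groups, drop_all_correct=True):
    """groups: list of (prompt, responses, rewards) with 0/1 rewards.
    Keep only groups that carry a learning signal for a Reinforce update."""
    kept = []
    for prompt, responses, rewards in groups:
        if sum(rewards) == 0:                        # no correct answer: drop
            continue
        if drop_all_correct and sum(rewards) == len(rewards):
            continue                                 # nothing to contrast: drop
        kept.append((prompt, responses, rewards))
    return kept

groups = [
    ("q1", ["a", "b"], [0, 0]),   # all wrong  -> dropped
    ("q2", ["a", "b"], [1, 0]),   # mixed      -> kept
    ("q3", ["a", "b"], [1, 1]),   # all right  -> dropped (optional)
]
print([g[0] for g in filter_prompt_groups(groups)])  # ['q2']
```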
Our NAACL 2025 findings paper demonstrates that securing AI agents requires more than off-the-shelf defenses—adaptive attacks continuously evolve to bypass them. If you have any questions or want to discuss more, feel free to reach out!
AI agents are increasingly popular (e.g., OpenAI's operator) but can be attacked to harm users! We show that even with defenses, AI agents can still be compromised via indirect prompt injections via "adaptive attacks" in our NAACL 2025 findings paper 🧵 and links below
0
4
7
🚀 Introducing Search-R1 – the first reproduction of Deepseek-R1 (zero) for training reasoning and search-augmented LLM agents with reinforcement learning! This is a step towards training an open-source OpenAI “Deep
github.com
Search-R1: An Efficient, Scalable RL Training Framework for Reasoning & Search Engine Calling interleaved LLM based on veRL - PeterGriffinJin/Search-R1
44
328
3K
Since our models revise the response only when the self-rewarding tokens are activated, they are more efficient in test-time compute scaling. 🤗(5/5)
0
0
6
RL does not consistently improve RM accuracy; rather, it explores different trade-offs in RM accuracy to maximize the final test accuracy. 🤗(4/5)
0
1
6
When trained with Qwen-2.5-Math-7B-base, our models (1) significantly outperform intrinsic correction, with or without training; (2) further boost the final test accuracy through the additional benefit of self-rewarding correction. 🤗(3/5)
0
1
6
To synthesize long CoT trajectories with self-rewarding & self-correcting behaviors from the base model: (1) sequentially prompt the base model to generate the data step by step; (2) filter and keep only the trajectories with the desired patterns. 🤗(2/5)
0
0
6
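A toy version of the two-stage recipe in (2/5) above: sequentially prompt a (stubbed) base model for an attempt, a self-evaluation, and, if needed, a revision, then keep only trajectories whose pattern matches reality. The prompts, the [VERIFY] token, and the stub model are assumptions for illustration, not the paper's templates.

```python
import random

def base_model(prompt):
    """Stub for the base model; returns a plausible continuation."""
    if "evaluate" in prompt:
        return random.choice(["[VERIFY] correct", "[VERIFY] wrong"])
    return random.choice(["answer: 42", "answer: 41"])

def synthesize_trajectory(question, gold="answer: 42"):
    # (1) Sequentially prompt: attempt -> self-reward token -> optional revision.
    attempt = base_model(question)
    verdict = base_model(f"{question}\n{attempt}\nNow evaluate your answer.")
    traj = [attempt, verdict]
    if "wrong" in verdict:
        traj.append(base_model(f"{question}\nRevise your answer."))
    # (2) Filter: keep only trajectories whose pattern matches reality
    #     (correct + verified correct, or wrong + verified wrong + fixed).
    first_ok, says_ok = attempt == gold, "correct" in verdict
    final_ok = traj[-1] == gold
    keep = (first_ok and says_ok) or (not first_ok and not says_ok and final_ok)
    return traj if keep else None

kept = [t for t in (synthesize_trajectory("What is 6*7?") for _ in range(20)) if t]
print(f"kept {len(kept)} / 20 synthesized trajectories")
```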