Wei Xiong Profile
Wei Xiong

@weixiong_1

Followers 1K · Following 661 · Media 16 · Statuses 179

Statistical learning theory, post-training of LLMs. PhD student @IllinoisCS, prev. @Meta @GoogleDeepMind @MSFTResearch @USTC

Joined February 2024
@weixiong_1
Wei Xiong
1 month
Keep sampling the prompts until you get enough learning signals! A robust adaptive sampling framework for GRPO-like training.
@hendrydong
Hanze Dong
1 month
💥 Thrilled to share our new work Reinforce-Ada, which fixes signal collapse in GRPO. 🥳 No more blind oversampling or dead updates. Just sharper gradients, faster convergence, and stronger models. ⚙️ One-line drop-in. Real gains. https://t.co/kJTeVek1S3 https://t.co/7qLywG2KWR
0
0
14
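The adaptive-sampling idea described in the Reinforce-Ada quote above can be illustrated with a minimal sketch (my own reading of the tweet, not the paper's code): for each prompt, keep drawing responses until the group contains both correct and incorrect answers, so the group-normalized advantage is not identically zero. `sample_response` and `reward_fn` are hypothetical stand-ins.

```python
import random

def sample_group_with_signal(prompt, sample_response, reward_fn,
                             group_size=8, max_draws=64):
    """Keep sampling responses for `prompt` until the collected group contains
    both correct and incorrect answers, i.e. a non-degenerate learning signal
    for GRPO-style advantage normalization."""
    responses, rewards = [], []
    while len(responses) < max_draws:
        resp = sample_response(prompt)           # one rollout from the policy
        responses.append(resp)
        rewards.append(reward_fn(prompt, resp))  # e.g. 1.0 if correct else 0.0
        have_enough = len(responses) >= group_size
        has_signal = len(set(rewards)) > 1       # both positives and negatives seen
        if have_enough and has_signal:
            return responses, rewards
    return None  # no usable signal within the budget; skip this prompt

# Toy usage with a random "policy" and a dummy verifier.
if __name__ == "__main__":
    fake_policy = lambda p: random.choice(["right", "wrong", "wrong"])
    verifier = lambda p, r: 1.0 if r == "right" else 0.0
    print(sample_group_with_signal("2+2=?", fake_policy, verifier))
```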
@weixiong_1
Wei Xiong
19 days
😃 thanks, google!
@Googleorg
Google.org
19 days
🎉 We're excited to announce the 2025 Google PhD Fellows! @GoogleOrg is providing over $10 million to support 255 PhD students across 35 countries, fostering the next generation of research talent to strengthen the global scientific landscape. Read more: https://t.co/0Pvuv6hsgP
12
6
101
@jaseweston
Jason Weston
25 days
🌀 Agent Learning via Early Experience 🌀
📝: https://t.co/VsqQHTTrBN
- SFT for agents is sparse; RL on long horizons is hard. We provide new mid-training signals that work:
1) Implicit next-state world modeling task
2) Self-reflection on alternate states
- Strong improvements over
3
43
190
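A rough sketch of how the two mid-training signals named in the tweet above could be turned into training examples from logged agent trajectories. The field names and prompt templates are illustrative assumptions, not the paper's data format.

```python
def build_midtraining_examples(trajectory):
    """Turn a logged agent trajectory into two kinds of auxiliary examples:
    (1) implicit world modeling: predict the next observation from (state, action);
    (2) self-reflection: contrast the taken action with an alternate action and
        the state it led to. Field names are illustrative assumptions."""
    examples = []
    for step in trajectory:
        # (1) next-state prediction task
        examples.append({
            "task": "world_model",
            "input": f"State: {step['state']}\nAction: {step['action']}\n"
                     f"Predict the next state.",
            "target": step["next_state"],
        })
        # (2) reflection on an alternate action, if one was logged or rolled out
        if "alt_action" in step:
            examples.append({
                "task": "self_reflection",
                "input": f"State: {step['state']}\n"
                         f"Chosen action: {step['action']} -> {step['next_state']}\n"
                         f"Alternate action: {step['alt_action']} -> {step['alt_next_state']}\n"
                         f"Explain which action was better and why.",
                "target": step.get("reflection", ""),
            })
    return examples
```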
@chrome1996
Chenghao Yang
1 month
Where is exploration most impactful in LLM reasoning? The initial tokens! They shape a sequence's entire semantic direction, making early exploration crucial. Our new work, Exploratory Annealed Decoding (EAD), is built on this insight. By starting with high temperature and
4
19
93
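The mechanism described in the EAD tweet, sampling early tokens at high temperature and annealing toward a lower temperature later, can be sketched as a decoding loop. The linear schedule and the endpoint values below are my own illustrative choices, and `logits_fn` is a hypothetical stand-in for a model forward pass.

```python
import numpy as np

def annealed_temperature(step, max_steps, t_start=1.5, t_end=0.7):
    """Linearly anneal the sampling temperature from t_start to t_end."""
    frac = min(step / max(max_steps - 1, 1), 1.0)
    return t_start + frac * (t_end - t_start)

def decode_with_annealing(logits_fn, max_steps=256, eos_id=0, rng=None):
    """Sampling where early tokens are drawn at high temperature (more exploration)
    and later tokens at lower temperature (more exploitation)."""
    rng = rng or np.random.default_rng(0)
    tokens = []
    for step in range(max_steps):
        logits = np.asarray(logits_fn(tokens), dtype=np.float64)
        temp = annealed_temperature(step, max_steps)
        scaled = logits / temp
        probs = np.exp(scaled - np.max(scaled))
        probs /= probs.sum()
        next_id = int(rng.choice(len(probs), p=probs))
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Toy usage with fixed fake logits; token id 0 acts as EOS.
if __name__ == "__main__":
    toy_logits = lambda toks: [0.5, 2.0, 1.0, 1.5]
    print(decode_with_annealing(toy_logits, max_steps=20))
```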
@FengzhuoZhang
Fengzhuo Zhang
1 month
Why does Muon outperform Adam, and how? 🚀 Answer: Muon Outperforms Adam in Tail-End Associative Memory Learning. Three key findings:
> Associative memory parameters are the main beneficiaries of Muon, compared to Adam.
> Muon yields more isotropic weights than Adam.
> In
1
32
50
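For context on the comparison above: Muon's defining operation is to orthogonalize the momentum-accumulated gradient of each weight matrix before applying it, which is what pushes weights toward isotropy. The SVD-based orthogonalization below is a slower, conceptually equivalent stand-in for the Newton-Schulz iteration used in practice, and the hyperparameters are illustrative, not recommended settings.

```python
import numpy as np

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One simplified Muon-style update for a 2-D weight matrix: accumulate
    momentum, orthogonalize it (here via SVD; practical implementations use a
    Newton-Schulz iteration), then take a step in that direction."""
    momentum = beta * momentum + grad
    u, _, vt = np.linalg.svd(momentum, full_matrices=False)
    update = u @ vt                       # orthogonalized update direction
    weight = weight - lr * update
    return weight, momentum

# Toy usage on a random matrix with random stand-in gradients.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 3))
    m = np.zeros_like(w)
    for _ in range(5):
        g = rng.standard_normal(w.shape)
        w, m = muon_step(w, g, m)
    print(w)
```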
@ye_chenlu
Chenlu Ye
2 months
PROF 🌀 Right answer, flawed reasoning? 🤔🌀
📄 https://t.co/8kFrxKQbVW
Excited to share our work: PROF, the PRocess cOnsistency Filter! 🚀
Challenge: ORM is blind to flawed logic, and PRM suffers from reward hacking. Our method harmonizes the strengths of PRM & ORM. #LLM #ReinforcementLearning
2
11
37
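The stated idea of harmonizing outcome and process rewards suggests a filter along the following lines: keep a rollout for training only when its outcome reward and its step-level (process) scores agree. This is my reading of the tweet, not PROF's actual criterion; the threshold and the min-over-steps aggregation are assumptions.

```python
def process_consistency_filter(rollouts, step_threshold=0.5):
    """Keep rollouts whose outcome reward and process (step-level) scores agree;
    disagreement suggests either flawed reasoning behind a right answer or an
    unreliable reward signal. `rollouts` is a list of dicts with 'outcome'
    (0/1 correctness) and 'step_scores' (per-step PRM scores)."""
    kept = []
    for r in rollouts:
        min_step = min(r["step_scores"]) if r["step_scores"] else 0.0
        process_ok = min_step >= step_threshold
        outcome_ok = r["outcome"] == 1
        if outcome_ok == process_ok:      # signals agree -> trustworthy sample
            kept.append(r)
    return kept

# Toy usage.
if __name__ == "__main__":
    data = [
        {"outcome": 1, "step_scores": [0.9, 0.8, 0.95]},  # kept: signals agree
        {"outcome": 1, "step_scores": [0.9, 0.2, 0.95]},  # dropped: right answer, weak step
        {"outcome": 0, "step_scores": [0.9, 0.9, 0.9]},   # dropped: signals disagree
        {"outcome": 0, "step_scores": [0.4, 0.3, 0.2]},   # kept: signals agree
    ]
    print(len(process_consistency_filter(data)))
```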
@jaseweston
Jason Weston
3 months
🪜 Introducing: StepWiser 🦉
📝: https://t.co/RXOjaMjHI1
- Reframes stepwise reward modeling as a reasoning task: outputs CoT + judgment.
- Trained by RL using relative outcomes of rollouts.
Results: (1) SOTA performance on ProcessBench! (2) Improves policy at train time. (3)
11
96
482
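The "relative outcomes of rollouts" phrase above can be made concrete with a generic Monte Carlo sketch (not StepWiser's training code): estimate how much each reasoning step changes the chance of eventually reaching a correct answer by rolling out from the prefix before and after the step. `rollout_from` and `is_correct` are hypothetical stand-ins.

```python
def step_rewards_from_rollouts(question, steps, rollout_from, is_correct, k=8):
    """Assign each chain-of-thought step a reward based on relative rollout
    outcomes: the change in empirical success rate when the step is appended
    to the prefix. Positive means the step helped; negative means it hurt."""
    def success_rate(prefix):
        wins = sum(is_correct(question, rollout_from(question, prefix))
                   for _ in range(k))
        return wins / k

    rewards, prefix = [], ""
    rate_before = success_rate(prefix)
    for step in steps:
        prefix = prefix + step
        rate_after = success_rate(prefix)
        rewards.append(rate_after - rate_before)  # relative outcome of this step
        rate_before = rate_after
    return rewards
```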
@KaiyuYang4
Kaiyu Yang
4 months
🚀 Excited to share that the Workshop on Mathematical Reasoning and AI (MATH‑AI) will be at NeurIPS 2025! 📅 Dec 6 or 7 (TBD), 2025 🌴 San Diego, California
8
57
239
@daniel_d_kang
Daniel Kang
5 months
Reinforcement learning enables LLMs to beat humans on programming/math competitions and has driven recent advances (OpenAI's o-series, Anthropic's Claude 4) Will RL enable broad generalization in the same way that pretraining does? Not with current techniques 🧵 1/7
1
8
26
@ExplainMiracles
Jiarui Yao
6 months
We introduce Gradient Variance Minimization (GVM)-RAFT, a principled dynamic sampling strategy that minimizes gradient variance to improve the efficiency of chain-of-thought (CoT) training in LLMs.
– Achieves 2–4× faster convergence than RAFT
– Improves accuracy on math
0
28
89
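A rough illustration of the dynamic-sampling idea described above (not the authors' implementation): allocate the rollout budget across prompts in proportion to an estimate of each prompt's contribution to gradient variance, here proxied by the Bernoulli reward variance p(1-p) of its pass rate p. The pilot-estimate scheme and constants are assumptions.

```python
def allocate_rollout_budget(prompts, estimate_pass_rate, total_budget=1024,
                            min_per_prompt=2):
    """Split `total_budget` rollouts across prompts proportionally to the
    Bernoulli reward variance p*(1-p) of each prompt's estimated pass rate p,
    so uncertain prompts (p near 0.5) get more samples than prompts the model
    always solves or always fails."""
    rates = {p: estimate_pass_rate(p) for p in prompts}            # pilot estimates
    weights = {p: max(r * (1.0 - r), 1e-6) for p, r in rates.items()}
    total_w = sum(weights.values())
    return {p: max(min_per_prompt, round(total_budget * w / total_w))
            for p, w in weights.items()}

# Toy usage with fixed pilot pass rates.
if __name__ == "__main__":
    pilot = {"easy": 0.95, "medium": 0.5, "hard": 0.1}
    print(allocate_rollout_budget(pilot.keys(), lambda p: pilot[p], total_budget=100))
```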
@weixiong_1
Wei Xiong
7 months
Surprised by the small performance gap between RAFT and Reinforce/GRPO. We may need more fine-grained negative signals to better guide learning.🧐
@hendrydong
Hanze Dong
7 months
🤖 What makes GRPO work? Rejection Sampling → Reinforce → GRPO
- RS is underrated
- Key of GRPO: implicitly remove prompts without a correct answer
- Reinforce + Filtering > GRPO (better KL)
💻 https://t.co/PtWHNmLkPS
📄 https://t.co/pYNnJQkJEU
👀 RAFT was invited to ICLR25! Come & chat ☕️
2
8
93
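The "Reinforce + Filtering" recipe in the quoted thread can be sketched as follows (my paraphrase, not the released code): sample a group of rollouts per prompt, drop prompts whose rollouts are all wrong or all right (both give zero learning signal under a group baseline), and apply a plain Reinforce update with a mean-reward baseline on what remains.

```python
def reinforce_with_filtering(batch):
    """`batch` maps prompt -> list of (logprob_sum, reward) rollouts.
    Returns per-rollout advantages for a Reinforce-style loss, after dropping
    prompts whose rollouts are all-correct or all-incorrect (no contrast)."""
    weighted = []
    for prompt, rollouts in batch.items():
        rewards = [r for _, r in rollouts]
        if len(set(rewards)) <= 1:
            continue                       # filtered: no informative signal
        baseline = sum(rewards) / len(rewards)
        for logprob, reward in rollouts:
            advantage = reward - baseline
            # the loss contribution for this rollout would be -advantage * logprob
            weighted.append((prompt, logprob, advantage))
    return weighted

# Toy usage: one informative prompt, one all-wrong prompt that gets filtered.
if __name__ == "__main__":
    batch = {"p1": [(-12.3, 1.0), (-10.1, 0.0)], "p2": [(-9.8, 0.0), (-11.2, 0.0)]}
    print(reinforce_with_filtering(batch))
```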
@ZhanQiusi1
Qiusi Zhan
8 months
Our NAACL 2025 findings paper demonstrates that securing AI agents requires more than off-the-shelf defenses—adaptive attacks continuously evolve to bypass them. If you have any questions or want to discuss more, feel free to reach out!
@daniel_d_kang
Daniel Kang
8 months
AI agents are increasingly popular (e.g., OpenAI's operator) but can be attacked to harm users! We show that even with defenses, AI agents can still be compromised via indirect prompt injections via "adaptive attacks" in our NAACL 2025 findings paper 🧵 and links below
0
4
7
@BowenJin13
Bowen Jin
9 months
🚀 Introducing Search-R1 – the first reproduction of Deepseek-R1 (zero) for training reasoning and search-augmented LLM agents with reinforcement learning! This is a step towards training an open-source OpenAI “Deep
github.com
Search-R1: An Efficient, Scalable RL Training Framework for Reasoning & Search Engine Calling interleaved LLM based on veRL - PeterGriffinJin/Search-R1
44
328
3K
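To make the "search-augmented agent" setup above concrete, here is a hedged sketch of the interleaved generate-search loop such systems typically use: the model emits a search query inside special tags, retrieved passages are appended to the context, and generation continues until an answer tag appears. The exact tag names and the `generate`/`search` helpers are assumptions, not Search-R1's actual interface.

```python
import re

def search_augmented_rollout(question, generate, search, max_turns=4):
    """Interleave LLM generation with search-engine calls. `generate(prompt)`
    returns text that may contain <search>query</search> or <answer>...</answer>;
    `search(query)` returns retrieved passages. Tag names are illustrative."""
    context = f"Question: {question}\n"
    for _ in range(max_turns):
        output = generate(context)
        context += output
        answer = re.search(r"<answer>(.*?)</answer>", output, re.S)
        if answer:
            return answer.group(1).strip(), context
        query = re.search(r"<search>(.*?)</search>", output, re.S)
        if query:
            passages = search(query.group(1).strip())
            context += f"\n<information>{passages}</information>\n"
        else:
            break                           # no tool call and no answer: stop
    return None, context
```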
@weixiong_1
Wei Xiong
9 months
Since our models revise the response only when the self-rewarding tokens are activated, they are more efficient in test-time compute scaling. 🤗 (5/5)
0
0
6
@weixiong_1
Wei Xiong
9 months
RL does not consistently improve RM accuracy; rather, it explores different trade-offs in RM accuracy to maximize the final test accuracy. 🤗 (4/5)
0
1
6
@weixiong_1
Wei Xiong
9 months
When trained with Qwen-2.5-Math-7B-base, our models (1) significantly outperform intrinsic correction, with or without training; (2) further boost the final test accuracy through the additional benefit of self-rewarding correction. 🤗 (3/5)
0
1
6
@weixiong_1
Wei Xiong
9 months
To synthesize long CoT trajectories with self-rewarding & self-correction behaviors from the base model: (1) sequentially prompt the base model to generate the data step by step; (2) filter and keep only the trajectories with the desired patterns. 🤗 (2/5)
0
0
6
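Step (2) of the recipe above, keeping only trajectories with the desired pattern, can be sketched as a simple filter. The trajectory fields and the notion of "desired pattern" used here (the self-evaluation agrees with actual correctness, and a revision after a negative self-evaluation ends up correct) are my reading of the thread, not the paper's exact criteria.

```python
def keep_trajectory(traj):
    """`traj` is a dict with:
      'attempt_correct' - whether the first attempt's answer is right,
      'self_verdict'    - the model's own judgment of that attempt ('correct'/'incorrect'),
      'revised_correct' - whether the revised answer is right (None if no revision).
    Keep only trajectories whose self-rewarding signal is honest and useful."""
    verdict_matches = (traj["self_verdict"] == "correct") == traj["attempt_correct"]
    if not verdict_matches:
        return False                      # self-evaluation disagrees with ground truth
    if traj["self_verdict"] == "incorrect":
        return traj["revised_correct"] is True   # a flagged error must be fixed
    return True                           # correct attempt, correctly self-verified

def filter_trajectories(trajectories):
    """Apply the pattern filter to a synthesized batch."""
    return [t for t in trajectories if keep_trajectory(t)]

# Toy usage.
if __name__ == "__main__":
    batch = [
        {"attempt_correct": True, "self_verdict": "correct", "revised_correct": None},
        {"attempt_correct": False, "self_verdict": "incorrect", "revised_correct": True},
        {"attempt_correct": False, "self_verdict": "correct", "revised_correct": None},
    ]
    print(len(filter_trajectories(batch)))  # first two kept, third dropped
```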