Wei Xiong
@weixiong_1
Followers: 1K · Following: 661 · Media: 16 · Statuses: 179
Statistical learning theory, post-training of LLMs. PhD student @IllinoisCS, prev @Meta @GoogleDeepMind @MSFTResearch @USTC
Joined February 2024
Keep sampling the prompts until you get enough learning signals! A robust adaptive sampling framework for GRPO-like training.
💥Thrilled to share our new work Reinforce-Ada, which fixes signal collapse in GRPO 🥳No more blind oversampling or dead updates. Just sharper gradients, faster convergence, and stronger models. ⚙️ One-line drop-in. Real gains. https://t.co/kJTeVek1S3
https://t.co/7qLywG2KWR
0
0
14
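A minimal sketch of the adaptive-sampling idea in the Reinforce-Ada post above, assuming the loop simply keeps drawing extra responses for a prompt until its group contains both correct and incorrect answers (i.e., a non-zero group-relative advantage) or a budget runs out. `generate` and `is_correct` are hypothetical stand-ins for the real rollout and reward code, not the paper's implementation.

```python
import random

def generate(prompt):
    """Hypothetical rollout: returns one sampled response string."""
    return f"response to {prompt} #{random.randint(0, 9)}"

def is_correct(prompt, response):
    """Hypothetical verifier: returns True if the response is judged correct."""
    return random.random() < 0.3

def adaptive_group(prompt, group_size=8, max_budget=64):
    """Keep sampling until the group has both correct and incorrect
    responses (so the group-relative advantage is non-zero), or the
    budget is exhausted. Returns the responses and their rewards."""
    responses, rewards = [], []
    while len(responses) < max_budget:
        batch = [generate(prompt) for _ in range(group_size)]
        responses += batch
        rewards += [float(is_correct(prompt, r)) for r in batch]
        if 0.0 < sum(rewards) < len(rewards):  # mixed signal: keep this prompt
            return responses, rewards
    return None  # all-correct or all-wrong: no learning signal, drop the prompt

if __name__ == "__main__":
    out = adaptive_group("2 + 2 = ?")
    print("kept" if out else "dropped")
```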
😃 thanks, google!
🎉 We're excited to announce the 2025 Google PhD Fellows! @GoogleOrg is providing over $10 million to support 255 PhD students across 35 countries, fostering the next generation of research talent to strengthen the global scientific landscape. Read more: https://t.co/0Pvuv6hsgP
12
6
101
🌀Agent Learning via Early Experience🌀 📝: https://t.co/VsqQHTTrBN - SFT for agents is sparse; RL on long horizons is hard. We provide new mid-training signals that work: 1) implicit next-state world modeling task, 2) self-reflection on alternate states - Strong improvements over
3
43
190
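A hedged sketch of how the two mid-training signals above could be constructed from logged agent trajectories. The field names, prompt templates, and example data are illustrative assumptions, not the paper's format.

```python
# Illustrative construction of the two mid-training signals from a logged
# agent trajectory. Field names and templates are assumptions.

trajectory = [
    {"state": "search page", "action": "type 'cheap flights'", "next_state": "results list"},
    {"state": "results list", "action": "click first result", "next_state": "booking page"},
]

def world_modeling_examples(traj):
    """(1) Implicit world modeling: predict the next state from state + action."""
    return [
        {"prompt": f"State: {s['state']}\nAction: {s['action']}\nPredict the next state:",
         "target": s["next_state"]}
        for s in traj
    ]

def self_reflection_examples(traj, alternate_actions):
    """(2) Self-reflection: reason about what an alternate action would have led to."""
    return [
        {"prompt": f"State: {s['state']}\nTaken action: {s['action']}\n"
                   f"Alternate action: {alt}\nReflect on how the outcomes would differ:",
         "target": "<model-generated reflection, filtered before training>"}
        for s, alt in zip(traj, alternate_actions)
    ]

print(world_modeling_examples(trajectory)[0])
print(self_reflection_examples(trajectory, ["scroll down", "go back"])[0])
```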
Where is exploration most impactful in LLM reasoning? The initial tokens! They shape a sequence's entire semantic direction, making early exploration crucial. Our new work, Exploratory Annealed Decoding (EAD), is built on this insight. By starting with high temperature and
4
19
93
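A toy illustration of the annealed-decoding idea above: sample the first tokens at high temperature to explore, then anneal the temperature down so later tokens commit to the chosen direction. The linear schedule and the stand-in "model" are assumptions for illustration, not the EAD recipe itself.

```python
import math, random

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def toy_logits(prefix):
    """Stand-in for a language model: fixed pseudo-random logits per step."""
    rng = random.Random(len(prefix))
    return [rng.uniform(-1, 1) for _ in VOCAB]

def sample(logits, temperature):
    weights = [math.exp(l / temperature) for l in logits]
    total = sum(weights)
    return random.choices(VOCAB, weights=[w / total for w in weights])[0]

def annealed_decode(max_tokens=12, t_start=1.5, t_end=0.3):
    """Explore aggressively on the first tokens (high T), then anneal the
    temperature down so later tokens stay on the chosen semantic path."""
    tokens = []
    for i in range(max_tokens):
        frac = i / max(1, max_tokens - 1)
        temperature = t_start + (t_end - t_start) * frac  # linear anneal
        tokens.append(sample(toy_logits(tokens), temperature))
    return " ".join(tokens)

print(annealed_decode())
```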
Why does Muon outperform Adam—and how? 🚀Answer: Muon Outperforms Adam in Tail-End Associative Memory Learning. Three key findings: > Associative memory parameters are the main beneficiaries of Muon, compared to Adam. > Muon yields more isotropic weights than Adam. > In
1
32
50
💥Thrilled to share our new work Reinforce-Ada, which fixes signal collapse in GRPO 🥳No more blind oversampling or dead updates. Just sharper gradients, faster convergence, and stronger models. ⚙️ One-line drop-in. Real gains. https://t.co/kJTeVek1S3
https://t.co/7qLywG2KWR
9
24
181
PROF🌀Right answer, flawed reason?🤔🌀 📄 https://t.co/8kFrxKQbVW Excited to share our work: PROF (PRocess cOnsistency Filter)! 🚀 Challenge: ORM is blind to flawed logic, and PRM suffers from reward hacking. Our method harmonizes the strengths of PRM & ORM. #LLM #ReinforcementLearning
2
11
37
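A heavily hedged guess at what a process-consistency filter could look like in code: keep a sampled response for training only when its outcome reward (final-answer correctness) and its process reward (step-level score) agree, so right-answer/flawed-reasoning and reward-hacked samples get dropped. The scores, threshold, and keep/drop rule are illustrative assumptions, not the paper's criterion.

```python
# Hedged sketch: keep a sample only when outcome reward and process reward agree.

samples = [
    {"answer_correct": True,  "process_score": 0.92},  # right answer, sound steps  -> keep
    {"answer_correct": True,  "process_score": 0.31},  # right answer, flawed steps -> drop
    {"answer_correct": False, "process_score": 0.88},  # wrong answer, "good" steps -> drop (hacked PRM)
    {"answer_correct": False, "process_score": 0.20},  # wrong answer, bad steps    -> keep as negative
]

def consistent(sample, threshold=0.5):
    looks_sound = sample["process_score"] >= threshold
    return sample["answer_correct"] == looks_sound

filtered = [s for s in samples if consistent(s)]
print(f"kept {len(filtered)} of {len(samples)} samples")
```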
🪜Introducing: StepWiser🦉 📝: https://t.co/RXOjaMjHI1 - Reframes stepwise reward modeling as a reasoning task: outputs CoT + judgment. - Trained by RL using relative outcomes of rollouts. Results: (1) SOTA performance on ProcessBench! (2) Improves policy at train time. (3)
11
96
482
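One common way to turn "relative outcomes of rollouts" into step-level supervision is to compare the Monte Carlo success rate of continuations launched before vs. after each step. The sketch below illustrates that idea with a stubbed rollout function; it is not necessarily StepWiser's exact scheme.

```python
import random

def rollout_success_rate(prefix_steps, n=16):
    """Stub: fraction of n rollouts from this partial solution that reach a
    correct final answer. In practice this would call the policy model."""
    random.seed(len(prefix_steps))
    return random.random()

def label_steps(solution_steps):
    """Label each step by whether continuing *after* it succeeds at least as
    often as continuing from *before* it (relative rollout outcomes)."""
    labels = []
    prev = rollout_success_rate([])
    for i in range(1, len(solution_steps) + 1):
        cur = rollout_success_rate(solution_steps[:i])
        labels.append("good" if cur >= prev else "bad")
        prev = cur
    return labels

steps = ["Let x be the unknown.", "Then 2x = 10.", "So x = 4."]
print(list(zip(steps, label_steps(steps))))
```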
🚀 Excited to share that the Workshop on Mathematical Reasoning and AI (MATH‑AI) will be at NeurIPS 2025! 📅 Dec 6 or 7 (TBD), 2025 🌴 San Diego, California
8
57
239
Reinforcement learning enables LLMs to beat humans on programming/math competitions and has driven recent advances (OpenAI's o-series, Anthropic's Claude 4) Will RL enable broad generalization in the same way that pretraining does? Not with current techniques 🧵 1/7
1
8
26
We introduce Gradient Variance Minimization (GVM)-RAFT, a principled dynamic sampling strategy that minimizes gradient variance to improve the efficiency of chain-of-thought (CoT) training in LLMs. – Achieves 2–4× faster convergence than RAFT – Improves accuracy on math
0
28
89
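A minimal sketch of the dynamic-sampling intuition behind GVM-RAFT: spend more rollouts on prompts whose gradient estimate is noisiest. Here per-prompt variance is proxied by sqrt(p(1-p)) from a pilot estimate of each prompt's success rate, and the budget is split proportionally; both the proxy and the allocation rule are assumptions for illustration, not the paper's exact estimator.

```python
import math

def allocate_budget(success_rates, total_budget=256, min_per_prompt=2):
    """Give each prompt a share of the sampling budget proportional to a
    simple variance proxy sqrt(p * (1 - p)) estimated from a pilot pass.
    Prompts the model always solves or always fails get only the minimum.
    Note: shares may not sum exactly to total_budget after rounding."""
    proxies = [math.sqrt(p * (1.0 - p)) for p in success_rates]
    total = sum(proxies) or 1.0
    return [
        max(min_per_prompt, round(total_budget * w / total))
        for w in proxies
    ]

pilot_success = [0.05, 0.5, 0.9, 1.0]   # estimated per-prompt accuracy
print(allocate_budget(pilot_success))    # noisiest prompts get the most samples
```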
Surprised by the small performance gap between RAFT and Reinforce/GRPO. We may need more fine-grained negative signals to better guide learning.🧐
🤖What makes GRPO work? Rejection Sampling→Reinforce→GRPO - RS is underrated - Key of GRPO: it implicitly removes prompts without a correct answer - Reinforce+Filtering > GRPO (better KL) 💻 https://t.co/PtWHNmLkPS 📄 https://t.co/pYNnJQkJEU 👀RAFT was invited to ICLR25! Come & Chat☕️
2
8
93
🤖What makes GRPO work? Rejection Sampling→Reinforce→GRPO - RS is underrated - Key of GRPO: it implicitly removes prompts without a correct answer - Reinforce+Filtering > GRPO (better KL) 💻 https://t.co/PtWHNmLkPS 📄 https://t.co/pYNnJQkJEU 👀RAFT was invited to ICLR25! Come & Chat☕️
arxiv.org
Reinforcement learning (RL) has become a prevailing approach for fine-tuning large language models (LLMs) on complex reasoning tasks. Among recent methods, GRPO stands out for its empirical...
9
101
466
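The "remove prompts without a correct answer" observation above can be implemented as a simple filter in front of a Reinforce-style update: drop prompt groups whose sampled responses are all wrong (and, optionally, all correct), since they carry no contrastive signal. A minimal sketch with 0/1 rewards; the group format is an assumption.

```python
def filter_prompt_groups(groups, drop_all_correct=True):
    """groups: list of (prompt, responses, rewards) with 0/1 rewards.
    Keep only groups that carry a learning signal for a Reinforce update."""
    kept = []
    for prompt, responses, rewards in groups:
        if sum(rewards) == 0:                        # no correct answer: drop
            continue
        if drop_all_correct and sum(rewards) == len(rewards):
            continue                                 # nothing to contrast: drop
        kept.append((prompt, responses, rewards))
    return kept

groups = [
    ("q1", ["a", "b"], [0, 0]),   # all wrong  -> dropped
    ("q2", ["a", "b"], [1, 0]),   # mixed      -> kept
    ("q3", ["a", "b"], [1, 1]),   # all right  -> dropped (optional)
]
print([g[0] for g in filter_prompt_groups(groups)])  # ['q2']
```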
Our NAACL 2025 findings paper demonstrates that securing AI agents requires more than off-the-shelf defenses—adaptive attacks continuously evolve to bypass them. If you have any questions or want to discuss more, feel free to reach out!
AI agents are increasingly popular (e.g., OpenAI's operator) but can be attacked to harm users! We show that even with defenses, AI agents can still be compromised via indirect prompt injections via "adaptive attacks" in our NAACL 2025 findings paper 🧵 and links below
0
4
7
🚀 Introducing Search-R1 – the first reproduction of Deepseek-R1 (zero) for training reasoning and search-augmented LLM agents with reinforcement learning! This is a step towards training an open-source OpenAI “Deep
github.com
Search-R1: An Efficient, Scalable RL Training Framework for Reasoning & Search Engine Calling interleaved LLM based on veRL - PeterGriffinJin/Search-R1
44
328
3K
Since our models revise the response only when the self-rewarding tokens are activated, they are more efficient in test-time compute scaling. 🤗(5/5)
0
0
6
RL does not consistently improve RM accuracy; rather, it explores different trade-offs in RM accuracy to maximize the final test accuracy. 🤗(4/5)
0
1
6
When trained with Qwen-2.5-Math-7B-base, our models (1) significantly outperform intrinsic correction, with or without training; (2) further boost the final test accuracy through the additional benefit of self-rewarding correction. 🤗(3/5)
0
1
6
To synthesize long CoT trajectories with self-rewarding & self-correcting behaviors from the base model: (1) sequentially prompt the base model to generate the data step by step; (2) filter and keep only the trajectories with the desired patterns. 🤗(2/5)
0
0
6
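A toy version of the two-stage recipe in (2/5) above: sequentially prompt a (stubbed) base model for an attempt, a self-evaluation, and, if needed, a revision, then keep only trajectories whose pattern matches reality. The prompts, the [VERIFY] token, and the stub model are assumptions for illustration, not the paper's templates.

```python
import random

def base_model(prompt):
    """Stub for the base model; returns a plausible continuation."""
    if "evaluate" in prompt:
        return random.choice(["[VERIFY] correct", "[VERIFY] wrong"])
    return random.choice(["answer: 42", "answer: 41"])

def synthesize_trajectory(question, gold="answer: 42"):
    # (1) Sequentially prompt: attempt -> self-reward token -> optional revision.
    attempt = base_model(question)
    verdict = base_model(f"{question}\n{attempt}\nNow evaluate your answer.")
    traj = [attempt, verdict]
    if "wrong" in verdict:
        traj.append(base_model(f"{question}\nRevise your answer."))
    # (2) Filter: keep only trajectories whose pattern matches reality
    #     (correct + verified correct, or wrong + verified wrong + fixed).
    first_ok, says_ok = attempt == gold, "correct" in verdict
    final_ok = traj[-1] == gold
    keep = (first_ok and says_ok) or (not first_ok and not says_ok and final_ok)
    return traj if keep else None

kept = [t for t in (synthesize_trajectory("What is 6*7?") for _ in range(20)) if t]
print(f"kept {len(kept)} / 20 synthesized trajectories")
```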