Ziniu Li
@ZiniuLi
Followers: 852
Following: 954
Media: 47
Statuses: 185
Ph.D. student @ CUHK, Shenzhen. Intern @Bytedance (Seed-Horizon). Working on RL and LLMs. Prev: Intern @Tencent (AI Lab)
Shenzhen, China
Joined August 2017
Thrilled to share our new work, "RLoop: A Self-Improving Framework for Reinforcement Learning"! https://t.co/E2IUxgYR3t
2
7
26
Excited to share our research on scaling looped language models! We explore how next-generation foundation models can scale latent reasoning and more efficiently leverage parameters for knowledge manipulation!
Thrilled to release our new paper: "Scaling Latent Reasoning via Looped Language Models." TL;DR: We scale looped language models to 2.6 billion parameters and pretrain them on more than 7 trillion tokens. The resulting model is on par with SOTA language models of 2-3x its size.
5
3
67
Thanks to @YiyouSun for adding our work on Knapsack RL to this excellent collection. I strongly believe that focusing on the hard-tier problems, where traditional RLVR pipelines fail, is crucial for advancing our understanding and methodologies. This living repository is a vital
This is a call to tackle the hard-tier problems where RLVR pipelines stall with pass@k=0 (no reward, no gradient). To push further, we need to train on hard subsets and leverage signals from tough data; that's where discovery happens. https://t.co/W0XJYWDNHo The grand
0
0
7
[8/n] If you found this work interesting, please upvote our paper! Paper:
huggingface.co
1
2
19
[7/n] The budget distribution in action. Our knapsack allocation creates a smart distribution: most tasks get 1-10 rollouts (they don't need more!), while hard tasks get up to 93 rollouts. The key insight: this is a computational free lunch! Same total compute as uniform
1
2
11
[6/n] Why it works: better gradient efficiency. Knapsack-GRPO maintains a 70-80% effective gradient ratio vs. ~50% for uniform GRPO throughout training. That's 20-40% more non-zero gradients! What gets solved:
- Hard tasks: 549 → 306 (GRPO wastes compute here) -
1
1
11
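A minimal sketch of the "effective gradient ratio" mentioned in [6/n] above, assuming it means the fraction of prompt groups whose rollouts do not all receive the same reward (the paper's exact definition may differ):

```python
def effective_gradient_ratio(reward_groups):
    """Fraction of prompt groups with non-identical rollout rewards,
    i.e. groups that contribute a non-zero GRPO gradient.
    (Assumed definition, for illustration only.)"""
    useful = sum(1 for g in reward_groups if len(set(g)) > 1)
    return useful / len(reward_groups)

# Hypothetical batch: two saturated/failed groups contribute nothing,
# two mixed groups carry all of the learning signal.
batch = [[1, 1, 1, 1], [0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 1, 0]]
print(effective_gradient_ratio(batch))  # -> 0.5
```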
[5/n] Results: consistent wins on math benchmarks. 10,000+ GPU hours across 4 models × 6 math tasks, Knapsack-GRPO vs. uniform GRPO:
- +2-4 average points across all settings
- Up to +9 points on individual benchmarks
- 23/24 wins
Key insight: better exploration
1
1
11
[4/n] Our solution: knapsack-based exploration. We formalize adaptive budget allocation as a constrained optimization: max Σ_i Value(N_i, p_i) s.t. Σ_i N_i = N_total. The knapsack analogy: each prompt with an exploration budget = an item; rollouts N_i = weight (cost); Learning
1
1
11
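A hedged sketch of what a knapsack-style allocator could look like for the [4/n] formulation above. Both the value function used here, the probability 1 - p^N - (1-p)^N of getting contrastive outcomes, and the greedy marginal-gain solver are assumptions standing in for the paper's Value(N_i, p_i) and its actual optimizer:

```python
import heapq

def contrast_prob(n: int, p: float) -> float:
    """P(mixed outcomes among n rollouts with pass rate p) = 1 - p^n - (1-p)^n.
    Assumed stand-in for the paper's Value(N_i, p_i)."""
    return 1.0 - p ** n - (1.0 - p) ** n

def allocate_budget(pass_rates, total_rollouts, min_per_prompt=1):
    """Greedily hand out a shared rollout budget: each extra rollout goes
    to the prompt whose value increases the most."""
    n = [min_per_prompt] * len(pass_rates)
    # Max-heap keyed on the gain from granting one more rollout to prompt i.
    heap = [(-(contrast_prob(n[i] + 1, p) - contrast_prob(n[i], p)), i)
            for i, p in enumerate(pass_rates)]
    heapq.heapify(heap)
    for _ in range(total_rollouts - sum(n)):
        _, i = heapq.heappop(heap)
        n[i] += 1
        p = pass_rates[i]
        gain = contrast_prob(n[i] + 1, p) - contrast_prob(n[i], p)
        heapq.heappush(heap, (-gain, i))
    return n

# Near-saturated prompts (p = 0.02, p = 0.9) soak up most of the shared
# budget, while the p = 0.5 prompt needs only a handful of rollouts.
print(allocate_budget([0.02, 0.5, 0.9], total_rollouts=32))
```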
[3/n] The exploration-computation dilemma. Hard tasks need many rollouts to explore successfully. But the naive solution of scaling up N uniformly is computationally prohibitive! Example: to match our adaptive budget (93 rollouts for hard tasks), uniform allocation would need:
1
1
11
[2/n] Theoretical understanding: we prove how many trials are needed to get non-zero gradients (Theorem 1). Both easy (p ≈ 1) and hard (p ≈ 0) tasks need many rollouts to get contrastive outcomes for GRPO. For extremely hard tasks such as p = 0.01, we would require a large value of
1
1
12
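To make the [2/n] point concrete, a quick back-of-the-envelope script (illustrative only, not the exact bound stated in Theorem 1) that searches for the smallest N such that N rollouts produce contrastive outcomes with probability at least 90%:

```python
def rollouts_for_contrast(p: float, target: float = 0.9) -> int:
    """Smallest N with P(mixed outcomes) = 1 - p^N - (1-p)^N >= target."""
    n = 2
    while 1.0 - p ** n - (1.0 - p) ** n < target:
        n += 1
    return n

# Pass rates near 0 or 1 blow up the required budget: p = 0.5 needs ~5
# rollouts, while p = 0.01 or p = 0.99 need a couple of hundred.
for p in (0.5, 0.1, 0.01, 0.99):
    print(f"p = {p:>4}: N >= {rollouts_for_contrast(p)}")
```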
[1/n] Real issue: wasted gradients in GRPO. In GRPO, if all rollouts have the same reward:
- Easy tasks → all positive → 0 gradient
- Hard tasks → all negative → 0 gradient
The policy gets no learning signal. Empirically (blue curve), effective gradients drop
1
1
23
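As a concrete illustration of the zero-gradient failure mode described in [1/n], a minimal sketch of the group-normalized advantage used in GRPO-style training: when every rollout in a group gets the same reward, all advantages collapse to zero and the prompt contributes no gradient.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: each rollout's reward is normalized
    by the mean and std of its own group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(grpo_advantages([1, 1, 1, 1]))  # easy prompt, all correct -> all zeros
print(grpo_advantages([0, 0, 0, 0]))  # hard prompt, all wrong   -> all zeros
print(grpo_advantages([0, 1, 0, 1]))  # mixed outcomes           -> non-zero signal
```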
Excited to share our work at Bytedance Seed! Knapsack RL: Unlocking Exploration of LLMs via Budget Allocation. Exploration in LLM training is crucial but expensive. Uniform rollout allocation is wasteful:
- Easy tasks → always solved → 0 gradient
- Hard tasks →
13
102
642
[1/n] Introducing TreePO, a new RL framework for LLMs! It slashes sampling costs while boosting reasoning capabilities. Daily Paper:
huggingface.co
2
12
58
Is text-only information enough for LLM/VLM Web Agents? Clearly not. The modern web is a rich tapestry of text, images, and videos. To truly assist us, agents need to understand it all. That's why we built MM-BrowseComp. We're introducing MM-BrowseComp, a new
1
31
86
We built FutureX, the world's first live benchmark for real future prediction: politics, economy, culture, sports, etc. Among 23 AI agents, #Grok4 ranked #1. Elon didn't lie. @elonmusk your model sees further. Leaderboard: https://t.co/fwck0NROHZ
229
205
1K
Our #ACL2025 work INTP unveils how preference alignment improves diverse TTS models (AR, flow-matching, and masked generative models)! Unlock the secrets:
- Customized post-training
- Human-guided unlearning
- Heterogeneous preference pairs to avoid reward hacking
3
5
18
Amazing work by @RidgerZhu, with more resources to investigate the mechanism behind hybrid linear attention. Resources: Paper: https://t.co/QEKocLZBEE Hugging Face checkpoint link:
huggingface.co
Hybrid architectures mix linear & full attention in LLMs. But which linear attention is best? This choice has been mostly guesswork. In our new work, we stop guessing. We trained and open-sourced 72 models (340M & 1.3B) to dissect what truly makes a hybrid model tick.
2
4
12
Thrilled to announce that our paper "SCRIT: Self-Evolving LLM Critique without Human or Stronger Models" was accepted to #COLM2025! We enable LLMs to self-improve critique abilities: zero human annotations, zero stronger models needed! Looking forward to meeting
Critique abilities are key for scaling LLMs, but current open-source models fall short. We introduce SCRIT: a framework with scalable oversight that enables LLMs to self-improve their critique skills. We've built a pipeline to generate high-quality synthetic critique data
1
3
8
We're excited to share our new paper "CoRT: Code-integrated Reasoning within Thinking"! A post-training framework that teaches Large Reasoning Models (LRMs) to better leverage Code Interpreters for enhanced mathematical reasoning. Key highlights: Strategic hint
1
3
23