Ziniu Li
@ZiniuLi
Followers: 852
Following: 954
Media: 47
Statuses: 185
Ph.D. student @ CUHK, Shenzhen. Intern @Bytedance (Seed-Horizon). Working on RL and LLMs. Prev: Intern @Tencent (AI Lab)
Shenzhen, China
Joined August 2017
Thrilled to share our new work, "RLoop: A Self-Improving Framework for Reinforcement Learning"! https://t.co/E2IUxgYR3t
2
7
26
Excited to share our research on scaling looped language models! We explore how next-generation foundation models can scale latent reasoning and more efficiently leverage parameters for knowledge manipulation!
Thrilled to release our new paper: "Scaling Latent Reasoning via Looped Language Models." TL;DR: We scale looped language models to 2.6 billion parameters and pretrain them on more than 7 trillion tokens. The resulting model is on par with SOTA language models of 2-3x its size.
5
3
67
Thanks to @YiyouSun for adding our work on Knapsack RL to this excellent collection. I strongly believe that focusing on the hard-tier problems, where traditional RLVR pipelines fail, is crucial for advancing our understanding and methodologies. This living repository is a vital
This is a call to tackle the hard-tier problems where RLVR pipelines stall with pass@k=0 (no reward, no gradient). To push further, we need to train on hard subsets and leverage signals from tough data; that's where discovery happens. https://t.co/W0XJYWDNHo The grand
0
0
7
[8/n] If you found this work interesting, please upvote our paper! Paper:
huggingface.co
1
2
19
[7/n] The budget distribution in action. Our knapsack allocation creates a smart distribution: most tasks get 1-10 rollouts (they don't need more!), while hard tasks get up to 93 rollouts. The key insight: this is a computational free lunch! Same total compute as uniform
1
2
11
[6/n] Why it works: better gradient efficiency. Knapsack-GRPO maintains a 70-80% effective gradient ratio vs. ~50% for uniform GRPO throughout training. That's 20-40% more non-zero gradients! What gets solved:
- Hard tasks: 549 → 306 (GRPO wastes compute here) -
1
1
11
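A minimal sketch of the "effective gradient ratio" mentioned in [6/n] above, assuming it means the fraction of prompt groups whose rollouts do not all receive the same reward (the paper's exact definition may differ):

```python
def effective_gradient_ratio(reward_groups):
    """Fraction of prompt groups with non-identical rollout rewards,
    i.e. groups that contribute a non-zero GRPO gradient.
    (Assumed definition, for illustration only.)"""
    useful = sum(1 for g in reward_groups if len(set(g)) > 1)
    return useful / len(reward_groups)

# Hypothetical batch: two saturated/failed groups contribute nothing,
# two mixed groups carry all of the learning signal.
batch = [[1, 1, 1, 1], [0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 1, 0]]
print(effective_gradient_ratio(batch))  # -> 0.5
```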
[5/n] Results: consistent wins on math benchmarks. 10,000+ GPU hours across 4 models × 6 math tasks, Knapsack-GRPO vs. uniform GRPO:
- +2-4 average points across all settings
- Up to +9 points on individual benchmarks
- 23/24 wins
Key insight: better exploration
1
1
11
[4/n] Our solution: knapsack-based exploration. We formalize adaptive budget allocation as a constrained optimization: max Σ_i Value(N_i, p_i) s.t. Σ_i N_i = N_total. The knapsack analogy: each prompt with an exploration budget = an item; rollouts N_i = weight (cost); Learning
1
1
11
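A hedged sketch of what a knapsack-style allocator could look like for the [4/n] formulation above. Both the value function used here, the probability 1 - p^N - (1-p)^N of getting contrastive outcomes, and the greedy marginal-gain solver are assumptions standing in for the paper's Value(N_i, p_i) and its actual optimizer:

```python
import heapq

def contrast_prob(n: int, p: float) -> float:
    """P(mixed outcomes among n rollouts with pass rate p) = 1 - p^n - (1-p)^n.
    Assumed stand-in for the paper's Value(N_i, p_i)."""
    return 1.0 - p ** n - (1.0 - p) ** n

def allocate_budget(pass_rates, total_rollouts, min_per_prompt=1):
    """Greedily hand out a shared rollout budget: each extra rollout goes
    to the prompt whose value increases the most."""
    n = [min_per_prompt] * len(pass_rates)
    # Max-heap keyed on the gain from granting one more rollout to prompt i.
    heap = [(-(contrast_prob(n[i] + 1, p) - contrast_prob(n[i], p)), i)
            for i, p in enumerate(pass_rates)]
    heapq.heapify(heap)
    for _ in range(total_rollouts - sum(n)):
        _, i = heapq.heappop(heap)
        n[i] += 1
        p = pass_rates[i]
        gain = contrast_prob(n[i] + 1, p) - contrast_prob(n[i], p)
        heapq.heappush(heap, (-gain, i))
    return n

# Near-saturated prompts (p = 0.02, p = 0.9) soak up most of the shared
# budget, while the p = 0.5 prompt needs only a handful of rollouts.
print(allocate_budget([0.02, 0.5, 0.9], total_rollouts=32))
```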
[3/n] The exploration-computation dilemma. Hard tasks need many rollouts to explore successfully. But the naive solution of scaling up N uniformly is computationally prohibitive! Example: to match our adaptive budget (93 rollouts for hard tasks), uniform allocation would need:
1
1
11
[2/n] Theoretical understanding: we prove how many trials are needed to get non-zero gradients (Theorem 1). Both easy (p ≈ 1) and hard (p ≈ 0) tasks need many rollouts to get contrastive outcomes for GRPO. For extremely hard tasks such as p = 0.01, we would require a large value of
1
1
12
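To make the [2/n] point concrete, a quick back-of-the-envelope script (illustrative only, not the exact bound stated in Theorem 1) that searches for the smallest N such that N rollouts produce contrastive outcomes with probability at least 90%:

```python
def rollouts_for_contrast(p: float, target: float = 0.9) -> int:
    """Smallest N with P(mixed outcomes) = 1 - p^N - (1-p)^N >= target."""
    n = 2
    while 1.0 - p ** n - (1.0 - p) ** n < target:
        n += 1
    return n

# Pass rates near 0 or 1 blow up the required budget: p = 0.5 needs ~5
# rollouts, while p = 0.01 or p = 0.99 need a couple of hundred.
for p in (0.5, 0.1, 0.01, 0.99):
    print(f"p = {p:>4}: N >= {rollouts_for_contrast(p)}")
```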
[1/n] Real issue: wasted gradients in GRPO. In GRPO, if all rollouts have the same reward:
- Easy tasks → all positive → 0 gradient
- Hard tasks → all negative → 0 gradient
The policy gets no learning signal. Empirically (blue curve), effective gradients drop
1
1
23
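As a concrete illustration of the zero-gradient failure mode described in [1/n], a minimal sketch of the group-normalized advantage used in GRPO-style training: when every rollout in a group gets the same reward, all advantages collapse to zero and the prompt contributes no gradient.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: each rollout's reward is normalized
    by the mean and std of its own group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(grpo_advantages([1, 1, 1, 1]))  # easy prompt, all correct -> all zeros
print(grpo_advantages([0, 0, 0, 0]))  # hard prompt, all wrong   -> all zeros
print(grpo_advantages([0, 1, 0, 1]))  # mixed outcomes           -> non-zero signal
```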
Excited to share our work at Bytedance Seed! Knapsack RL: Unlocking Exploration of LLMs via Budget Allocation. Exploration in LLM training is crucial but expensive. Uniform rollout allocation is wasteful:
- Easy tasks → always solved → 0 gradient
- Hard tasks →
13
102
642
[1/n] Introducing TreePO, a new RL framework for LLMs! It slashes sampling costs while boosting reasoning capabilities. Daily Paper:
huggingface.co
2
12
58
Is text-only information enough for LLM/VLM Web Agents? Clearly not. The modern web is a rich tapestry of text, images, and videos. To truly assist us, agents need to understand it all. That's why we built MM-BrowseComp. We're introducing MM-BrowseComp, a new
1
31
86
We built FutureX, the world's first live benchmark for real future prediction: politics, economy, culture, sports, etc. Among 23 AI agents, #Grok4 ranked #1. Elon didn't lie. @elonmusk your model sees further. Leaderboard: https://t.co/fwck0NROHZ
229
205
1K
Our #ACL2025 work INTP unveils how preference alignment improves diverse TTS models (AR, flow-matching, and masked generative models)! Unlock the secrets:
- Customized post-training
- Human-guided unlearning
- Heterogeneous preference pairs to avoid reward hacking
3
5
18
Amazing work by @RidgerZhu, with more resources to investigate the mechanism behind hybrid linear attention. Resources: Paper: https://t.co/QEKocLZBEE Hugging Face checkpoint link:
huggingface.co
Hybrid architectures mix linear & full attention in LLMs. But which linear attention is best? This choice has been mostly guesswork. In our new work, we stop guessing. We trained and open-sourced 72 models (340M & 1.3B) to dissect what truly makes a hybrid model tick.
2
4
12
Thrilled to announce that our paper "SCRIT: Self-Evolving LLM Critique without Human or Stronger Models" was accepted to #COLM2025! We enable LLMs to self-improve critique abilities: zero human annotations, zero stronger models needed! Looking forward to meeting
Critique abilities are key for scaling LLMs, but current open-source models fall short. We introduce SCRIT: a framework with scalable oversight that enables LLMs to self-improve their critique skills. We've built a pipeline to generate high-quality synthetic critique data
1
3
8
We're excited to share our new paper "CoRT: Code-integrated Reasoning within Thinking"! A post-training framework that teaches Large Reasoning Models (LRMs) to better leverage Code Interpreters for enhanced mathematical reasoning. Key highlights: Strategic hint
1
3
23