Ziniu Li

@ZiniuLi

Followers
852
Following
954
Media
47
Statuses
185

Ph.D. student @ CUHK, Shenzhen. Intern @Bytedance (Seed-Horizon). Working on RL and LLMs. Prev: Intern @Tencent (AI Lab).

Shenzhen, China
Joined August 2017
@zhiyuan_nlper
zeng zhiyuan
5 days
πŸš€ Thrilled to share our new work, "RLoop: A Self-Improving Framework for Reinforcement Learning"! https://t.co/E2IUxgYR3t
2
7
26
@ZiniuLi
Ziniu Li
14 days
Excited to share our research on scaling looped language models! We explore how next-generation foundation models can scale latent reasoning and more efficiently leverage parameters for knowledge manipulation!
@RidgerZhu
Rui-Jie (Ridger) Zhu
14 days
Thrilled to release our new paper: “Scaling Latent Reasoning via Looped Language Models.” TL;DR: we scale looped language models to 2.6 billion parameters, pretrained on more than 7 trillion tokens. The resulting model is on par with SOTA language models 2 to 3x its size.
5
3
67
@ZiniuLi
Ziniu Li
1 month
Thanks to @YiyouSun for adding our work on Knapsack RL to this excellent collection. I strongly believe that focusing on the hard-tier problems, where traditional RLVR pipelines fail, is crucial for advancing our understanding and methodologies. This living repository is a vital
@YiyouSun
Yiyou Sun
1 month
πŸš€ This is a call to tackle the hard-tier problems where RLVR pipelines stall with pass@k=0 (no reward, no gradient). To push further, we need to train on hard subsets and leverage signals from tough data β€” that’s where discovery happens. πŸ‘‰ https://t.co/W0XJYWDNHo The grand
0
0
7
@ZiniuLi
Ziniu Li
1 month
Thanks to AK for reposting our work! It's exciting to see how the knapsack formulation reshapes the view of system efficiency to unlock exploration and scale RL.
@_akhaliq
AK
1 month
Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation
0
0
4
@ZiniuLi
Ziniu Li
1 month
[8/n] If you found this work interesting, please upvote our paper! πŸ™ πŸ“„ Paper:
huggingface.co
1
2
19
@ZiniuLi
Ziniu Li
1 month
[7/n] The budget distribution in action 📊
Our knapsack allocation creates a smart distribution:
- Most tasks: 1-10 rollouts (they don't need more!)
- Hard tasks: up to 93 rollouts 🔥
💡 The key insight: this is a computational free lunch! → Same total compute as uniform
1
2
11
@ZiniuLi
Ziniu Li
1 month
[6/n] Why it works: better gradient efficiency 📈
Knapsack-GRPO maintains a 70-80% effective gradient ratio vs. ~50% for uniform GRPO throughout training. That's +20-40% more non-zero gradients! 💡
What gets solved:
- Hard tasks: 549 → 306 ✅ (GRPO wastes compute here)
-
1
1
11
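The "effective gradient ratio" in the [6/n] tweet above is, on one natural reading, the fraction of prompts in a batch whose rollout group is not all-correct or all-wrong. A minimal sketch of that reading follows; the function name and the toy batch are illustrative assumptions, not the paper's code.

```python
def effective_gradient_ratio(reward_groups):
    """Fraction of prompts whose rollout rewards are not all identical,
    i.e. prompts that still contribute a non-zero GRPO gradient."""
    effective = [len(set(rewards)) > 1 for rewards in reward_groups]
    return sum(effective) / len(effective)

# Toy batch: 4 prompts, each with 4 rollout rewards (1 = solved, 0 = failed).
batch = [
    [1, 1, 1, 1],   # easy prompt, all solved  -> no gradient
    [0, 0, 0, 0],   # hard prompt, all failed  -> no gradient
    [1, 0, 1, 0],   # mixed outcomes           -> gradient
    [0, 0, 1, 0],   # mixed outcomes           -> gradient
]
print(effective_gradient_ratio(batch))   # 0.5
```

Under this reading, the 70-80% vs. ~50% numbers in the tweet say that knapsack allocation leaves far fewer prompts stuck in the all-same-reward case.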
@ZiniuLi
Ziniu Li
1 month
[5/n] Results: consistent wins on math benchmarks 📈
10,000+ GPU hours across 4 models × 6 math tasks: Knapsack-GRPO vs. uniform GRPO:
✅ +2-4 avg points across all settings
✅ Up to +9 points on individual benchmarks
✅ 23/24 wins
💡 Key insight: better exploration
1
1
11
@ZiniuLi
Ziniu Li
1 month
[4/n] Our solution: knapsack-based exploration 🎒
We formalize adaptive budget allocation as a constrained optimization:
max Σ Value(N_i, p_i)   s.t.   Σ N_i = N_total
💡 The knapsack analogy:
- Each prompt with an exploration budget = an item
- Rollouts N_i = weight (cost)
- Learning
1
1
11
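The constrained objective in the [4/n] tweet above can be solved exactly for small batches with a knapsack-style dynamic program. The sketch below is only an illustration under assumptions: the value function (probability that a prompt's rollout group is contrastive) and the toy success rates and budget are stand-ins, not the paper's actual value model or numbers.

```python
# Illustrative value function: probability that n rollouts on a prompt with
# per-rollout success rate p produce a contrastive (non-zero-gradient) group.
def value(n, p):
    return 1.0 - p**n - (1.0 - p)**n if n > 0 else 0.0

def allocate(success_rates, total_budget):
    """Exact integer DP for  max sum_i value(N_i, p_i)  s.t.  sum_i N_i = total_budget."""
    num, neg = len(success_rates), float("-inf")
    best = [[neg] * (total_budget + 1) for _ in range(num + 1)]
    pick = [[0] * (total_budget + 1) for _ in range(num + 1)]
    best[0][0] = 0.0
    for i, p in enumerate(success_rates, start=1):
        for b in range(total_budget + 1):
            for n in range(b + 1):                      # rollouts given to prompt i
                prev = best[i - 1][b - n]
                if prev > neg and prev + value(n, p) > best[i][b]:
                    best[i][b], pick[i][b] = prev + value(n, p), n
    alloc, b = [0] * num, total_budget                  # backtrack the optimal split
    for i in range(num, 0, -1):
        alloc[i - 1] = pick[i][b]
        b -= pick[i][b]
    return alloc

# Toy batch: two near-solved prompts, one mid-difficulty prompt, one very hard prompt.
# A uniform split of 128 rollouts would give each prompt 32.
print(allocate([0.95, 0.90, 0.50, 0.02], total_budget=128))
```

With this toy value function, the very hard prompt (p = 0.02) ends up with far more than the uniform 32 rollouts while the mid-difficulty prompt keeps only a handful, which is the qualitative behaviour the thread describes.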
@ZiniuLi
Ziniu Li
1 month
[3/n] The exploration-computation dilemma ⚖️
Hard tasks need many rollouts to explore successfully. But the naive solution of scaling up N uniformly? ❌ Computationally prohibitive!
📊 Example: to match our adaptive budget (93 rollouts for hard tasks), uniform allocation would need:
1
1
11
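To make the compute comparison in the [3/n] tweet above concrete, a back-of-the-envelope calculation; the batch size and uniform budget below are made-up illustrative numbers, and only the 93-rollout figure comes from the thread.

```python
# Hypothetical numbers, for illustration only.
num_prompts      = 1024    # prompts in one RL batch (assumed)
uniform_rollouts = 16      # a typical uniform per-prompt budget (assumed)
hard_task_peak   = 93      # peak adaptive budget for a hard task (from the thread)

total_uniform = num_prompts * uniform_rollouts   # 16,384 rollouts per batch
total_lifted  = num_prompts * hard_task_peak     # 95,232 rollouts if everyone got 93

print(total_lifted / total_uniform)              # ~5.8x the compute
# Adaptive allocation instead redistributes the same 16,384 rollouts:
# hard prompts can get up to 93 only because easy prompts give budget back.
```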
@ZiniuLi
Ziniu Li
1 month
[2/n] Theoretical understanding: we prove how many trials are needed to get non-zero gradients (Theorem 1).
Both easy (p→1) and hard (p→0) tasks need many rollouts to get contrastive outcomes for GRPO.
For extremely hard tasks such as p=0.01, we would require a large value of
1
1
12
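The exact statement of Theorem 1 is in the paper; as a rough stand-in, if each rollout is treated as an i.i.d. Bernoulli(p) success, then N rollouts produce a contrastive group (and hence a non-zero GRPO gradient) with probability 1 - p^N - (1-p)^N. The sketch below only illustrates this approximation, not the paper's bound.

```python
def prob_nonzero_gradient(p, n):
    """Probability that n i.i.d. Bernoulli(p) rollouts contain at least one
    success and at least one failure (i.e. a contrastive group for GRPO)."""
    return 1.0 - p**n - (1.0 - p)**n

for p in (0.5, 0.99, 0.01):               # medium, very easy, and very hard tasks
    for n in (4, 16, 64, 256):            # rollout budget
        print(f"p={p:<5} n={n:<4} P(non-zero gradient) = {prob_nonzero_gradient(p, n):.3f}")

# With p = 0.01 (or symmetrically p = 0.99), n = 64 rollouts give a usable
# gradient only about 47% of the time; reaching ~90% needs roughly n ≈ 230.
```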
@ZiniuLi
Ziniu Li
1 month
[1/n] Real issue: wasted gradients in GRPO 🚨
In GRPO, if all rollouts have the same reward:
✅ Easy tasks → all positive → 0 gradient
❌ Hard tasks → all negative → 0 gradient
→ The policy gets no learning signal.
Empirically (blue curve 👇), effective gradients drop
1
1
23
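A minimal sketch of the failure mode in the [1/n] tweet above, assuming the usual GRPO group-normalized advantage (reward minus group mean, divided by group standard deviation); the function and variable names are illustrative, not from any particular codebase.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: subtract the group mean reward and
    divide by the group standard deviation (plus eps for stability)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(grpo_advantages([1, 1, 1, 1]))   # easy task, all solved  -> [0. 0. 0. 0.]
print(grpo_advantages([0, 0, 0, 0]))   # hard task, all failed  -> [0. 0. 0. 0.]
print(grpo_advantages([1, 0, 0, 0]))   # mixed group -> ~[ 1.73 -0.58 -0.58 -0.58]
```

When every rollout in the group gets the same reward, the advantages are identically zero, so that prompt contributes nothing to the policy gradient.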
@ZiniuLi
Ziniu Li
1 month
πŸš€ Excited to share our work at Bytedance Seed! Knapsack RL: Unlocking Exploration of LLMs via Budget Allocation πŸŽ’ Exploration in LLM training is crucial but expensive. Uniform rollout allocation is wasteful: βœ… Easy tasks β†’ always solved β†’ 0 gradient ❌ Hard tasks β†’
13
102
642
@yizhilll
Yizhi Li @ EMNLP2025
3 months
[1/n] Introducing TreePO🌲, a new RL framework for LLMs! It slashes sampling costs while boosting reasoning capabilities. Daily Paper:
huggingface.co
2
12
58
@GeZhang86038849
Ge Zhang
3 months
Is text-only information enough for LLM/VLM Web Agents? πŸ€” Clearly not. πŸ™…β€β™‚οΈ The modern web is a rich tapestry of text, images πŸ–ΌοΈ, and videos πŸŽ₯. To truly assist us, agents need to understand it all. That's why we built MM-BrowseComp. 🌐 We're introducing MM-BrowseComp πŸš€, a new
1
31
86
@liujiashuo77
Jiashuo Liu
3 months
We built FutureX, the world’s first live benchmark for real future prediction β€” politics, economy, culture, sports, etc. Among 23 AI agents, #Grok4 ranked #1 πŸ† Elon didn’t lie. @elonmusk your model sees further πŸš€πŸ€ LeaderBoard: https://t.co/fwck0NROHZ
229
205
1K
@xueyao_98
Xueyao Zhang
4 months
πŸš€ Our #ACL2025 work INTP unveils how Preference Alignment improves diverse TTS models (AR, Flow-matching, and Masked Generative Model)! Unlock the secrets: ➑️ Customized Post-Training. ➑️ Human-Guided Unlearning. ➑️ Heterogeneous preference pairs to avoid reward hacking.
3
5
18
@GeZhang86038849
Ge Zhang
4 months
Amazing work by @RidgerZhu, more resources to investigate the mechanisms behind hybrid linear attention. Resources: Paper: https://t.co/QEKocLZBEE Hugging Face checkpoint link:
huggingface.co
@RidgerZhu
Rui-Jie (Ridger) Zhu
4 months
Hybrid architectures mix linear & full attention in LLMs. But which linear attention is best? This choice has been mostly guesswork. In our new work, we stop guessing. We trained and open-sourced 72 MODELS (340M & 1.3B) to dissect what truly makes a hybrid model tick 🧶
2
4
12
@zhengyang_42
Zhengyang Tang
4 months
πŸš€ Thrilled to announce that our paper "SCRIT: Self-Evolving LLM Critique without Human or Stronger Models" was accepted to #COLM2025! We enable LLMs to self-improve critique abilities β€” zero human annotations, zero stronger models needed! πŸ”„βœ¨ Looking forward to meeting
@ZiniuLi
Ziniu Li
10 months
πŸš€ Critique abilities are key for scaling LLMs, but current open-source models fall short. We introduce SCRIT: a framework with scalable oversight that enables LLMs to self-improve their critique skills✨ We’ve built a pipeline to generate high-quality synthetic critique data
1
3
8
@zhengyang_42
Zhengyang Tang
5 months
We’re excited to share our new paper β€œCoRT: Code-integrated Reasoning within Thinking”! πŸ€– A post-training framework that teaches Large Reasoning Models (LRMs) to better leverage Code Interpreters for enhanced mathematical reasoning. πŸ” Key Highlights: Strategic hint
1
3
23