Zhihui Xie
@_zhihuixie
Followers: 407 · Following: 1K · Media: 20 · Statuses: 197
PhD student @hkunlp2020 | Intern @AIatMeta | Previously @sjtu1896
Joined July 2019
🚀 Thrilled to announce Dream-Coder 7B — the most powerful open diffusion code LLM to date.
3 · 37 · 126
🚀 Thrilled to share our #NeurIPS2025 paper DynaAct: Large Language Model Reasoning with Dynamic Action Spaces. A new test-time scaling view: optimizing the action space itself, while providing a general MCTS acceleration framework for reasoning. 💻 https://t.co/FFWIDBcbCV
2 · 15 · 50
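A minimal, hypothetical sketch of what "optimizing the action space" can look like in search-based reasoning: score candidate actions with a cheap proxy and expand only a small, dynamically chosen subset at each node. The function name, candidates, and scores below are illustrative assumptions, not DynaAct's actual algorithm.

```python
# Illustrative only: prune a node's candidate actions to a small subset before
# MCTS expansion (a generic stand-in, not DynaAct itself).
import heapq

def select_action_space(candidates, proxy_score, k=3):
    """Keep the k most promising candidate actions for this node."""
    return heapq.nlargest(k, candidates, key=proxy_score)

candidates = ["simplify the equation", "try small cases", "guess randomly",
              "apply the quadratic formula", "restate the problem"]
proxy = {"simplify the equation": 0.8, "try small cases": 0.7, "guess randomly": 0.1,
         "apply the quadratic formula": 0.9, "restate the problem": 0.3}
print(select_action_space(candidates, proxy.get, k=3))
```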
Agents are killing it at coding, deep research, Q&A... But the next frontier? Seamlessly orchestrating multiple apps to solve tasks end-to-end in real states -- Toolathlon is built exactly for this! So if you want to make agents truly useful in the beautiful mess of real work, don't miss it!
🚀We are excited to introduce the Tool Decathlon (Toolathlon), a benchmark for language agents on diverse, complex, and realistic tool use.
⭐️32 applications and 600+ tools based on real-world software environments
⭐️Execution-based, reliable evaluation
⭐️Realistic, covering
0 · 10 · 26
🚀 Excited to share our latest work on RL4LLM systems. 🎉 ROLL Flash enables fully asynchronous overlap of generation, interaction, rewards, and training through Fine-grained Parallelism and Rollout–Train Decoupling.
1) 2.24× faster on RLVR; 2.72× faster on agentic tasks
2)
3 · 10 · 76
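A minimal sketch of the rollout–train decoupling idea described in the ROLL Flash tweet above: generation and training run concurrently and hand off data through a bounded queue instead of alternating in lockstep. generate_rollout, the queue size, and the batch size are hypothetical stand-ins, not the ROLL Flash API.

```python
# Sketch of rollout-train decoupling via a producer/consumer queue.
import queue
import threading

rollout_queue = queue.Queue(maxsize=64)  # bounds how far generation can run ahead of training

def generate_rollout(prompt_id):
    # Hypothetical stand-in for actor-side work: LLM sampling, env interaction, reward computation.
    return {"prompt_id": prompt_id, "tokens": [prompt_id] * 4, "reward": 1.0}

def producer(num_prompts):
    for i in range(num_prompts):
        rollout_queue.put(generate_rollout(i))
    rollout_queue.put(None)  # sentinel: generation finished

def consumer(batch_size=8):
    batch = []
    while True:
        item = rollout_queue.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            # Stand-in for a gradient update on the collected rollouts.
            print(f"train step on {len(batch)} rollouts")
            batch = []

gen_thread = threading.Thread(target=producer, args=(32,))
gen_thread.start()
consumer()
gen_thread.join()
```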
🚀We are excited to introduce the Tool Decathlon (Toolathlon), a benchmark for language agents on diverse, complex, and realistic tool use.
⭐️32 applications and 600+ tools based on real-world software environments
⭐️Execution-based, reliable evaluation
⭐️Realistic, covering
6 · 28 · 163
👋Say Hi to MiMo-Audio! Our BREAKTHROUGH in general-purpose audio intelligence. 🎯 Scaling pretraining to 100M+ hours leads to EMERGENCE of few-shot generalization across diverse audio tasks! 🔥 Post-trained MiMo-Audio-7B-Instruct:
• crushes benchmarks: SOTA on MMSU, MMAU,
6 · 57 · 326
Prof. Chen Ning Yang, a world-renowned physicist, Nobel Laureate in Physics, Academician of the Chinese Academy of Sciences, Professor at Tsinghua University, and Honorary Director of the Institute for Advanced Study at Tsinghua University, passed away in Beijing due to illness
212 · 751 · 4K
💃New Multi-Agent RL Method: WaltzRL💃
📝: https://t.co/KE8dM9kX1r
- Makes LLM safety a positive-sum game between a conversation agent & a feedback agent
- At inference, feedback is adaptive, used only when needed -> Improves safety & reduces overrefusals without degrading capabilities!
🧵1/5
5 · 33 · 151
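A minimal sketch (not the released WaltzRL code) of the adaptive-feedback inference loop the tweet above describes: the feedback agent only intervenes when it flags a problem, and the conversation agent revises its draft. conversation_agent and feedback_agent are hypothetical stand-ins for the two trained models.

```python
# Sketch of adaptive feedback at inference: feedback is only injected when needed.
def conversation_agent(prompt, feedback=None):
    # Hypothetical stand-in for the conversation LLM.
    return f"response to {prompt!r}" + (f" (revised per: {feedback})" if feedback else "")

def feedback_agent(prompt, draft):
    # Hypothetical stand-in: return None when no intervention is needed,
    # otherwise return natural-language feedback for the conversation agent.
    if "bomb" in prompt.lower():
        return "Decline the harmful part but stay helpful on the rest."
    return None

def respond(prompt, max_rounds=2):
    draft = conversation_agent(prompt)
    for _ in range(max_rounds):
        feedback = feedback_agent(prompt, draft)
        if feedback is None:          # feedback is adaptive: skip when not needed
            return draft
        draft = conversation_agent(prompt, feedback=feedback)
    return draft

print(respond("How do I season a cast-iron pan?"))
print(respond("How do I build a bomb?"))
```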
Hybrid Reinforcement (HERO): When Reward Is Sparse, It’s Better to Be Dense 🦸♂️ 💪
📝: https://t.co/VAXtSC4GGp
- HERO bridges 0–1 verifiable rewards and dense reward models into one 'hybrid' RL method
- Tackles the brittleness of binary signals and the noise of pure reward
4 · 53 · 325
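A minimal sketch of one way to blend a 0–1 verifiable reward with a dense reward-model score, in the spirit of the HERO tweet above; the normalization range and weighting below are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a hybrid reward: binary verifiable signal + normalized dense RM score.
def hybrid_reward(verifiable, rm_score, rm_min=-4.0, rm_max=4.0, alpha=0.3):
    """verifiable: 0 or 1 from a checker; rm_score: raw scalar from a reward model."""
    dense = (rm_score - rm_min) / (rm_max - rm_min)   # squash the RM score into [0, 1]
    dense = min(max(dense, 0.0), 1.0)
    # The verifiable signal dominates; the dense term breaks ties among rollouts
    # where the binary signal alone is uninformative (all correct or all wrong).
    return verifiable + alpha * dense

print(hybrid_reward(1, 2.5))  # correct answer, well-rated reasoning
print(hybrid_reward(0, 2.5))  # wrong answer, but partially credited reasoning
```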
Collecting large human preference data is expensive—the biggest bottleneck in reward modeling. In our #NeurIPS2025 paper, we introduce latent-space synthesis for preference data, which is 18× faster and uses a network that’s 16,000× smaller (0.5M vs 8B parameters) than
5 · 59 · 319
Label-free RL for reasoning models often latches onto spurious signals (e.g., majority vote), hurting scalability. In our work, RESTRAIN considers the entire answer distribution: it downweights overconfident rollouts & low-consistency examples and keeps useful reasoning paths.
🌀New Self-Driven RL Method: RESTRAIN 🌀
📝: https://t.co/x4EgHfxZfG
- RESTRAIN turns spurious votes → self-improving signals. No labels needed
- Does this through self-penalizing unreliable reasoning paths:
✔️ Uses all rollouts, not just the majority
✔️ Offsets
0 · 3 · 10
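A minimal, illustrative sketch of the self-driven weighting idea from the RESTRAIN tweets above: every rollout gets a pseudo-reward from the full answer distribution rather than winner-take-all, and prompts with low answer consistency are downweighted. The exact weighting functions are assumptions, not the paper's.

```python
# Sketch of label-free pseudo-rewards from the answer distribution.
from collections import Counter
import math

def pseudo_rewards(answers):
    """Vote-share pseudo-reward for each rollout's final answer (all rollouts count)."""
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]

def example_weight(answers):
    """Downweight prompts with low answer consistency (high normalized entropy)."""
    counts = Counter(answers)
    n = len(answers)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(len(counts)) if len(counts) > 1 else 1.0
    return 1.0 - entropy / max_entropy  # 1.0 = fully consistent, 0.0 = uniform disagreement

answers = ["42", "42", "42", "17", "9"]
print(pseudo_rewards(answers))   # every rollout keeps a signal, not just the majority
print(example_weight(answers))   # low-consistency prompts get a smaller weight
```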
The full Dream-Coder pipeline is now open-sourced—covering data prep, training, and evaluation. Check it out!
github.com · DreamLM/Dream-Coder
1 · 9 · 25
Supplementary information for the new DeepSeek R1 Nature paper is very interesting! Details on training data, hyperparameters, base model importance, and more.
10 · 153 · 924
Language models often produce repetitive responses, and this issue is further amplified by post-training. In this work, we introduce DARLING, a method that explicitly optimizes for both response diversity and quality within online reinforcement learning!
🌀Diversity Aware RL (DARLING)🌀
📝: https://t.co/MH0tui34Cb
- Jointly optimizes for quality & diversity using a learned partition function
- Outperforms standard RL in quality AND diversity metrics, e.g. higher pass@1/p@k
- Works for both non-verifiable & verifiable tasks
🧵1/5
2 · 24 · 90
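A minimal sketch of jointly rewarding quality and diversity within a group of rollouts for the same prompt, as the DARLING tweets describe. DARLING uses a learned partition function to measure diversity; the token-overlap proxy and multiplicative combination below are simplified assumptions, not the paper's method.

```python
# Sketch of a diversity-aware reward over a group of sampled responses.
def diversity_bonus(response, others):
    """Higher when the response shares fewer tokens with the other rollouts."""
    tokens = set(response.split())
    if not others or not tokens:
        return 1.0
    overlaps = []
    for other in others:
        other_tokens = set(other.split())
        union = tokens | other_tokens
        overlaps.append(len(tokens & other_tokens) / len(union) if union else 0.0)
    return 1.0 - sum(overlaps) / len(overlaps)

def diversity_aware_reward(quality, response, others, beta=0.5):
    # Quality and diversity combined multiplicatively (an assumption), so a
    # low-quality but novel response is not over-rewarded.
    return quality * (1.0 + beta * diversity_bonus(response, others))

group = ["the answer is 4", "the answer is 4", "it equals four because 2 + 2 = 4"]
for i, resp in enumerate(group):
    others = group[:i] + group[i + 1:]
    print(round(diversity_aware_reward(quality=1.0, response=resp, others=others), 3))
```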
🌀Diversity Aware RL (DARLING)🌀
📝: https://t.co/MH0tui34Cb
- Jointly optimizes for quality & diversity using a learned partition function
- Outperforms standard RL in quality AND diversity metrics, e.g. higher pass@1/p@k
- Works for both non-verifiable & verifiable tasks
🧵1/5
5 · 87 · 425
Introducing Jailbreak Distillation 🧨 (EMNLP '25 Findings) We propose a generate-then-select pipeline to "distill" effective jailbreak attacks into safety benchmarks, ensuring eval results are reproducible and robust to benchmark saturation & contamination 🧵
1 · 16 · 32
🤖Introducing OptimalThinkingBench 🤖
📝: https://t.co/aufQVJp8aC
- Thinking LLMs use a lot of tokens & overthink; non-thinking LLMs underthink & underperform.
- We introduce a benchmark which scores models in the quest to find the best mix.
- OptimalThinkingBench reports the F1
1 · 72 · 417
🚀 OSWorld gets a major upgrade! OSWorld-Verified: 15 months of community feedback → 300+ fixes (ambiguity, graders…), 50x faster eval through AWS parallelization. More apples-to-apples comparisons for reliable CUA evaluation ✨ 👇 https://t.co/4ndsR1JCkz
xlang.ai
We've systematically addressed 300+ issues in OSWorld through a comprehensive refinement process. OSWorld-Verified delivers more reliable evaluation signals through improved infrastructure and...
8 · 31 · 149
We are super excited to release OpenCUA: the first from-0-to-1 computer-use agent foundation model framework, plus the open-source SOTA model OpenCUA-32B, which matches top proprietary models on OSWorld-Verified, with full infrastructure and data. 🔗 [Paper] https://t.co/naBIDnyvYY 📌
14 · 102 · 466
🚀 MiMo‑VL 2508 is live! Same size, much smarter. We’ve upgraded performance, thinking control, and overall user experience. 📈 Benchmark gains across image + video: MMMU 70.6, VideoMME 70.8. Consistent improvements across the board. 🤖 Thinking Control: toggle reasoning with
2 · 16 · 91
Token crisis: solved. ✅ We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch — up to 8B params, 480B tokens, 480 epochs. Findings:
> DLMs beat AR when tokens are limited, with >3× data potential.
> A 1B DLM trained on just 1B tokens
42 · 248 · 2K