Zhihui Xie

@_zhihuixie

Followers: 407 · Following: 1K · Media: 20 · Statuses: 197

PhD student @hkunlp2020 | Intern @AIatMeta | Previously @sjtu1896

Joined July 2019
@_zhihuixie
Zhihui Xie
4 months
🚀 Thrilled to announce Dream-Coder 7B — the most powerful open diffusion code LLM to date.
3
37
126
@xlzhao_hku
Xueliang Zhao
4 days
🚀 Thrilled to share our #NeurIPS2025 paper DynaAct: Large Language Model Reasoning with Dynamic Action Spaces. A new test-time scaling view — optimizing the action space itself, while providing a general MCTS acceleration framework for reasoning. 💻 https://t.co/FFWIDBcbCV
2
15
50
@lockonlvange
Junlong Li
17 days
Agents are killing it at coding, deep research, Q&A... But the next frontier? Seamlessly orchestrating multiple apps to solve tasks end2end in real states -- Toolathlon is built just for this! So if you want to make agents truly useful in the beautiful mess of real work, don't miss it!
@junxian_he
Junxian He
18 days
🚀We are excited to introduce the Tool Decathlon (Toolathlon), a benchmark for language agents on diverse, complex, and realistic tool use. ⭐️32 applications and 600+ tools based on real-world software environments ⭐️Execution-based, reliable evaluation ⭐️Realistic, covering
0
10
26
@AcceptOral
Han Lu
16 days
🚀 Excited to share our latest work on RL4LLM systems. 🎉 ROLL Flash enables fully asynchronous overlap of generation, interaction, rewards, and training through Fine-grained Parallelism and Rollout–Train Decoupling. 1) 2.24× faster on RLVR; 2.72× faster on agentic tasks 2)
3
10
76
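The rollout–train decoupling described above can be pictured as a bounded producer–consumer pipeline: generation streams rollouts into a queue that the trainer drains concurrently, so the two stages overlap instead of alternating. This is a minimal illustrative sketch, not ROLL Flash's actual architecture; the worker and the optimizer step are placeholders.

```python
import queue
import threading
import time

def rollout_worker(out_q, n_rollouts):
    """Producer: stands in for LLM generation + environment interaction + reward."""
    for i in range(n_rollouts):
        time.sleep(0.01)              # placeholder generation latency
        out_q.put({"tokens": [i], "reward": float(i % 2)})
    out_q.put(None)                   # sentinel: no more rollouts

def train_loop(in_q, batch_size):
    """Consumer: train on rollouts as they stream in, overlapping with generation."""
    batch, steps = [], 0
    while True:
        item = in_q.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            steps += 1                # placeholder optimizer step
            batch.clear()
    return steps

q = queue.Queue(maxsize=8)            # bounded queue applies backpressure
producer = threading.Thread(target=rollout_worker, args=(q, 16))
producer.start()
steps = train_loop(q, batch_size=4)
producer.join()
print(steps)                          # 16 rollouts / batch of 4 -> 4 steps
```

The bounded queue is the key design choice: it lets generation run ahead of training by a fixed budget without unbounded staleness.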
@_TobiasLee
Lei Li
24 days
👋Say Hi to MiMo-Audio! Our BREAKTHROUGH in general-purpose audio intelligence. 🎯 Scaling pretraining to 100M+ hours leads to EMERGENCE of few-shot generalization across diverse audio tasks! 🔥 Post-trained MiMo-Audio-7B-Instruct: • crushes benchmarks: SOTA on MMSU, MMAU,
6
57
326
@Tsinghua_Uni
Tsinghua University
30 days
Prof. Chen Ning Yang, a world-renowned physicist, Nobel Laureate in Physics, Academician of the Chinese Academy of Sciences, Professor at Tsinghua University, and Honorary Director of the Institute for Advanced Study at Tsinghua University, passed away in Beijing due to illness
212
751
4K
@jaseweston
Jason Weston
1 month
💃New Multi-Agent RL Method: WaltzRL💃 📝: https://t.co/KE8dM9kX1r - Makes LLM safety a positive-sum game between a conversation & feedback agent - At inference feedback is adaptive, used when needed -> Improves safety & reduces overrefusals without degrading capabilities! 🧵1/5
5
33
151
@jaseweston
Jason Weston
1 month
Hybrid Reinforcement (HERO): When Reward Is Sparse, It’s Better to Be Dense 🦸‍♂️ 💪 📝: https://t.co/VAXtSC4GGp - HERO bridges 0–1 verifiable rewards and dense reward models into one 'hybrid' RL method - Tackles the brittleness of binary signals and the noise of pure reward
4
53
325
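One way to picture bridging a sparse 0–1 verifier with a dense reward model is a clipped interpolation: squash the RM score into [0, 1] and mix it with the binary outcome, so correct answers still get graded by quality. This sketch is illustrative only and not HERO's actual formulation; `alpha` and the RM score range are assumed parameters.

```python
def hybrid_reward(verifiable, rm_score, alpha=0.5, rm_min=-4.0, rm_max=4.0):
    """Blend a 0/1 verifier signal with a dense reward-model score.

    Illustrative interpolation (not HERO's exact method): the dense score
    is clipped and rescaled to [0, 1], then mixed with the binary outcome.
    """
    dense = (min(max(rm_score, rm_min), rm_max) - rm_min) / (rm_max - rm_min)
    return alpha * float(verifiable) + (1 - alpha) * dense

# A verified-correct answer with a mediocre RM score still beats a
# verified-wrong answer with a high RM score:
r_good = hybrid_reward(verifiable=1, rm_score=0.0)   # 0.5*1 + 0.5*0.5 = 0.75
r_bad  = hybrid_reward(verifiable=0, rm_score=3.2)   # 0.5*0 + 0.5*0.9 = 0.45
```

The blend keeps the verifier's reliability while the dense term breaks ties among rollouts the binary signal cannot distinguish.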
@SharonYixuanLi
Sharon Li
1 month
Collecting large human preference data is expensive—the biggest bottleneck in reward modeling. In our #NeurIPS2025 paper, we introduce latent-space synthesis for preference data, which is 18× faster and uses a network that’s 16,000× smaller (0.5M vs 8B parameters) than
5
59
319
@Zhaoning1996
Zhaoning Yu
1 month
Label-free RL for reasoning models often latches onto spurious signals (e.g., majority vote), hurting scalability. In our work, RESTRAIN considers the entire answer distribution—downweighting overconfident rollouts and low-consistency examples while keeping useful reasoning paths.
@jaseweston
Jason Weston
1 month
🌀New Self-Driven RL Method: RESTRAIN 🌀 📝: https://t.co/x4EgHfxZfG - RESTRAIN turns spurious votes → self-Improving signals. No labels needed - Does this through self-penalizing unreliable reasoning paths: ✔️ Uses all rollouts, not just the majority, ✔️ Offsets
0
3
10
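The idea of using the whole answer distribution instead of a single majority vote can be illustrated with a toy weighting scheme: every rollout is credited by its answer's empirical frequency (so minority paths still contribute), and examples whose vote is nearly unanimous are dropped as overconfident. This is a hypothetical sketch, not RESTRAIN's actual objective; `max_conf` is an assumed threshold.

```python
from collections import Counter

def restrain_style_weights(answers, max_conf=0.9):
    """Toy label-free weighting over rollout answers (illustrative only).

    Each rollout's pseudo-reward is its answer's empirical frequency,
    and the whole example is zeroed out when the vote distribution is
    nearly unanimous — a crude guard against spurious confident signals.
    """
    n = len(answers)
    probs = {a: c / n for a, c in Counter(answers).items()}
    example_weight = 1.0 if max(probs.values()) <= max_conf else 0.0
    return [example_weight * probs[a] for a in answers]

w = restrain_style_weights(["A", "A", "B", "A"])
print(w)  # [0.75, 0.75, 0.25, 0.75] — the minority rollout still contributes
```

A unanimous example like `["A"] * 5` gets weight zero everywhere, mimicking the "downweight overconfident rollouts" intuition.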
@_zhihuixie
Zhihui Xie
2 months
The full Dream-Coder pipeline is now open-sourced—covering data prep, training, and evaluation. Check it out!
github.com
Contribute to DreamLM/Dream-Coder development by creating an account on GitHub.
@_zhihuixie
Zhihui Xie
4 months
🚀 Thrilled to announce Dream-Coder 7B — the most powerful open diffusion code LLM to date.
1
9
25
@rosstaylor90
Ross Taylor
2 months
Supplementary information for the new DeepSeek R1 Nature paper is very interesting! Details on training data, hyperparameters, base model importance, and more.
10
153
924
@tli104
Tianjian Li
2 months
Language models often produce repetitive responses, and this issue is further amplified by post-training. In this work, we introduce DARLING, a method that explicitly optimizes for both response diversity and quality within online reinforcement learning!
@jaseweston
Jason Weston
2 months
🌀Diversity Aware RL (DARLING)🌀 📝: https://t.co/MH0tui34Cb - Jointly optimizes for quality & diversity using a learned partition function - Outperforms standard RL in quality AND diversity metrics, e.g. higher pass@1/p@k - Works for both non-verifiable & verifiable tasks 🧵1/5
2
24
90
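A toy version of diversity-aware rewards can be sketched by scaling each response's quality with a partition-based diversity bonus: responses falling in a crowded partition share credit, pushing the policy toward outputs that are high-quality AND distinct. Note that DARLING uses a *learned* partition function; this sketch substitutes naive exact-string grouping, so it is illustrative only.

```python
from collections import defaultdict

def diversity_aware_rewards(responses, qualities):
    """Toy quality-times-diversity reward (illustrative, not DARLING itself).

    Partition responses (here: crude exact-string matching), then scale
    each response's quality by 1/|partition| so duplicates split credit.
    """
    groups = defaultdict(list)
    for i, resp in enumerate(responses):
        groups[resp].append(i)
    rewards = [0.0] * len(responses)
    for members in groups.values():
        bonus = 1.0 / len(members)        # crowded partitions earn less
        for i in members:
            rewards[i] = qualities[i] * bonus
    return rewards

r = diversity_aware_rewards(["yes", "yes", "sure"], [1.0, 1.0, 0.8])
print(r)  # [0.5, 0.5, 0.8] — the distinct answer wins despite lower quality
```

Replacing exact matching with a learned equivalence over responses is what makes the real method apply to non-verifiable, free-form tasks.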
@jackjingyuzhang
Jack Jingyu Zhang
3 months
Introducing 𝐉𝐚𝐢𝐥𝐛𝐫𝐞𝐚𝐤 𝐃𝐢𝐬𝐭𝐢𝐥𝐥𝐚𝐭𝐢𝐨𝐧 🧨 (EMNLP '25 Findings) We propose a generate-then-select pipeline to "distill" effective jailbreak attacks into safety benchmarks, ensuring eval results are reproducible and robust to benchmark saturation & contamination🧵
1
16
32
@jaseweston
Jason Weston
3 months
🤖Introducing OptimalThinkingBench 🤖 📝: https://t.co/aufQVJp8aC - Thinking LLMs use a lot of tokens & overthink; non-thinking LLMs underthink & underperform. - We introduce a benchmark which scores models in the quest to find the best mix. - OptimalThinkingBench reports the F1
1
72
417
@TianbaoX
Tianbao Xie
3 months
🚀 OSWorld gets a major upgrade! OSWorld-Verified: 15 months of community feedback → 300+ fixes (ambiguity, graders…), 50x faster eval through AWS parallelization. Apples-to-apples comparisons for reliable CUA evaluation ✨ 👇 https://t.co/4ndsR1JCkz
xlang.ai
We've systematically addressed 300+ issues in OSWorld through a comprehensive refinement process. OSWorld-Verified delivers more reliable evaluation signals through improved infrastructure and...
8
31
149
@xywang626
Xinyuan Wang
3 months
We are super excited to release OpenCUA — the first from-0-to-1 computer-use agent foundation model framework, plus the open-source SOTA model OpenCUA-32B, matching top proprietary models on OSWorld-Verified, with full infrastructure and data. 🔗 [Paper] https://t.co/naBIDnyvYY 📌
14
102
466
@_TobiasLee
Lei Li
3 months
🚀 MiMo‑VL 2508 is live! Same size, much smarter. We’ve upgraded performance, thinking control, and overall user experience. 📈 Benchmark gains across image + video: MMMU 70.6, VideoMME 70.8. Consistent improvements across the board. 🤖 Thinking Control: toggle reasoning with
2
16
91
@NiJinjie
Jinjie Ni
3 months
Token crisis: solved. ✅ We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch — up to 8B params, 480B tokens, 480 epochs. Findings: > DLMs beat AR when tokens are limited, with >3× data potential. > A 1B DLM trained on just 1B tokens
42
248
2K