Beidi Chen
@BeidiChen
Followers
15K
Following
1K
Media
35
Statuses
540
Asst. Prof @CarnegieMellon, @amazon Scholar, Prev: Visiting Researcher @Meta, Postdoc @Stanford, Ph.D. @RiceUniversity, Large-Scale ML, a fan of Dota2.
Joined November 2011
On-Policy Distillation with Reverse KL — sounds like Self-Forcing + DMD for language modeling 😀 Maybe a bidirectional (diffusion) teacher would make it even better?
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other
2
10
85
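(For readers curious what the objective looks like in practice, here is a minimal PyTorch sketch of a per-token reverse-KL distillation loss computed on the student's own rollouts; the shapes and random logits are illustrative assumptions, not code from the linked post.)

```python
# A minimal sketch of per-token reverse-KL on-policy distillation, assuming
# student and teacher logits have already been computed on the student's own
# sampled rollouts. Shapes and values are illustrative only.
import torch
import torch.nn.functional as F

def reverse_kl_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(student || teacher), averaged over tokens.

    Both tensors are [batch, seq_len, vocab] over the student's sampled
    trajectories, so mistakes the student actually makes receive a dense,
    per-token correction signal from the teacher.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # E_{x ~ student} [log p_student(x) - log p_teacher(x)]
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    return kl.mean()

# Toy usage with random logits standing in for real model outputs.
student = torch.randn(2, 8, 100, requires_grad=True)
teacher = torch.randn(2, 8, 100)
loss = reverse_kl_loss(student, teacher)
loss.backward()
```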
📣 We study a threat model in which users intend to leverage an LLM agent to fix problems in their code base, but the agent could just slip vulnerabilities in while still passing all the tests. I think security will become an increasingly important problem as agents' abilities grow. So much fun
🚀If your code agent generates a patch that passes all tests, should you trust it to merge automatically? ⚠️You probably shouldn’t! “Correct” ≠ “Safe.” In our study we show that a single normal-looking issue description, whether from a benign user or not, can lead code agents
0
3
29
Happy to see the effectiveness of sparse FT in balancing new information and old knowledge. We proposed S2FT ( https://t.co/wkvTSket4h) with a similar motivation a year ago, and I believe the introduction of memory layers leads to even better continual learning!
arxiv.org
Modern language models are powerful, but typically static after deployment. A major obstacle to building models that continually learn over time is catastrophic forgetting, where updating on new...
🧠 How can we equip LLMs with memory that allows them to continually learn new things? In our new paper with @AIatMeta, we show how sparsely finetuning memory layers enables targeted updates for continual learning, w/ minimal interference with existing knowledge. While full
3
4
24
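(A toy PyTorch sketch of the general sparse fine-tuning idea referenced above: freeze most weights and update only a small, targeted subset to limit interference with existing knowledge. The row-selection rule and toy model are illustrative assumptions, not the actual S2FT or memory-layer method.)

```python
# Toy sketch: keep only a few rows of a weight matrix trainable and mask the
# gradients of everything else, so new data causes targeted, sparse updates.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(64, 64)
model.bias.requires_grad_(False)            # keep the sketch focused on the weight matrix
x, y = torch.randn(16, 64), torch.randn(16, 64)

# Probe gradients once to pick a sparse set of rows to keep trainable
# (an illustrative selection rule, not the one used in the paper).
F.mse_loss(model(x), y).backward()
trainable_rows = model.weight.grad.abs().sum(dim=1).topk(k=4).indices  # 4 of 64 rows
model.zero_grad()

# Mask gradients so only the selected rows ever get updated.
mask = torch.zeros_like(model.weight)
mask[trainable_rows] = 1.0
model.weight.register_hook(lambda g: g * mask)

opt = torch.optim.SGD([model.weight], lr=1e-2)
for _ in range(10):
    opt.zero_grad()
    F.mse_loss(model(x), y).backward()
    opt.step()
```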
Wow, congrats on the release! A very important step towards a self-improving kernel agent 😉
🤔 Can AI optimize the systems it runs on? 🚀 Introducing FlashInfer-Bench, a workflow that makes AI systems self-improving with agents:
- Standardized signature for LLM serving kernels
- Implement kernels with your preferred language
- Benchmark them against real-world serving
0
7
44
Congrats!!! So honored to be part of the team 🎉 Haha, my first time making a contribution in the computer architecture field — thanks for carrying me 🙏
2
0
46
📢🔥 New off-policy RL for LLMs: now training a 32B model on data that is 200+ steps stale for the first time, while still matching on-policy accuracy 💪 A big step toward scalable & decentralized agent training 😉
🤔Can we train RL on LLMs with extremely stale data? 🚀Our latest study says YES! Stale data can be as informative as on-policy data, unlocking more scalable, efficient asynchronous RL for LLMs. We introduce M2PO, an off-policy RL algorithm that keeps training stable and
4
19
213
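(For context on why stale data is tricky: the standard workaround is importance-ratio correction with clipping, sketched below in PyTorch. This is the generic PPO-style ingredient, not the M2PO objective from the quoted paper; all tensors are random stand-ins.)

```python
# A hedged sketch of the usual fix for off-policy (stale) rollouts: reweight
# each token by the probability ratio between the current policy and the stale
# behavior policy, and clip the ratio to keep training stable.
import torch

def clipped_offpolicy_pg_loss(logp_current: torch.Tensor,
                              logp_behavior: torch.Tensor,
                              advantages: torch.Tensor,
                              clip_eps: float = 0.2) -> torch.Tensor:
    """logp_*: per-token log-probs of the sampled tokens; advantages: per-token."""
    ratio = torch.exp(logp_current - logp_behavior)           # importance weight
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage: rollouts collected many steps ago under a stale behavior policy.
logp_old = torch.randn(4, 32).clamp(max=0)
logp_new = torch.randn(4, 32, requires_grad=True)
adv = torch.randn(4, 32)
loss = clipped_offpolicy_pg_loss(logp_new, logp_old, adv)
loss.backward()
```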
🚀 Excited to share that #Multiverse has been accepted to #NeurIPS 2025! Couldn’t have done it without such incredible collaborators—thank you!!
🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46% 🌐 Website: https://t.co/J9osByhWUf 🧵 1/n
1
4
22
[#TI2025 Summary] In the just-concluded The International 2025 grand finals, we lost 2-3 to a strong opponent, Falcons, and finished as the runner-up of this TI. This year has been a bumpy one for us: results fluctuated, the roster changed, and we've
241
436
4K
Introducing DeepConf: Deep Think with Confidence 🚀 First method to achieve 99.9% on AIME 2025 with open-source models! Using GPT-OSS-120B even without tools, we reached this almost-perfect accuracy while saving up to 85% of generated tokens. It also delivers many strong
63
333
2K
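(A toy Python sketch of confidence-filtered majority voting, the rough idea behind this kind of method: score each sampled trace by a model-internal confidence, drop the least confident, and vote among the rest. The exact confidence measure and online early-stopping in DeepConf are more involved; the numbers below are made up.)

```python
# Toy confidence-filtered majority voting over sampled reasoning traces.
from collections import Counter

# (final_answer, mean_token_logprob) for several sampled traces; values invented.
traces = [("42", -0.31), ("42", -0.28), ("17", -0.95), ("42", -0.40), ("17", -1.30)]

def vote_with_confidence(traces, keep_frac=0.6):
    # Keep only the most confident fraction of traces, then majority-vote.
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_frac))]
    counts = Counter(answer for answer, _ in kept)
    return counts.most_common(1)[0][0]

print(vote_with_confidence(traces))  # -> "42"
```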
🎉 Glad to see our attention sink is widely adopted and contributing to the strong open-source models ~ please check out this post by @Guangxuan_Xiao for many insights and hypotheses. It would be interesting for folks who’ve seen artifacts / outliers in generated content and model
I've written the full story of Attention Sinks — a technical deep-dive into how the mechanism was developed and how our research ended up being used in OpenAI's new OSS models. For those interested in the details: https://t.co/0EAi2KQMMx
3
6
129
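(For a concrete picture of the mechanism: a minimal sketch of the StreamingLLM-style KV-cache policy built on attention sinks, where the first few "sink" tokens are always kept alongside a sliding window of recent tokens. Indices only; no real model or cache, and the budget numbers are illustrative.)

```python
# Minimal sketch: always retain a few sink tokens (which soak up a large share
# of attention mass) plus a sliding window of recent tokens in the KV cache.
def kept_cache_positions(seq_len: int, num_sink: int = 4, window: int = 1020) -> list[int]:
    """Token positions retained in the KV cache under a num_sink + window budget."""
    if seq_len <= num_sink + window:
        return list(range(seq_len))
    sinks = list(range(num_sink))                    # always-kept sink tokens
    recent = list(range(seq_len - window, seq_len))  # sliding window of recent tokens
    return sinks + recent

print(len(kept_cache_positions(5000)))  # 1024 of 5000 positions kept
```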
🤖 GPT-5 supports 128K output / 400K input tokens. 📜 Wiles’s Fermat proof took ~88K tokens — the final output only. 🧩 Add years of exploration, likely >880K tokens of reasoning. 🧠 Real intelligence isn’t about making it short — it’s about exploring the sparsity in the logic.
0
2
8
The release of GPT-OSS-120B & GPT-OSS-20B models today incorporates my Attention Sink work ( https://t.co/u67QTC3rzh). Exciting to see this come to life! 🎉 Looking forward to more progress in this space. 😁
Our open models are here. Both of them. https://t.co/9tFxefOXcg
18
50
735
Big Congrats @Anshumali_ 🎈🎉
Congrats to Rice CS' @Anshumali_ Shrivastava, who has been promoted to full professor. Shrivastava is well on his way to revolutionizing how LLMs & other deep learning models are trained & stored, using new algorithms to make AI scalable & more accessible. https://t.co/8VpFk371gp
1
0
27
(1/n) 🚀 With FastVideo, you can now generate a 5-second video in 5 seconds on a single H200 GPU! Introducing the FastWan series, a family of fast video generation models trained via a new recipe we term “sparse distillation”, speeding up video denoising time by 70X! 🖥️ Live
10
100
422
Tired of intricate system code for RL training? 🤯 We release AReaL-lite – a lightweight AReaL version for AI researchers! 🚀#opensource ✨ Algorithm-first design & APIs🎉 ✨ 80% less code w/ 90% of AReaL's full efficiency 🎉 ✨ Customizable agentic RL🎉 🔗 https://t.co/YUa03pp9LR
3
26
70
🥳
Huge thanks to @tinytitans_icml for an amazing workshop — see you next year! Honored to receive a Best Paper Award 🏆 Let’s unlock the potential of sparsity! Next up: scaling to hundreds/thousands of rollouts? Or making powerful R1/K2-level LLMs (not just 8B 4-bit models) run
8
5
145
I will be in front of the GSM-Infinite poster tomorrow 2-4:30 pm. 🫡 East Exhibition Hall E-2901. Please come and say hi. Happy to chat about LLM evals, synthetic data, and more!
🚨 Super excited to share that our paper "GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?" got accepted at ICML 2025! 🧠📏🎉 P.S. 😅 Struggling to make your long-context LLM submission stand out at NeurIPS? 🧠 Give
0
2
12
Beginner Q: Does anyone know the details of why Ray doesn’t support IPv6? I was debugging verl on a cluster and found the root cause was IPv6 with Ray … it seems to have been a known issue for a while but never got resolved?
4
0
9