Chenyu (Monica) Wang
@ChenyuW64562111
Followers: 885 · Following: 397 · Media: 25 · Statuses: 106
PhD @MIT_CSAIL | Prev @AIatMeta @genentech @Tsinghua_Uni
Cambridge, MA
Joined September 2022
Introducing SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models. We propose a new policy gradient algorithm, SPG, for diffusion large language models. SPG improves accuracy over the previous state-of-the-art RL methods by 3.6% in GSM8K, 2.6% in MATH500, 18.4%
3 replies · 28 reposts · 123 likes
Proud to have worked under Yuandong for the past year :) Just realized Yuandong got deactivated before he could see my reply under his badge post: You reminded me why I chose this path: seeking the truth of nature/intelligence is the most beautiful and impactful thing we can do. Yuandong,
3 replies · 7 reposts · 244 likes
Several of my team members and I are impacted by today's layoff. Feel free to connect :)
468 replies · 289 reposts · 6K likes
✂️Introducing ProofOptimizer: a training and inference recipe for proof shortening! 😰AI-written formal proofs can be long and unreadable: Seed-Prover's proof of IMO '25 P1 is 16x longer in Lean vs. English. Our 7B shortens proofs generated by SoTA models by over 50%! 🧵⬇️
6 replies · 35 reposts · 201 likes
(1/6) Check out our new paper: Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model A Latent Reasoner! arxiv: https://t.co/ldDqxufyG5 Do diffusion language models (DLMs) need to be discrete? No! We show that continuous diffusion models are more
arxiv.org
Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and preliminary empirical results showing the advantages...
2 replies · 29 reposts · 101 likes
🚨🚨Great work from our intern @ChenyuW64562111! Our proposed SPG (Sandwiched Policy Gradient) is based on a very simple intuition: RL has both positive and negative samples, so we need to learn with both an upper and a lower bound of the log-likelihood of the text diffusion model. Strong
3 replies · 13 reposts · 113 likes
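The intuition in the post above maps naturally to a small piece of code. Below is a minimal PyTorch sketch of the sandwiched pairing, assuming the caller supplies per-sample `elbo` and `eubo` estimates of the masked-diffusion log-likelihood (hypothetical names; this illustrates the intuition, not the released SPG implementation):

```python
import torch

def spg_surrogate_loss(elbo, eubo, advantages):
    """Sketch of the 'sandwiched' pairing described above (not the authors' code).

    elbo:       tractable lower bound on log p_theta(y | x) per sample, shape (B,)
    eubo:       tractable upper bound on the same log-likelihood, shape (B,)
    advantages: scalar reward/advantage per sample, shape (B,)

    Positively rewarded samples should have their log-likelihood pushed up, so we
    push up its lower bound; negatively rewarded samples should have it pushed
    down, so we push down its upper bound. In both cases the surrogate bounds the
    intractable policy-gradient objective from below.
    """
    pos = advantages > 0
    per_sample = torch.where(pos, advantages * elbo, advantages * eubo)
    return -per_sample.mean()  # minimize the negative bounded objective
```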
[5/n] Credit to my great coauthors @paria_rd DiJia Andy Su @songjiang24 Sid Wang @siyan_zhao @zhuci19 @shannonzshen Feiyu Chen Tommi Jaakkola @tydsh @cranialxix 🔗For more results, please refer to our paper https://t.co/e9OhHTqFhN. 💻The code has also been released:
arxiv.org
Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with...
0 replies · 2 reposts · 8 likes
[4/n] Despite being trained with confidence-based semi-AR decoding and block-wise masking, SPG generalizes well to different inference strategies, even non-semi-AR ones.
1 reply · 1 repost · 5 likes
[3/n] The reward dynamics during training show a rapid and steady increase over the optimization steps, further demonstrating SPG's efficiency and robustness.
1 reply · 1 repost · 4 likes
[2/n] SPG consistently outperforms the previous state-of-the-art RL methods for dLLMs across mathematical reasoning (GSM8K, MATH500) and logical reasoning (Countdown, Sudoku) benchmarks. Specifically, SPG improves accuracy over the previous state-of-the-art by 3.6% in GSM8K,
1 reply · 1 repost · 3 likes
[1/n] dLLMs are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, applying RL algorithms to dLLMs is challenging because of their intractable log-likelihood. SPG computes a more robust and less
1 reply · 1 repost · 8 likes
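For context on the "intractable log-likelihood" point, here is a rough Monte Carlo sketch of the kind of tractable evidence lower bound commonly used for masked diffusion models, assuming a linear masking schedule and a hypothetical `model` that maps noised tokens to per-position logits (an illustration only, not SPG's estimator):

```python
import torch

def masked_diffusion_elbo(model, tokens, mask_id, n_samples=8):
    """Monte Carlo estimate of a standard masked-diffusion evidence lower bound
    on log p_theta(tokens), assuming a linear masking schedule.
    `model(noised)` is assumed to return per-position logits of shape (B, L, V).
    """
    B, L = tokens.shape
    estimates = []
    for _ in range(n_samples):
        t = torch.rand(B, device=tokens.device).clamp_(min=1e-3)       # masking ratio per sequence
        mask = torch.rand(B, L, device=tokens.device) < t[:, None]     # positions to mask
        noised = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
        logp = torch.log_softmax(model(noised), dim=-1)                 # (B, L, V)
        tok_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)    # (B, L)
        # the 1/t weight corresponds to the linear schedule; sum only masked positions
        estimates.append((tok_logp * mask.float()).sum(dim=-1) / t)
    return torch.stack(estimates).mean(dim=0)                           # (B,) lower bound on log-likelihood
```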
Diffusion training can benefit greatly from a good representation space. If you enjoy @sainingxie's RAE, you may also check out our REED paper👇 In our @NeurIPSConf 2025 paper, we find that this benefit can also come from the representation of a different (synthetic) modality (e.g.
arxiv.org
Diffusion models can be improved with additional guidance towards more effective representations of input. Indeed, prior empirical work has already shown that aligning internal representations of...
Excited to share: “Learning Diffusion Models with Flexible Representation Guidance” With my amazing coauthors @zhuci19, @sharut_gupta, @zy27962986, @StefanieJegelka, @stats_stephen, Tommi Jaakkola Paper: https://t.co/wYbm5bAlZv Code: https://t.co/nbO1seYBvp
2 replies · 18 reposts · 158 likes
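As a rough illustration of the representation-guidance idea above, the sketch below adds an auxiliary loss that aligns the denoiser's intermediate features with features of the clean input from a frozen pretrained encoder (possibly of a different or synthetic modality). The projection head and the cosine-similarity objective are assumptions for illustration, not the paper's exact recipe:

```python
import torch.nn as nn
import torch.nn.functional as F

class RepresentationGuidance(nn.Module):
    """Auxiliary alignment loss: project the denoiser's intermediate features and
    pull them toward features produced by a frozen pretrained encoder."""

    def __init__(self, denoiser_dim: int, encoder_dim: int):
        super().__init__()
        self.proj = nn.Linear(denoiser_dim, encoder_dim)

    def forward(self, denoiser_hidden, target_features):
        # denoiser_hidden: (B, N, D) features from inside the diffusion model
        # target_features: (B, N, E) features from the pretrained encoder
        pred = self.proj(denoiser_hidden)
        return 1.0 - F.cosine_similarity(pred, target_features, dim=-1).mean()

# Training sketch: loss = diffusion_loss + lam * guidance(hidden, encoder_features)
```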
Excited to share our new work: StreamingVLM! 🚀 We tackle a major challenge for Vision-Language Models (VLMs): understanding infinite video streams in real-time without latency blowing up or running out of memory. Paper: https://t.co/G0bfwKCdZm Code: https://t.co/HqBoLMcrJF
31 replies · 162 reposts · 1K likes
Thanks for featuring our work!
SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models "we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood." "SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in
0 replies · 1 repost · 12 likes
Check out our NeurIPS 2025 paper on next semantic scale prediction for language modeling 😎 We enable a self-correction capability by introducing hierarchical semantic representations between the mask token and the word tokens. See more details in @zhuci19’s post and https://t.co/BzJmbrgQHb🎉
arxiv.org
In this paper we introduce Hierarchical Diffusion Language Models (HDLM) -- a novel family of discrete diffusion models for language modeling. HDLM builds on a hierarchical vocabulary where...
(1/5) Beyond Next-Token Prediction, introducing Next Semantic Scale Prediction! Our @NeurIPSConf 2025 paper HDLM is out! Check out the new language modeling paradigm: Next Semantic Scale Prediction via Hierarchical Diffusion Language Models. It largely generalizes
0 replies · 2 reposts · 22 likes
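To make "next semantic scale prediction" concrete, here is a toy, purely illustrative sketch of a hierarchical vocabulary in which each position is refined from [MASK] through a coarse semantic class to a word; the tiny `HIERARCHY` and `predict_fn` are hypothetical stand-ins for HDLM's learned components. Because the word is re-predicted at every scale, a coarse choice can still be revised at the next step, which is one way to read the self-correction claim above:

```python
# Toy hierarchy: [MASK] (scale 0) -> coarse semantic class (scale 1) -> word (scale 2).
HIERARCHY = {
    "cat": ["[MASK]", "<animal>", "cat"],
    "dog": ["[MASK]", "<animal>", "dog"],
    "ran": ["[MASK]", "<action>", "ran"],
}

def refine_one_scale(tokens, scales, predict_fn):
    """One decoding step: every position not yet at the word level moves one level
    down the hierarchy, using `predict_fn(tokens, i)` as a stand-in for the model's
    word prediction at position i given the current partially refined sequence."""
    new_tokens, new_scales = list(tokens), list(scales)
    for i, s in enumerate(scales):
        if s < 2:
            word = predict_fn(tokens, i)
            new_scales[i] = s + 1
            new_tokens[i] = HIERARCHY[word][s + 1]
    return new_tokens, new_scales

# refine_one_scale(["[MASK]", "[MASK]"], [0, 0], lambda toks, i: ["dog", "ran"][i])
# -> (["<animal>", "<action>"], [1, 1]); a second step yields (["dog", "ran"], [2, 2])
```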
Our recent CCDD paper on discrete language modeling is out: 📚Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner https://t.co/ChoCrMuIs3
4 replies · 26 reposts · 112 likes
In diffusion LMs, discrete methods have all but displaced continuous ones (🥲). Interesting new trend: why not both? Use continuous methods to make discrete diffusion better. Diffusion duality: https://t.co/KPO56vDygp CADD: https://t.co/CNOIWcUIMo CCDD:
New survey on diffusion language models: https://t.co/SHicf69gxV (via @NicolasPerezNi1). Covers pre/post-training, inference and multimodality, with very nice illustrations. I can't help but feel a bit wistful about the apparent extinction of the continuous approach after 2023🥲
9 replies · 73 reposts · 423 likes
[1/7] Paired multimodal learning shows that training with text can help vision models learn better image representations. But can unpaired data do the same? Our new work shows that the answer is yes! w/ @shobsund @ChenyuW64562111, Stefanie Jegelka and @phillip_isola
10 replies · 52 reposts · 433 likes
New work: “GLASS Flows: Transition Sampling for Alignment of Flow and Diffusion Models”. GLASS generates images by sampling stochastic Markov transitions with ODEs - allowing us to boost text-image alignment for large-scale models at inference time! https://t.co/unsuG3mYer [1/7]
3 replies · 58 reposts · 240 likes