Chenyu (Monica) Wang

@ChenyuW64562111

Followers
885
Following
397
Media
25
Statuses
106

PhD @MIT_CSAIL | Prev @AIatMeta @genentech @Tsinghua_Uni

Cambridge, MA
Joined September 2022
@ChenyuW64562111
Chenyu (Monica) Wang
7 days
Introducing SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models. We propose a new policy gradient algorithm, SPG, for diffusion large language models. SPG improves accuracy over the previous state-of-the-art RL methods by 3.6% on GSM8K, 2.6% on MATH500, 18.4%
3
28
123
@cranialxix
Bo Liu
1 day
Proud to have worked under Yuandong for the past year :) Just realized YD's account got deactivated before he could see my reply under his badge post: You reminded me why I chose this path: seeking the truth of nature/intelligence is the most beautiful and impactful thing we can do. Yuandong,
@tydsh
Yuandong Tian
2 days
Several of my team members and I are impacted by this layoff today. Feel free to connect :)
3
7
244
@ChenyuW64562111
Chenyu (Monica) Wang
2 days
So sad to hear this. It was a wonderful time working with Yuandong and the team over the summer. Wishing everyone all the best, and I hope our paths will cross again!
@tydsh
Yuandong Tian
2 days
Several of my team members and I are impacted by this layoff today. Feel free to connect :)
0
0
11
@tydsh
Yuandong Tian
2 days
Several of my team members and I are impacted by this layoff today. Feel free to connect :)
468
289
6K
@minimario1729
Alex Gu
5 days
✂️Introducing ProofOptimizer: a training and inference recipe for proof shortening! 😰AI-written formal proofs can be long and unreadable: Seed-Prover's proof of IMO '25 P1 is 16x longer in Lean than in English. Our 7B model shortens proofs generated by SoTA models by over 50%! 🧵⬇️
6
35
201
@zhuci19
Cai Zhou
5 days
(1/6) Check out our new paper: Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model A Latent Reasoner! arxiv: https://t.co/ldDqxufyG5 Do diffusion language models (DLMs) need to be discrete? No! We show that continuous diffusion models are more
arxiv.org
Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and primary empirical results showing the advantages...
2
29
101
@tydsh
Yuandong Tian
6 days
🚨🚨Great work from our intern @ChenyuW64562111! Our proposed SPG (Sandwiched Policy Gradient) is based on a very simple intuition: RL has both pos/neg samples, and we need to learn with both upper and lower bounds of the log-likelihood of the text diffusion model. Strong
@ChenyuW64562111
Chenyu (Monica) Wang
7 days
Introducing SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models. We propose a new policy gradient algorithm, SPG, for diffusion large language models. SPG improves accuracy over the previous state-of-the-art RL methods by 3.6% on GSM8K, 2.6% on MATH500, 18.4%
3
13
113
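To make the "sandwich" above concrete: the stated intuition is to weight each sampled completion by its advantage, pushing up a tractable lower bound of the (intractable) log-likelihood for positive-advantage samples and pushing down a tractable upper bound for negative-advantage ones. A rough sketch of such an objective, in notation assumed here for illustration rather than taken from the paper:

% Illustrative sketch only; symbols are assumptions, not the paper's notation.
% y ~ pi_theta(.|x): sampled completion, A(y): advantage (reward minus baseline),
% L_theta(y) <= log pi_theta(y|x) <= U_theta(y): tractable lower/upper bounds.
J_{\mathrm{SPG}}(\theta) \;=\; \mathbb{E}_{y \sim \pi_\theta}\big[\, \mathbf{1}[A(y) > 0]\, A(y)\, \mathcal{L}_\theta(y) \;+\; \mathbf{1}[A(y) \le 0]\, A(y)\, \mathcal{U}_\theta(y) \,\big]

Since A(y) is negative in the second term, maximizing this objective raises the lower bound for rewarded samples and lowers the upper bound for penalized ones, so each update works against a bound on the correct side of the true log-likelihood.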
@ChenyuW64562111
Chenyu (Monica) Wang
7 days
[4/n] Despite being trained with confidence-based semi-AR decoding and block-wise masking, SPG generalizes quite well to different inference strategies, even non-semi-AR ones.
1
1
5
@ChenyuW64562111
Chenyu (Monica) Wang
7 days
[3/n] The reward dynamics throughout training show that SPG achieves a rapid and steady increase in reward over the optimization steps, further demonstrating its efficiency and robustness.
1
1
4
@ChenyuW64562111
Chenyu (Monica) Wang
7 days
[2/n] SPG consistently outperforms the previous state-of-the-art RL methods for dLLMs across mathematical reasoning (GSM8K, MATH500) and logical reasoning (Countdown, Sudoku) benchmarks. Specifically, SPG improves accuracy over the previous state-of-the-art by 3.6% on GSM8K,
1
1
3
@ChenyuW64562111
Chenyu (Monica) Wang
7 days
[1/n] dLLMs are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, applying RL algorithms to dLLMs is challenging because of their intractable log-likelihood. SPG computes a more robust and less
1
1
8
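For readers wondering why the log-likelihood is called intractable here: a masked diffusion model only exposes per-token denoising predictions, so in practice one estimates a lower bound (an ELBO) by Monte Carlo, randomly masking the sequence at sampled noise levels and scoring the reconstruction. Below is a generic, assumption-laden sketch of such an estimator (the model interface is hypothetical, and this is not the SPG estimator itself):

import torch
import torch.nn.functional as F

def masked_diffusion_elbo(model, tokens, mask_id, n_samples=4):
    # Sketch only: `model(x)` is assumed to return per-position logits of shape
    # (seq_len, vocab); the 1/t weighting follows the usual masked-diffusion ELBO form.
    total = 0.0
    for _ in range(n_samples):
        t = torch.rand(()) * 0.999 + 0.001                 # random mask rate in (0, 1]
        mask = torch.rand(tokens.shape) < t                 # positions to mask at rate t
        x_t = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
        logits = model(x_t)
        logp = F.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
        total = total + (token_logp * mask).sum() / t       # score masked tokens, weight by 1/t
    return total / n_samples                                # Monte Carlo lower bound on log p(tokens)

Per the thread, SPG's contribution is not the bound estimate itself but how lower and upper bounds are combined inside the policy-gradient update.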
@ChenyuW64562111
Chenyu (Monica) Wang
11 days
Diffusion training can largely benefit from a good representation space. If you enjoy @sainingxie's RAE, you may also want to check out our REED paper👇 In our @NeurIPSConf 2025 paper, we find that such benefit can also come from the representation of a different (synthetic) modality (e.g.
arxiv.org
Diffusion models can be improved with additional guidance towards more effective representations of input. Indeed, prior empirical work has already shown that aligning internal representations of...
@ChenyuW64562111
Chenyu (Monica) Wang
3 months
Excited to share: “Learning Diffusion Models with Flexible Representation Guidance” With my amazing coauthors @zhuci19, @sharut_gupta, @zy27962986, @StefanieJegelka, @stats_stephen, Tommi Jaakkola Paper: https://t.co/wYbm5bAlZv Code: https://t.co/nbO1seYBvp
2
18
158
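The "representation guidance" here follows the general recipe described in the linked abstract: add an auxiliary loss that pulls an intermediate activation of the denoiser towards a target representation from a separate encoder. A hedged, generic sketch, with all names and interfaces assumed for illustration (this is not the REED training code):

import torch
import torch.nn.functional as F

def guided_diffusion_loss(denoiser, proj_head, target_encoder,
                          x0, x_t, t, noise, lam=0.5):
    # Generic sketch: `denoiser(x_t, t)` is assumed to return
    # (noise_prediction, intermediate_features); `target_encoder` supplies
    # the representation the features are aligned with.
    pred_noise, feats = denoiser(x_t, t)
    diffusion_loss = F.mse_loss(pred_noise, noise)          # standard denoising objective

    with torch.no_grad():
        target = target_encoder(x0)                         # frozen target representation
    aligned = proj_head(feats)                              # project features to target dim
    align_loss = 1.0 - F.cosine_similarity(aligned, target, dim=-1).mean()

    return diffusion_loss + lam * align_loss                # lam trades off the two terms

In the REED setting, per the tweet, the target representation can even come from a different (synthetic) modality; the single frozen encoder above only marks where such a target would plug in.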
@Guangxuan_Xiao
Guangxuan Xiao
11 days
Excited to share our new work: StreamingVLM! 🚀 We tackle a major challenge for Vision-Language Models (VLMs): understanding infinite video streams in real-time without latency blowing up or running out of memory. Paper: https://t.co/G0bfwKCdZm Code: https://t.co/HqBoLMcrJF
31
162
1K
@ChenyuW64562111
Chenyu (Monica) Wang
12 days
Thanks for featuring our work!
@iScienceLuvr
Tanishq Mathew Abraham, Ph.D.
12 days
SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models "we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood." "SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in
0
1
12
@ChenyuW64562111
Chenyu (Monica) Wang
12 days
Check out our NeurIPS 2025 paper on next semantic scale prediction for language modeling 😎 We enable self-correction capability by introducing hierarchical semantic representations between the mask and word tokens. See more details in @zhuci19's post and https://t.co/BzJmbrgQHb🎉
arxiv.org
In this paper we introduce Hierarchical Diffusion Language Models (HDLM) -- a novel family of discrete diffusion models for language modeling. HDLM builds on a hierarchical vocabulary where...
@zhuci19
Cai Zhou
12 days
(1/5) Beyond Next-Token Prediction, introducing Next Semantic Scale Prediction! Our @NeurIPSConf 2025 paper HDLM is out! Check out the new language modeling paradigm: Next Semantic Scale Prediction via Hierarchical Diffusion Language Models. It largely generalizes
0
2
22
@zdhnarsil
Dinghuai Zhang 张鼎怀
14 days
Our recent CCDD paper on discrete language modeling is out: 📚Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner https://t.co/ChoCrMuIs3
arxiv.org
Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and primary empirical results showing the advantages...
4
26
112
@sedielem
Sander Dieleman
15 days
In diffusion LMs, discrete methods have all but displaced continuous ones (🥲). Interesting new trend: why not both? Use continuous methods to make discrete diffusion better. Diffusion duality: https://t.co/KPO56vDygp CADD: https://t.co/CNOIWcUIMo CCDD:
arxiv.org
Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and primary empirical results showing the advantages...
@sedielem
Sander Dieleman
2 months
New survey on diffusion language models: https://t.co/SHicf69gxV (via @NicolasPerezNi1). Covers pre/post-training, inference and multimodality, with very nice illustrations. I can't help but feel a bit wistful about the apparent extinction of the continuous approach after 2023🥲
9
73
423
@sharut_gupta
Sharut Gupta
15 days
[1/7] Paired multimodal learning shows that training with text can help vision models learn better image representations. But can unpaired data do the same? Our new work shows that the answer is yes! w/ @shobsund @ChenyuW64562111, Stefanie Jegelka and @phillip_isola
10
52
433
@peholderrieth
Peter Holderrieth
16 days
New work: “GLASS Flows: Transition Sampling for Alignment of Flow and Diffusion Models”. GLASS generates images by sampling stochastic Markov transitions with ODEs - allowing us to boost text-image alignment for large-scale models at inference time! https://t.co/unsuG3mYer [1/7]
3
58
240