Wenhao Chai
@wenhaocha1
Followers: 2K · Following: 7K · Media: 55 · Statuses: 732
Ph.D. Student @PrincetonCS with @liuzhuang1234. Prev @Stanford @UW @pika_labs @MSFTResearch @UofIllinois. I work on computer vision and more.
Princeton, NJ (NYC at times)
Joined January 2022
            @danielhanchen, glad you liked the post! You're spot on to suspect lower-level implementation issues. That's exactly what we found in the original blog. The disable_cascade_attn finding (Sec 4.2.4) was the symptom, but the root cause was that silent FlashAttention-2 kernel bug
          
          
@_arohan_ :) Original plots come from https://t.co/KOBqOoaeLq - their blog is super good too! - still unsure whether the FP16 vs BF16 debate comes down to hardware-level FP32 accumulation sizes - planning to run some experiments!
          
                
8 replies · 24 reposts · 338 likes
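A minimal sketch, not from the thread, of the kind of accumulation-precision experiment the reply above alludes to: keep a running sum in FP16 or BF16 (so rounding compounds at every step) versus FP32, and compare against an FP64 reference. The sizes and values are made up for illustration.

```python
# Hypothetical sketch: how much error does the accumulation dtype alone introduce?
import torch

torch.manual_seed(0)
x64 = torch.rand(4096, dtype=torch.float64) * 1e-3   # many small positive values
ref = x64.sum().item()                                # float64 reference sum

def running_sum(values: torch.Tensor, dtype: torch.dtype) -> float:
    # Keep the running total in `dtype`, so rounding happens at every step.
    acc = torch.zeros((), dtype=dtype)
    for v in values.to(dtype):
        acc = acc + v
    return acc.item()

for dtype in (torch.float16, torch.bfloat16, torch.float32):
    err = abs(running_sum(x64, dtype) - ref)
    print(f"{str(dtype):>15}  |running sum - fp64 ref| = {err:.3e}")
```

In this toy setting BF16 drifts more than FP16 because it trades mantissa bits for FP32's exponent range, which is exactly why explicit FP32 accumulation matters for low-precision training.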
              
             Had fun contributing a bit to this project! I especially liked this - masked diffusion (any-order generation) can be better than fixed-order AR on problems without a canonical ordering 
           I am excited to share a work we did in the Discovery team at @GoogleDeepMind using RL and generative models to discover creative chess puzzles 🔊♟️♟️ #neurips2025 🎨While strong chess players intuitively recognize the beauty of a position, articulating the precise elements that 
            
                
2 replies · 5 reposts · 50 likes
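As a toy illustration of the any-order point above (my own sketch, not the paper's code): a masked-diffusion-style decoder can fill in whichever masked position it is currently most confident about, instead of committing to a left-to-right order. The `logits_fn` model call below is a stand-in.

```python
import torch

def any_order_unmask(logits_fn, length: int, mask_id: int) -> torch.Tensor:
    """Greedy any-order decoding: at each step, commit the most confident masked slot."""
    seq = torch.full((length,), mask_id, dtype=torch.long)
    for _ in range(length):
        probs = logits_fn(seq).softmax(dim=-1)   # (length, vocab): per-slot beliefs
        conf, tok = probs.max(dim=-1)            # best token and confidence per position
        conf[seq != mask_id] = -1.0              # never overwrite already-decoded positions
        pos = int(conf.argmax())                 # pick the most confident masked slot
        seq[pos] = tok[pos]
    return seq

# Stand-in "model": random logits, just to exercise the interface.
length, vocab = 8, 16
mask_id = vocab  # reserve an id outside the output vocab for [MASK]
print(any_order_unmask(lambda seq: torch.randn(length, vocab), length, mask_id))
```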
              
How to organize your talk? I used to present like this, thinking that I was being "academic", "organized", and "professional". BUT, from the audience's viewpoint, this sucks. 😱 Look how long they have to hold context in their heads just to make sense of what you're saying!
          
                
5 replies · 24 reposts · 338 likes
              
Thrilled to release our new paper: “Scaling Latent Reasoning via Looped Language Models.” TLDR: we scale looped language models to 2.6 billion parameters, pretrained on >7 trillion tokens. The resulting model is on par with SOTA language models 2 to 3x its size.
          
                
20 replies · 117 reposts · 573 likes
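A rough sketch of the looped-LM idea as I read the announcement above: a shared stack of layers is applied repeatedly, so "depth" of computation grows without adding parameters. The sizes, loop count, and the omitted causal mask are simplifications, not the released 2.6B architecture.

```python
import torch
import torch.nn as nn

class LoopedLM(nn.Module):
    def __init__(self, vocab=32000, d_model=512, n_head=8, n_layers=4, n_loops=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_head, 4 * d_model, batch_first=True)
        self.core = nn.TransformerEncoder(layer, n_layers)  # one small stack of layers...
        self.n_loops = n_loops                               # ...reused n_loops times
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        h = self.embed(tokens)
        for _ in range(self.n_loops):   # extra compute comes from looping, not new weights
            h = self.core(h)
        return self.head(h)

model = LoopedLM()
print(model(torch.randint(0, 32000, (2, 16))).shape)   # torch.Size([2, 16, 32000])
```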
              
             Might also be interested in checking our TARFlow series! TARFlow:  https://t.co/Gb7NETqEw2  ICML2025 Oral STARFlow:  https://t.co/bpkY7SYx4z  NeurIPS2025 Spotlight TARFlow-LM:  https://t.co/BLHoXt9m5Q  NeurIPS 2025 … and more maybe soon🤖 
          
            
arxiv.org: Autoregressive models have driven remarkable progress in language modeling. Their foundational reliance on discrete tokens, unidirectional context, and single-pass decoding, while central to their...
            
                
0 replies · 10 reposts · 100 likes
              
             Please steal my AI research ideas. This is a list of research questions and concrete experiments I would love to see done, but don't have bandwidth to get to. If you are looking to break into AI research (e.g. as an undergraduate, or a software engineer in industry), these are 
          
                
47 replies · 203 reposts · 2K likes
              
            
@wenhaocha1 Thanks, Wenhao! Really appreciate your recognition, and I feel really lucky to have met you back in the early days when we were all starting to develop multimodal models - so many new models, datasets, and discussions, bringing new insights to everyone. From lmms-eval to
          
          
                
0 replies · 1 repost · 2 likes
              
             Back in 2024, LMMs-Eval built a complete evaluation ecosystem for the MLLM/LMM community, with countless researchers contributing their models and benchmarks to raise the whole edifice. I was fortunate to be one of them: our series of video-LMM works (MovieChat, AuroraCap, VDC) 
           Throughout my journey in developing multimodal models, I’ve always wanted a framework that lets me plug & play modality encoders/decoders on top of an auto-regressive LLM. I want to prototype fast, try new architectures, and have my demo files scale effortlessly — with full 
            
                
2 replies · 3 reposts · 29 likes
              
             Stanford NLP 25th Anniversary🤩🤩🤩 
           Today, we’re overjoyed to have a 25th Anniversary Reunion of @stanfordnlp. So happy to see so many of our former students back at @Stanford. And thanks to @StanfordHAI for the venue! 
            
                
9 replies · 39 reposts · 600 likes
              
             Throughout my journey in developing multimodal models, I’ve always wanted a framework that lets me plug & play modality encoders/decoders on top of an auto-regressive LLM. I want to prototype fast, try new architectures, and have my demo files scale effortlessly — with full 
          
                
9 replies · 33 reposts · 104 likes
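A minimal sketch of what a plug-and-play modality interface could look like (my own illustration, not the framework's actual API; `ModalityEncoder` and all shapes are invented): each pretrained backbone gets a small projection into the LLM's embedding space, so swapping modalities only swaps that wrapper.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Wraps any pretrained backbone and projects its features into the LLM embedding space."""
    def __init__(self, backbone: nn.Module, feat_dim: int, llm_dim: int):
        super().__init__()
        self.backbone = backbone
        self.proj = nn.Linear(feat_dim, llm_dim)   # the only new trainable piece per modality

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)                    # (batch, n_tokens, feat_dim)
        return self.proj(feats)                     # (batch, n_tokens, llm_dim)

llm_dim = 768
# Stand-in backbone (Identity) in place of a real ViT/audio encoder producing patch features.
image_encoder = ModalityEncoder(nn.Identity(), feat_dim=1024, llm_dim=llm_dim)
image_tokens = image_encoder(torch.randn(2, 196, 1024))
text_tokens = torch.randn(2, 32, llm_dim)                  # embeddings from the LLM side
llm_input = torch.cat([image_tokens, text_tokens], dim=1)  # spliced in front of the text
print(llm_input.shape)                                      # torch.Size([2, 228, 768])
```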
              
             end-to-end training just makes latent diffusion transformers better! with repa-e, we showed the power of end-to-end training on imagenet. today we are extending it to text-to-image (T2I) generation. #ICCV2025 🌴 🚨 Introducing "REPA-E for T2I: family of end-to-end tuned VAEs for 
          
                
1 reply · 17 reposts · 42 likes
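Schematically, "end-to-end" here means the diffusion loss is allowed to update the VAE encoder as well, rather than training the denoiser on frozen latents. The step below is my simplified paraphrase (a plain noise-prediction loss with no representation-alignment term), not the REPA-E objective.

```python
import torch
import torch.nn.functional as F

def end_to_end_step(vae_encoder, denoiser, images, optimizer):
    """One schematic training step where gradients reach both the denoiser and the encoder."""
    z = vae_encoder(images)                         # latents (B, C, H, W); note: no .detach()
    t = torch.rand(z.shape[0], device=z.device)     # random interpolation time in [0, 1)
    noise = torch.randn_like(z)
    tv = t.view(-1, 1, 1, 1)
    z_noisy = (1.0 - tv) * z + tv * noise           # simple linear noising schedule
    loss = F.mse_loss(denoiser(z_noisy, t), noise)  # noise-prediction objective
    optimizer.zero_grad()
    loss.backward()                                 # encoder parameters get gradients too
    optimizer.step()
    return loss.item()
```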
              
             Congrats to FlowEdit for winning #ICCV2025 Best Student Paper. “Inversion-free” is a very cool idea. We proposed the first inversion-free, optimization-free, and model-agnostic framework (for latent diffusion and consistency models) back at CVPR 2024 (  https://t.co/zMrIfyVFpq). 
          
           Want to edit your image with language descriptions in less than 3s? Ever questioned the need for prolonged inversion in text-guided editing? We are happy to release ♾ InfEdit (with demo), a flexible framework for fast, faithful and consistent editing. 🔗  https://t.co/NwZvoEh7ho 
            
            
                
4 replies · 42 reposts · 297 likes
              
             I’ve always wanted to write an open-notebook research blog to (i) show the chain of thought behind how we formed hypotheses, designed experiments, and articulated findings, and (ii) lay out all the intermediate results that did not make it into the final paper, including negative 
          
                
4 replies · 42 reposts · 221 likes
              
Our paper Video-MMLU has been awarded Outstanding Paper at the ICCV Workshop! I happened to receive this wonderful news while soaking in the water and couldn't be happier! Huge thanks to the Knowledge-Intensive Multimodal Reasoning Workshop Committee for the honor.
🎉 Introducing Video-MMLU, a new benchmark for evaluating large multimodal models on classroom-style lectures in math, physics, and chemistry! 🧑🏫📚Video-MMLU requires stronger reasoning capabilities and world knowledge than previous benchmarks for video LMMs.
            
                
4 replies · 7 reposts · 79 likes
              
             Paper:  https://t.co/wvMxdVTvHl  Blog:  https://t.co/2TPDgzrygX  (you can also see updated LiveCodeBench Pro leaderboard) How to start evaluation:  https://t.co/zi441pmL9z  Amazing work led by @ZhouShang for AutoCode and @ZihanZheng71803 for Eval toolkit. 
          
            
arxiv.org: Writing competitive programming problems is exacting. Authors must: set constraints, input distributions, and edge cases that rule out shortcuts; target specific algorithms (e.g., max-flow,...
            
                
0 replies · 3 reposts · 11 likes
              
LiveCodeBench Pro remains one of the most challenging code benchmarks, but its evaluation and verification process is still a black box. We introduce AutoCode, which democratizes evaluation by letting anyone run verification locally and perform RL training! For the first time,
          
                
4 replies · 29 reposts · 124 likes
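Not AutoCode's actual interface; just a bare-bones sketch of what "run verification locally" means for a competitive-programming problem: execute a candidate solution on each test case and diff its stdout against the expected output (a real judge would add special checkers, memory limits, and so on).

```python
import subprocess

def verify(solution_cmd, test_cases, timeout=2.0):
    """solution_cmd: e.g. ["python", "solution.py"]; test_cases: list of (stdin, expected_stdout)."""
    passed = 0
    for stdin, expected in test_cases:
        try:
            result = subprocess.run(solution_cmd, input=stdin, capture_output=True,
                                    text=True, timeout=timeout)
            ok = result.returncode == 0 and result.stdout.strip() == expected.strip()
        except subprocess.TimeoutExpired:
            ok = False                      # time-limit exceeded counts as a failed case
        passed += ok
    return passed, len(test_cases)

# Example: verify(["python", "solution.py"], [("1 2\n", "3\n")])
```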
              
             So excited to be part of the team bringing the 1st Multimodal Spatial Intelligence (MUSI) workshop to @ICCVConference, with a huge shout-out to @songyoupeng for leading the effort! We've put together an incredible program. If you'll be at ICCV, you should definitely stop by! 🗓️ 
           📣 Announcing MUSI: 1st Multimodal Spatial Intelligence Workshop @ICCVConference! 🎙️All-star keynotes: @sainingxie, @ManlingLi_, @RanjayKrishna, @yuewang314, and @QianqianWang5 - plus a panel on the future of the field! 🗓 Oct 20, 1pm-5:30pm HST 🔗  https://t.co/wZaWKRIcYI 
            
            
                
0 replies · 6 reposts · 29 likes
              
The work opened my eyes. Since starting my PhD, I've been studying visual representations for understanding and generation. I long thought pretrained vision encoders (CLIP, DINO, etc.) produced features too semantic for generation/reconstruction, but that's not true! These features
           three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n) 
            
                
13 replies · 44 reposts · 486 likes
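As a sketch of the representation-autoencoder idea described in the quoted thread (not the paper's code; the tiny conv modules below are stand-ins): keep a pretrained encoder frozen and train only a decoder that maps its features back to pixels, so generation can later operate in that representation space.

```python
import torch
import torch.nn as nn

class RepresentationAutoencoder(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():     # the pretrained encoder stays frozen
            p.requires_grad_(False)
        self.decoder = decoder                  # only this part is trained

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            z = self.encoder(images)            # semantic features (a la CLIP/DINO)
        return self.decoder(z)                  # reconstruct pixels from those features

# Smoke test with stand-in modules (a real setup would use a ViT encoder and a larger decoder).
enc = nn.Conv2d(3, 8, kernel_size=16, stride=16)
dec = nn.ConvTranspose2d(8, 3, kernel_size=16, stride=16)
rae = RepresentationAutoencoder(enc, dec)
print(rae(torch.randn(1, 3, 64, 64)).shape)     # torch.Size([1, 3, 64, 64])
# Training would minimize e.g. F.mse_loss(rae(images), images) w.r.t. the decoder only.
```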