Wenhao Chai

@wenhaocha1

Followers: 2K
Following: 7K
Media: 55
Statuses: 732

Ph.D. Student @PrincetonCS with @liuzhuang1234. Prev @Stanford @UW @pika_labs @MSFTResearch @UofIllinois. I work on computer vision and more.

Princeton, NJ (NYC at times)
Joined January 2022
@RichardYRLi
Yingru Li
2 days
@danielhanchen, glad you liked the post! You're spot on to suspect lower-level implementation issues. That's exactly what we found in the original blog. The disable_cascade_attn finding (Sec 4.2.4) was the symptom, but the root cause was that silent FlashAttention-2 kernel bug
@danielhanchen
Daniel Han
2 days
@_arohan_ :) Original plots come from https://t.co/KOBqOoaeLq - also their blog is super good! - still unsure if the FP16 vs BF16 debate is due to hardware issues with FP32 accumulation sizes - planning to run some experiments!
8
24
338
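A minimal, hypothetical sketch (not from the thread) of one precision trade-off behind the FP16 vs BF16 debate: per-element rounding error against an fp32 reference. bf16 keeps fp32's exponent range but has fewer mantissa bits, so it rounds more coarsely than fp16.

```python
# Hypothetical illustration (not from the thread): compare rounding error
# when casting fp32 values down to fp16 vs bf16 and back up.
import torch

torch.manual_seed(0)
x = torch.rand(1_000_000, dtype=torch.float32)  # fp32 reference values in [0, 1)

for dtype in (torch.float16, torch.bfloat16):
    roundtrip = x.to(dtype).to(torch.float32)      # cast down, then back up
    err = (x - roundtrip).abs().mean().item()      # mean absolute rounding error
    print(f"{dtype}: mean abs rounding error = {err:.3e}")
```

This only probes representation error; whether a given kernel accumulates in fp32 would need separate experiments, as the tweet suggests.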
@thjashin
Jiaxin Shi
5 days
Had fun contributing a bit to this project! I especially liked this - masked diffusion (any-order generation) can be better than fixed-order AR on problems without a canonical ordering
@TZahavy
Tom Zahavy
6 days
I am excited to share a work we did in the Discovery team at @GoogleDeepMind using RL and generative models to discover creative chess puzzles 🔊♟️♟️ #neurips2025 🎨 While strong chess players intuitively recognize the beauty of a position, articulating the precise elements that
2
5
50
@jbhuang0604
Jia-Bin Huang
6 days
How to organize your talk? I used to present like this, thinking that I was being "academic", "organized", and "professional". BUT, from the audience's viewpoint, this sucks. 😱 Look how long they need to hold long-term context just to make sense of what you're saying!
5
24
338
@RidgerZhu
Rui-Jie (Ridger) Zhu
5 days
Thrilled to release our new paper: “Scaling Latent Reasoning via Looped Language Models.” TLDR: We scale looped language models up to 2.6 billion parameters and pretrain them on >7 trillion tokens. The resulting model is on par with SOTA language models 2 to 3x its size.
20
117
573
@thoma_gu
Jiatao Gu
6 days
Might also be interested in checking our TARFlow series! TARFlow: https://t.co/Gb7NETqEw2 ICML2025 Oral STARFlow: https://t.co/bpkY7SYx4z NeurIPS2025 Spotlight TARFlow-LM: https://t.co/BLHoXt9m5Q NeurIPS 2025 … and more maybe soon🤖
arxiv.org
Autoregressive models have driven remarkable progress in language modeling. Their foundational reliance on discrete tokens, unidirectional context, and single-pass decoding, while central to their...
@jm_alexia
Alexia Jolicoeur-Martineau
6 days
Normalizing Flow is back!
0
10
100
@tanishqkumar07
Tanishq Kumar
7 days
Please steal my AI research ideas. This is a list of research questions and concrete experiments I would love to see done, but don't have bandwidth to get to. If you are looking to break into AI research (e.g. as an undergraduate, or a software engineer in industry), these are
47
203
2K
@BoLi68567011
Brian Bo Li
8 days
@wenhaocha1 Thanks, Wenhao! Really appreciate your recognition, and I was really lucky to meet you back in the early days when we were all starting to develop multimodal models - so many new models, datasets, and discussions, bringing new insights to everyone. From lmms-eval to
0
1
2
@wenhaocha1
Wenhao Chai
8 days
Back in 2024, LMMs-Eval built a complete evaluation ecosystem for the MLLM/LMM community, with countless researchers contributing their models and benchmarks to raise the whole edifice. I was fortunate to be one of them: our series of video-LMM works (MovieChat, AuroraCap, VDC)
@BoLi68567011
Brian Bo Li
10 days
Throughout my journey in developing multimodal models, I’ve always wanted a framework that lets me plug & play modality encoders/decoders on top of an auto-regressive LLM. I want to prototype fast, try new architectures, and have my demo files scale effortlessly — with full
2
3
29
@wenhaocha1
Wenhao Chai
8 days
world model
@Tesla
Tesla
10 days
To push self-driving into situations wilder than reality, we built a neural network world simulator that can create entirely synthetic worlds for the Tesla to drive in. Video below is fully generated & not a real video
5
9
210
@Diyi_Yang
Diyi Yang
10 days
Stanford NLP 25th Anniversary🤩🤩🤩
@stanfordnlp
Stanford NLP Group
10 days
Today, we’re overjoyed to have a 25th Anniversary Reunion of @stanfordnlp. So happy to see so many of our former students back at @Stanford. And thanks to @StanfordHAI for the venue!
9
39
600
@BoLi68567011
Brian Bo Li
10 days
Throughout my journey in developing multimodal models, I’ve always wanted a framework that lets me plug & play modality encoders/decoders on top of an auto-regressive LLM. I want to prototype fast, try new architectures, and have my demo files scale effortlessly — with full
9
33
104
@markchen90
Mark Chen
13 days
@josh1yan I joined @OpenAI as a resident. First, get the fundamentals down. If there's one subject you need to know inside and out, it's linear algebra. Read and understand a classic textbook like Bishop's Pattern Recognition and Machine Learning. Then, take on an ambitious project. I
7
30
700
@1jaskiratsingh
Jaskirat Singh @ ICCV2025🌴
13 days
end-to-end training just makes latent diffusion transformers better! with repa-e, we showed the power of end-to-end training on imagenet. today we are extending it to text-to-image (T2I) generation. #ICCV2025 🌴 🚨 Introducing "REPA-E for T2I: family of end-to-end tuned VAEs for
1
17
42
@ziqiao_ma
Martin Ziqiao Ma
14 days
Congrats to FlowEdit for winning #ICCV2025 Best Student Paper. “Inversion-free” is a very cool idea. We proposed the first inversion-free, optimization-free, and model-agnostic framework (for latent diffusion and consistency models) back at CVPR 2024 ( https://t.co/zMrIfyVFpq).
@ziqiao_ma
Martin Ziqiao Ma
2 years
Want to edit your image with language descriptions in less than 3s? Ever questioned the need for prolonged inversion in text-guided editing? We are happy to release ♾ InfEdit (with demo), a flexible framework for fast, faithful and consistent editing. 🔗 https://t.co/NwZvoEh7ho
4
42
297
@ziqiao_ma
Martin Ziqiao Ma
14 days
I’ve always wanted to write an open-notebook research blog to (i) show the chain of thought behind how we formed hypotheses, designed experiments, and articulated findings, and (ii) lay out all the intermediate results that did not make it into the final paper, including negative
4
42
221
@wenhaocha1
Wenhao Chai
15 days
Our paper Video-MMLU has been awarded Outstanding Paper at the ICCV Workshop! I happened to receive this wonderful news while soaking in the water, and I couldn't be happier! Huge thanks to the Knowledge-Intensive Multimodal Reasoning Workshop Committee for the honor.
@EnxinSong
Enxin Song
6 months
🎉 Introducing Video-MMLU, a new benchmark for evaluating large multimodal models on classroom-style lectures in math, physics, and chemistry! 🧑‍🏫📚Video-MMLU requires strong reasoning capabilities and world knowledge compared to the previous benchmarks for video LMMs.
4
7
79
@wenhaocha1
Wenhao Chai
19 days
LiveCodeBench Pro remains one of the most challenging code benchmarks, but its evaluation and verification process is still a black box. We introduce AutoCode, which democratizes evaluation by allowing anyone to run verification locally and perform RL training! For the first time,
4
29
124
@jihanyang13
Jihan Yang
27 days
So excited to be part of the team bringing the 1st Multimodal Spatial Intelligence (MUSI) workshop to @ICCVConference, with a huge shout-out to @songyoupeng for leading the effort! We've put together an incredible program. If you'll be at ICCV, you should definitely stop by! 🗓️
@songyoupeng
Songyou Peng
27 days
📣 Announcing MUSI: 1st Multimodal Spatial Intelligence Workshop @ICCVConference! 🎙️All-star keynotes: @sainingxie, @ManlingLi_, @RanjayKrishna, @yuewang314, and @QianqianWang5 - plus a panel on the future of the field! 🗓 Oct 20, 1pm-5:30pm HST 🔗 https://t.co/wZaWKRIcYI
0
6
29
@TongPetersb
Peter Tong
21 days
The work opened my eyes. Since my PhD, I've been studying visual representations for understanding and generation. I long thought pretrained vision encoders (CLIP, DINO, etc.) produced features too semantic for generation/reconstruction, but that's not true! These features
@sainingxie
Saining Xie
21 days
three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
13
44
486