Haocheng Xi
@HaochengXiUCB
Followers
739
Following
831
Media
29
Statuses
107
First-year PhD in @berkeley_ai. Prev: Yao Class, @Tsinghua_Uni | Efficient Machine Learning & ML sys
Joined August 2024
🚀 Introducing Sparse VideoGen2 (SVG2) — Pareto-frontier video generation acceleration with semantic-aware sparse attention! 🏆Spotlight paper accepted by #NeurIPS2025 ✅ Training-free & plug-and-play ✅ Up to 2.5× faster on HunyuanVideo, 1.9× faster on Wan 2.1 ✅ SOTA quality
16
59
261
There is so much noise in the LLM RL space, so we sat down and ran everything at scale (so you don't have to 😜) and present to you "The Art of Scaling RL". Give this a read before starting your next RL run. Led by the amazing @Devvrit_Khatri @lovish
Wish to build scaling laws for RL but not sure how to scale? Or what scales? Or whether RL even scales predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs
3
20
222
The Convergence of "Understanding × Generation" in Long Video — Attention Sink ✨🎬🧠 We recently open-sourced two works related to long videos: long-video understanding StreamingVLM (https://t.co/o5MFULkjdR) and long-video generation LongLive (https://t.co/OAFQSlnlbg). Both
github.com
LongLive: Real-time Interactive Long Video Generation - NVlabs/LongLive
2
12
64
We open-sourced QeRL — Quantization-enhanced Reinforcement Learning! 🧠 4-bit quantized RL training 💪 Train a 32B LLM on a single H100 GPU ⚙️ 1.7× faster overall training 🎯 Accuracy on par with bfloat16 training 🔥 Supports the NVFP4 quantization format Moreover, we show
11
68
352
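A minimal sketch of the recipe as I read it from the post above: keep the base weights frozen in 4-bit and train only a small adapter during RL. The LoRA-style adapter and the uniform int4 quantizer below are illustrative assumptions; QeRL itself targets the NVFP4 format.

```python
# Sketch (not the QeRL implementation): frozen 4-bit base weights plus a
# trainable low-rank adapter that carries the RL policy update.
import torch
import torch.nn as nn

def quantize_int4(w: torch.Tensor):
    """Per-output-channel symmetric uniform quantization to the int4 range."""
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0   # int4 values: [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale

class QuantLinearLoRA(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        q, scale = quantize_int4(base.weight.data)
        self.register_buffer("q_weight", q)        # frozen 4-bit weights
        self.register_buffer("scale", scale)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):                           # bias omitted for brevity
        w = self.q_weight.float() * self.scale      # dequantize on the fly
        return x @ w.t() + (x @ self.lora_a.t()) @ self.lora_b.t()

layer = QuantLinearLoRA(nn.Linear(1024, 1024))
out = layer(torch.randn(2, 1024))                   # only LoRA params get grads
```

Storing the base model at 4 bits cuts weight memory to roughly a quarter of bf16, which is the kind of saving that lets a 32B model fit on a single H100.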
Excited to share our new work: StreamingVLM! 🚀 We tackle a major challenge for Vision-Language Models (VLMs): understanding infinite video streams in real-time without latency blowing up or running out of memory. Paper: https://t.co/G0bfwKCdZm Code: https://t.co/HqBoLMcrJF
31
163
1K
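The bounded-memory part of a streaming setup can be sketched with the attention-sink idea this feed mentions elsewhere: keep a few initial "sink" tokens plus a sliding window of recent KV entries, so the cache never grows. Cache sizes below are illustrative assumptions, not StreamingVLM's actual configuration.

```python
# Minimal sketch of a bounded KV cache for streaming inference.
import torch

class SinkWindowKVCache:
    def __init__(self, num_sink: int = 4, window: int = 1024):
        self.num_sink, self.window = num_sink, window
        self.k = None  # (seq, heads, dim)
        self.v = None

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=0)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=0)
        if self.k.shape[0] > self.num_sink + self.window:
            # Evict from the middle: sinks stay, oldest window tokens drop.
            self.k = torch.cat([self.k[:self.num_sink], self.k[-self.window:]])
            self.v = torch.cat([self.v[:self.num_sink], self.v[-self.window:]])

cache = SinkWindowKVCache()
for _ in range(3000):                        # simulate an unbounded stream
    cache.append(torch.randn(1, 8, 64), torch.randn(1, 8, 64))
assert cache.k.shape[0] <= 4 + 1024          # memory stays constant
```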
🚀Try out rCM—the most advanced diffusion distillation! ✅First to scale up sCM/MeanFlow to 10B+ video models ✅Open-sourced FlashAttention-2 JVP kernel & FSDP/CP support ✅High-quality, diverse videos in 2–4 steps Paper: https://t.co/xZZK25oIrJ Code: https://t.co/aPAo1MO0JQ
1
31
179
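For context on the "FlashAttention-2 JVP kernel" bullet: sCM/MeanFlow-style distillation losses differentiate the student along a tangent direction, i.e. they need a Jacobian-vector product through the forward pass. A minimal sketch with PyTorch's stock forward-mode API; the toy MLP is a stand-in for a 10B video model, and rCM's contribution is fusing this computation into the attention kernel itself.

```python
# JVP through a network in one forward-mode pass: out = f(x), dout = J_f(x) @ v.
import torch
from torch.func import jvp

net = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.SiLU(),
                          torch.nn.Linear(64, 64))

x = torch.randn(8, 64)      # noisy latents
v = torch.randn(8, 64)      # tangent direction, e.g. d(x_t)/dt

out, dout = jvp(lambda inp: net(inp), (x,), (v,))
print(out.shape, dout.shape)
```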
🥳We’re releasing StreamDiffusionV2 for the live-stream community—from individual creators with one GPU to enterprise platforms with many. StreamDiffusionV2 is our follow-up to StreamDiffusion: #StreamDiffusion powered real products, but temporal consistency still bugged us.
12
45
223
New @nvidia paper shows how to make text-to-image models render high-resolution images far faster without losing quality: 53x faster 4K generation on an H100, and 3.5 seconds on a 5090 with quantization for a 138x total speedup. It gets there by moving generation into a smaller, compressed latent image space.
11
73
421
We release DC-VideoGen, a new post-training framework for accelerating video diffusion models. Key features: 🎬 Supports video generation up to 2160×3840 (4K) resolution on a single H100 GPU ⚡ Delivers 14.8× faster inference than the base model while achieving comparable or
2
28
145
Changing the autoencoder in latent diffusion models is easier than you think. 🚀 Introducing DC-Gen – a post-training acceleration framework that works with any pre-trained diffusion model, boosting efficiency by transferring it into a deeply compressed latent space with
5
38
223
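Back-of-the-envelope arithmetic for why a deeply compressed latent space pays off (the downsampling factors below are illustrative, not DC-Gen's exact configuration): a diffusion transformer attends over (H/f)·(W/f) latent tokens, and attention cost grows roughly quadratically in that count.

```python
# Token-count arithmetic for latent-space compression at 4K resolution.
def tokens(h, w, f):
    return (h // f) * (w // f)

h, w = 3840, 2160                     # 4K frame
t8, t32 = tokens(h, w, 8), tokens(h, w, 32)
print(t8, t32, t8 / t32)              # ~16x fewer tokens at f=32 vs f=8
print((t8 / t32) ** 2)                # ~256x less attention compute
```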
🚀 SANA-Video: Linear Attention + Constant-Memory KV Cache = Fast Long Videos 💥 Key Features 🌟 🧠 Linear DiT everywhere → O(N) complexity on video-scale tokens 🧰 Constant-memory Block KV cache → store cumulative states only (no growing KV) 🔄 🎯 Temporal Mix-FFN + 3D RoPE
3
18
119
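The "store cumulative states only" bullet is the generic linear-attention recurrence: instead of keeping every past key/value, maintain running sums whose size never grows. A minimal sketch; the elu+1 feature map is a common textbook choice and an assumption here, not necessarily SANA-Video's kernel.

```python
# Constant-memory linear attention: S = sum_i phi(k_i) v_i^T, z = sum_i phi(k_i).
import torch

d = 64
S = torch.zeros(d, d)                 # cumulative state, constant size
z = torch.zeros(d)

def phi(x):                           # positive feature map
    return torch.nn.functional.elu(x) + 1

for _ in range(10000):                # arbitrarily long video token stream
    q, k, v = torch.randn(d), torch.randn(d), torch.randn(d)
    S += torch.outer(phi(k), v)       # O(d^2) update, independent of length
    z += phi(k)
    out = (phi(q) @ S) / (phi(q) @ z + 1e-6)   # output for this token
```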
🚀 Jet-Nemotron – Code & pre-trained checkpoints now available! ⚡️ Achieve up to 53.6× higher generation throughput on H100 GPUs with cost-efficient finetuning. 🔗 GitHub: https://t.co/XGX7MTMm7J 🔗 Hugging Face: https://t.co/AMEGIq5zOp 🔗 Paper:
arxiv.org
We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation...
4
39
174
🚀 We open-sourced LongLive — interactive, real-time long-video generation. 👥 Generates video in real time as users enter text prompts. ⚡️ 20.7 FPS on a single H100, ⏱️ up to 240s per clip. 🎬 Fine-tunes SOTA short-video models (e.g., Wan) into long-video generators. 🌍 One step
4
18
79
Efficient Kernels: Our algorithm is designed to be hardware-efficient. We further propose a Centroids Cache that reduces the overhead of k-means by exploiting redundancy between timesteps. Our dynamic attention kernel achieves the ideal performance on both Ampere and Hopper GPUs.
0
0
4
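A minimal sketch of how a Centroids Cache could exploit that redundancy (my reading of the tweet, not the released kernel): warm-start k-means at each diffusion step from the previous step's centroids, so it converges in a fraction of the iterations.

```python
# Warm-started k-means across diffusion timesteps.
import torch

def kmeans(x, centroids, iters):
    for _ in range(iters):
        assign = torch.cdist(x, centroids).argmin(dim=1)   # nearest centroid
        for c in range(centroids.shape[0]):
            mask = assign == c
            if mask.any():
                centroids[c] = x[mask].mean(dim=0)
    return centroids, assign

cache = None
for t in range(50):                               # diffusion timesteps
    q = torch.randn(4096, 64)                     # this step's query tokens
    if cache is None:
        init = q[torch.randperm(4096)[:32]].clone()
        cache, _ = kmeans(q, init, iters=10)      # cold start: many iters
    else:
        cache, _ = kmeans(q, cache, iters=1)      # warm start: ~1 iter
```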
Quality Results: We evaluate our method on HunyuanVideo-T2V-13B and Wan2.1-I2V/T2V-14B. SVG2 consistently achieves a superior trade-off between generation quality and efficiency, outperforming all baseline methods.
0
0
3
After creating semantic clusters, SVG2 approximates the importance of a cluster by calculating attention scores using only the cluster's centroid. This provides an accurate estimation of a cluster's importance with less than 1% overhead. With the estimated scores, SVG2 selects
0
0
4
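A minimal sketch of the centroid-based estimation described above: score each key cluster with a single centroid dot product instead of touching all its members. The top-k selection rule at the end is an illustrative stand-in, since the tweet is cut off before stating SVG2's actual selection criterion.

```python
# Estimate each cluster's attention mass from its centroid alone.
import torch

q = torch.randn(1, 64)                      # one query (or query centroid)
key_centroids = torch.randn(32, 64)         # one centroid per key cluster
cluster_sizes = torch.randint(64, 256, (32,))

# Softmax over centroid scores, weighted by how many keys each cluster holds.
scores = torch.softmax(q @ key_centroids.t() / 64 ** 0.5, dim=-1).squeeze(0)
est_mass = scores * cluster_sizes

selected = est_mass.topk(k=8).indices       # attend only to these clusters
print(selected)
```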
The k-means clustering algorithm effectively identifies which tokens are most similar to each other. This algorithm is applied to the Query (Q) and Key (K) vectors along the token dimension, sorting them into clusters of semantically similar tokens based on their hidden
0
0
4
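A minimal sketch of this clustering step: run k-means separately over the query and key token sets so every token gets a semantic cluster label. Shapes and cluster counts are illustrative assumptions.

```python
# Cluster Q and K tokens independently along the token dimension.
import torch

def kmeans_labels(x, k=32, iters=10):
    c = x[torch.randperm(x.shape[0])[:k]].clone()   # random init
    for _ in range(iters):
        labels = torch.cdist(x, c).argmin(dim=1)    # nearest centroid
        for j in range(k):
            m = labels == j
            if m.any():
                c[j] = x[m].mean(dim=0)
    return labels, c

Q = torch.randn(4096, 64)            # (tokens, head_dim) for one head
K = torch.randn(4096, 64)
q_labels, _ = kmeans_labels(Q)       # semantic cluster id per query token
k_labels, _ = kmeans_labels(K)
```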
Semantic-Aware Permutation directly solves these two main flaws by grouping tokens by their semantic meaning. This turns a scattered, inefficient problem into a dense one that is ideal for GPUs. Query and Key/Value adopt different permutations for better performance. (4/6)
0
0
6
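A minimal sketch of the permutation trick: sorting tokens by cluster label makes same-cluster tokens contiguous in memory, turning scattered critical tokens into the dense blocks GPU kernels want. Attention is permutation-equivariant, so the exact output is recovered by inverting the query permutation; the random labels below stand in for real k-means assignments.

```python
# Gather tokens into cluster-contiguous order, attend, then un-permute.
import torch

Q = torch.randn(4096, 64)
K = torch.randn(4096, 64)
V = torch.randn(4096, 64)
q_labels = torch.randint(0, 32, (4096,))   # stand-in cluster assignments
k_labels = torch.randint(0, 32, (4096,))

q_perm = torch.argsort(q_labels)           # queries grouped by cluster
kv_perm = torch.argsort(k_labels)          # keys/values grouped by cluster

Qp, Kp, Vp = Q[q_perm], K[kv_perm], V[kv_perm]
out_p = torch.softmax(Qp @ Kp.t() / 64 ** 0.5, dim=-1) @ Vp
out = out_p[torch.argsort(q_perm)]         # inverse permutation restores order
```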
Existing sparse methods aren't good enough for two main reasons: 1. Inaccurate identification – they cluster tokens by position instead of meaning, often grouping unrelated ones together. 2. Computation waste – critical tokens end up scattered in memory, forcing GPUs to calculate
0
0
5
Attention computation is naturally sparse. Only a small fraction of tokens (e.g., 13%), which we call "critical tokens", significantly impacts the final output. If we can compute only these critical parts, we can achieve a dramatic speedup without sacrificing much quality.
0
0
6
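A toy demonstration of the claim: keep only the top ~13% of attention scores per query and compare against dense attention. On random data an oracle top-k mask leaves a visible but modest gap; on real video attention, where scores are far more concentrated, the gap shrinks further, and the whole point of SVG2 is finding those tokens without computing the full score matrix.

```python
# Oracle top-k sparse attention vs. dense attention.
import torch

n, d = 1024, 64
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)

scores = Q @ K.t() / d ** 0.5
dense = torch.softmax(scores, dim=-1) @ V

k = int(0.13 * n)                                  # keep 13% of keys per query
thresh = scores.topk(k, dim=-1).values[:, -1:]     # k-th largest per row
masked = scores.masked_fill(scores < thresh, float("-inf"))
sparse = torch.softmax(masked, dim=-1) @ V

print((dense - sparse).norm() / dense.norm())      # relative approximation error
```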