Haocheng Xi Profile
Haocheng Xi

@HaochengXiUCB

Followers: 739
Following: 831
Media: 29
Statuses: 107

First-year PhD in @berkeley_ai. Prev: Yao Class, @Tsinghua_Uni | Efficient Machine Learning & ML sys

Joined August 2024
@HaochengXiUCB
Haocheng Xi
1 month
🚀 Introducing Sparse VideoGen2 (SVG2) — Pareto-frontier video generation acceleration with semantic-aware sparse attention! 🏆 Spotlight paper accepted at #NeurIPS2025 ✅ Training-free & plug-and-play ✅ Up to 2.5× faster on HunyuanVideo, 1.9× faster on Wan 2.1 ✅ SOTA quality
16
59
261
@rish2k1
Rishabh Tiwari
23 days
There is so much noise in the LLM RL space, so we sat down and ran everything at scale (so you don't have to 😜) and present to you “The Art of Scaling RL”. Give this a read before starting your next RL run. Led by the amazing @Devvrit_Khatri @lovish
@Devvrit_Khatri
Devvrit
23 days
Wish to build scaling laws for RL but not sure how to scale? Or what scales? Or whether RL even scales predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs
3
20
222
@yukangchen_
Yukang Chen
23 days
The Convergence of “Understanding × Generation” in Long Video — Attention Sink ✨🎬🧠 We recently open-sourced two works related to long videos: long-video understanding StreamingVLM (https://t.co/o5MFULkjdR) and long-video generation LongLive (https://t.co/OAFQSlnlbg). Both
github.com
LongLive: Real-time Interactive Long Video Generation - NVlabs/LongLive
2
12
64
@yukangchen_
Yukang Chen
25 days
We open-sourced QeRL — Quantization-enhanced Reinforcement Learning! 🧠 4-bit quantized RL training 💪 Train a 32B LLM on a single H100 GPU ⚙️ 1.7× faster overall training 🎯 Accuracy on par with bfloat16 🔥 Supports NVFP4 quantization format Moreover, we show
11
68
352
@Guangxuan_Xiao
Guangxuan Xiao
25 days
Excited to share our new work: StreamingVLM! 🚀 We tackle a major challenge for Vision-Language Models (VLMs): understanding infinite video streams in real time without latency blowing up or running out of memory. Paper: https://t.co/G0bfwKCdZm Code: https://t.co/HqBoLMcrJF
31
163
1K
@zkwthu
Kaiwen Zheng
29 days
🚀Try out rCM—the most advanced diffusion distillation! ✅First to scale up sCM/MeanFlow to 10B+ video models ✅Open-sourced FlashAttention-2 JVP kernel & FSDP/CP support ✅High-quality, diverse videos in 2~4 steps Paper: https://t.co/xZZK25oIrJ Code: https://t.co/aPAo1MO0JQ
1
31
179
@Chenfeng_X
Chenfeng_X
1 month
🥳We’re releasing StreamDiffusionV2 for the live-stream community—from individual creators with one GPU to enterprise platforms with many. StreamDiffusionV2 is our follow-up to StreamDiffusion: #StreamDiffusion powered real products, but temporal consistency still bugged us.
12
45
223
@rohanpaul_ai
Rohan Paul
1 month
New @nvidia paper shows how to make text-to-image models render high-resolution images far faster without losing quality. 53x faster 4K on an H100, 3.5 seconds on a 5090 with quantization for a 138x total speedup. The speedup comes from moving generation into a smaller hidden image space.
11
73
420
@hancai_hm
Han Cai
1 month
We release DC-VideoGen, a new post-training framework for accelerating video diffusion models. Key features: 🎬 Supports video generation up to 2160×3840 (4K) resolution on a single H100 GPU ⚡ Delivers 14.8× faster inference than the base model while achieving comparable or
2
28
145
@hancai_hm
Han Cai
1 month
Changing the autoencoder in latent diffusion models is easier than you think. 🚀 Introducing DC-Gen – a post-training acceleration framework that works with any pre-trained diffusion model, boosting efficiency by transferring it into a deeply compressed latent space with
5
38
223
@xieenze_jr
Enze Xie
1 month
🚀 SANA-Video: Linear Attention + Constant-Memory KV Cache = Fast Long Videos 💥 Key Features 🌟 🧠 Linear DiT everywhere → O(N) complexity on video-scale tokens 🧰 Constant-memory Block KV cache → store cumulative states only (no growing KV) 🔄 🎯 Temporal Mix-FFN + 3D RoPE
3
18
119
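For context on the constant-memory claim: linear attention admits a recurrence that keeps a single running state instead of a growing key/value cache. The sketch below is the generic formulation of that recurrence (the feature map, shapes, and function names are assumptions for illustration), not SANA-Video's actual Block KV cache kernel.

```python
import torch

def phi(x):
    # positive feature map used by many linear-attention variants (assumed choice)
    return torch.nn.functional.elu(x) + 1

def linear_attention_step(q_t, k_t, v_t, S, z):
    """Process one token. q_t, k_t: [D], v_t: [Dv]; S: [D, Dv] running state; z: [D] normalizer."""
    S = S + torch.outer(phi(k_t), v_t)            # accumulate key-value statistics
    z = z + phi(k_t)                              # accumulate normalizer
    out = (phi(q_t) @ S) / (phi(q_t) @ z + 1e-6)  # output for this token
    return out, S, z

D, Dv = 64, 64
S, z = torch.zeros(D, Dv), torch.zeros(D)
for _ in range(1000):                             # arbitrarily long stream; memory stays O(D*Dv)
    q_t, k_t, v_t = torch.randn(D), torch.randn(D), torch.randn(Dv)
    out, S, z = linear_attention_step(q_t, k_t, v_t, S, z)
```

Because the state `S` has a fixed size, per-token cost stays constant regardless of video length, which is the property described above as "store cumulative states only (no growing KV)".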
@hancai_hm
Han Cai
1 month
🚀 Jet-Nemotron – Code & pre-trained checkpoints now available! ⚡️ Achieve up to 53.6× higher generation throughput on H100 GPUs with cost-efficient finetuning. 🔗 GitHub: https://t.co/XGX7MTMm7J 🔗 Hugging Face: https://t.co/AMEGIq5zOp 🔗 Paper:
arxiv.org
We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation...
4
39
174
@yukangchen_
Yukang Chen
1 month
🚀 We open-sourced LongLive — interactive, real-time long-video generation. 👥 Generates video in real time as users enter text prompts. ⚡️ 20.7 FPS on a single H100, ⏱️ up to 240s per clip. 🎬 Fine-tunes SOTA short-video models (e.g., Wan) into long-video generators. 🌍 One step
4
18
79
@HaochengXiUCB
Haocheng Xi
1 month
Efficient Kernels: Our algorithm works in a hardware-efficient manner. We further propose Centroids Cache, which reduces the overhead of k-means by exploiting redundancy between timesteps. Our dynamic attention kernel reaches the ideal performance on both Ampere and Hopper
0
0
4
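One way to picture the Centroids Cache idea, sketched under assumptions (the loop structure, shapes, and single-iteration warm start are illustrative, not the released kernel): carry the previous denoising step's centroids forward and refine them with one k-means pass instead of re-clustering from scratch.

```python
import torch

def kmeans_step(x, centroids):
    """One assignment + update pass of k-means. x: [N, D] tokens, centroids: [C, D]."""
    labels = torch.cdist(x, centroids).argmin(dim=-1)      # assign tokens to nearest centroid
    for c in range(centroids.shape[0]):
        members = x[labels == c]
        if len(members) > 0:
            centroids[c] = members.mean(dim=0)             # refine centroid
    return labels, centroids

cache = torch.randn(32, 64)                  # centroids carried over from the previous timestep
for t in range(50):                          # denoising loop (stand-in for the diffusion sampler)
    q_t = torch.randn(4096, 64)              # this step's per-head query tokens (stand-in)
    labels, cache = kmeans_step(q_t, cache)  # warm start: one pass instead of a cold restart
```

Since adjacent diffusion timesteps produce similar Q/K, a warm-started clustering needs far fewer passes, which is where the claimed reduction in k-means overhead comes from.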
@HaochengXiUCB
Haocheng Xi
1 month
Quality Results: We evaluate our method on HunyuanVideo-T2V-13B and Wan2.1-I2V/T2V-14B. SVG2 consistently achieves a superior trade-off between generation quality and efficiency, outperforming all baseline methods.
0
0
3
@HaochengXiUCB
Haocheng Xi
1 month
After creating semantic clusters, SVG2 approximates a cluster's importance by computing attention scores using only the cluster's centroid. This yields an accurate estimate of each cluster's importance with less than 1% overhead. With the estimated scores, SVG2 selects
0
0
4
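One plausible reading of that selection step, sketched with assumed names and ratios (`select_key_clusters` and the 20% keep fraction are illustrative, not SVG2's interface): score key clusters by centroid-level attention, weight by cluster size, and keep only the highest-scoring clusters for the real attention computation.

```python
import torch

def select_key_clusters(q_centroids, k_centroids, k_cluster_sizes, keep_frac=0.2):
    """q_centroids: [Cq, D], k_centroids: [Ck, D], k_cluster_sizes: [Ck]. Returns kept cluster ids."""
    scores = (q_centroids @ k_centroids.T) / (q_centroids.shape[-1] ** 0.5)
    probs = scores.softmax(dim=-1) * k_cluster_sizes   # approximate mass each key cluster receives
    importance = probs.sum(dim=0)                      # aggregate over query clusters
    keep = max(1, int(keep_frac * k_centroids.shape[0]))
    return importance.topk(keep).indices               # clusters whose tokens get full attention

q_c, k_c = torch.randn(32, 64), torch.randn(32, 64)
sizes = torch.full((32,), 128.0)                       # tokens per key cluster
print(select_key_clusters(q_c, k_c, sizes))
```

Because only centroid-to-centroid scores are computed (32×32 here instead of 4096×4096 token pairs), the estimation cost is a tiny fraction of full attention, consistent with the "<1% overhead" figure above.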
@HaochengXiUCB
Haocheng Xi
1 month
The k-means clustering algorithm effectively identifies which tokens are most similar to each other. This algorithm is applied to the Query (Q) and Key (K) vectors along the token dimension, sorting them into clusters of semantically similar tokens based on their hidden
0
0
4
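A minimal sketch of that clustering step, assuming a plain few-iteration k-means over per-head token vectors (cluster counts, shapes, and the init scheme are illustrative; the released implementation may differ):

```python
import torch

def kmeans_tokens(x, num_clusters, iters=10):
    """x: [num_tokens, head_dim]. Returns (labels [N], centroids [C, D])."""
    centroids = x[torch.randperm(x.shape[0])[:num_clusters]].clone()  # random init from tokens
    for _ in range(iters):
        labels = torch.cdist(x, centroids).argmin(dim=-1)   # nearest-centroid assignment
        for c in range(num_clusters):
            members = x[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)           # recompute cluster means
    return labels, centroids

# Q and K are clustered separately along the token dimension, as described above.
q, k = torch.randn(4096, 64), torch.randn(4096, 64)
q_labels, q_centroids = kmeans_tokens(q, num_clusters=32)
k_labels, k_centroids = kmeans_tokens(k, num_clusters=32)
```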
@HaochengXiUCB
Haocheng Xi
1 month
Semantic-Aware Permutation directly solves these two main flaws by grouping tokens by their semantic meaning. This turns a scattered, inefficient problem into a dense one that is ideal for GPUs. Query and Key/Value adopt different permutations for better performance. (4/6)
0
0
6
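A sketch of what such a permutation can look like once cluster labels exist; the helper name and shapes are illustrative assumptions, not SVG2's code:

```python
import torch

def permute_by_cluster(x, labels):
    """x: [num_tokens, dim], labels: [num_tokens] cluster ids. Returns permuted x and the permutation."""
    perm = labels.argsort()        # tokens of the same cluster become contiguous in memory
    return x[perm], perm

q = torch.randn(4096, 64)
q_labels = torch.randint(0, 32, (4096,))           # stand-in cluster assignments for queries
q_perm, q_order = permute_by_cluster(q, q_labels)  # keys/values would use their own permutation
# After block-sparse attention on the permuted layout, outputs are restored with
# the inverse permutation q_order.argsort().
```

Grouping same-cluster tokens into contiguous blocks is what lets a block-sparse attention kernel skip whole tiles instead of individual scattered entries, which is the density the tweet refers to.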
@HaochengXiUCB
Haocheng Xi
1 month
Existing sparse methods aren't good enough for two main reasons: 1. Inaccurate identification – they cluster tokens by position instead of meaning, often grouping unrelated ones together. 2. Computation waste – critical tokens end up scattered in memory, forcing GPUs to calculate
0
0
5
@HaochengXiUCB
Haocheng Xi
1 month
Attention computation is naturally sparse. Only a small fraction of tokens (e.g., 13%), which we call "critical tokens", significantly impacts the final output. If we can compute only these critical parts, we can achieve a dramatic speedup without sacrificing much quality.
0
0
6
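To make the sparsity claim concrete, here is a small diagnostic sketch (not SVG2 code; shapes and the 13% figure are only illustrative): for each query, compute the full attention distribution and measure how much probability mass its top fraction of keys carries. On Q/K taken from a real video diffusion model this mass should sit far above the uniform baseline, which is exactly the slack SVG2 exploits.

```python
import torch

def attention_mass_in_topk(q, k, top_frac=0.13):
    """q, k: [num_tokens, head_dim] for one attention head."""
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)         # full attention logits [N, N]
    probs = scores.softmax(dim=-1)                     # per-query attention distribution
    topk = max(1, int(top_frac * k.shape[0]))          # e.g. the top ~13% of keys
    top_mass = probs.topk(topk, dim=-1).values.sum(dim=-1)
    return top_mass.mean().item()                      # average mass covered per query

q, k = torch.randn(4096, 64), torch.randn(4096, 64)    # stand-ins for real per-head Q/K
print(f"attention mass covered by top 13% of keys: {attention_mass_in_topk(q, k):.2f}")
```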