Haocheng Xi Profile
Haocheng Xi

@HaochengXiUCB

Followers: 739
Following: 831
Media: 29
Statuses: 107

First-year PhD in @berkeley_ai. Prev: Yao Class, @Tsinghua_Uni | Efficient Machine Learning & ML sys

Joined August 2024
@HaochengXiUCB
Haocheng Xi
1 month
🚀 Introducing Sparse VideoGen2 (SVG2) — Pareto-frontier video generation acceleration with semantic-aware sparse attention! 🏆 Spotlight paper accepted at #NeurIPS2025 ✅ Training-free & plug-and-play ✅ Up to 2.5× faster on HunyuanVideo, 1.9× faster on Wan 2.1 ✅ SOTA quality
16
59
261
@rish2k1
Rishabh Tiwari
23 days
There is so much noise in the LLM RL space, so we sat down and ran everything at scale (so you don't have to 😜) and present to you “The Art of Scaling RL”. Give this a read before starting your next RL run. Led by the amazing @Devvrit_Khatri @lovish
@Devvrit_Khatri
Devvrit
23 days
Wish to build scaling laws for RL but not sure how to scale? Or what scales? Or whether RL even scales predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs
3
20
222
@yukangchen_
Yukang Chen
23 days
The Convergence of “Understanding × Generation” in Long Video — Attention Sink ✨🎬🧠 We recently open-sourced two works related to long videos: long-video understanding StreamingVLM (https://t.co/o5MFULkjdR) and long-video generation LongLive (https://t.co/OAFQSlnlbg). Both
github.com
LongLive: Real-time Interactive Long Video Generation - NVlabs/LongLive
2
12
64
@yukangchen_
Yukang Chen
25 days
We open-sourced QeRL — Quantization-enhanced Reinforcement Learning! 🧠 4-bit quantized RL training 💪 Train a 32B LLM on a single H100 GPU ⚙️ 1.7× faster overall training 🎯 Accuracy on par with bfloat16 🔥 Supports NVFP4 quantization format Moreover, we show
11
68
352
@Guangxuan_Xiao
Guangxuan Xiao
25 days
Excited to share our new work: StreamingVLM! 🚀 We tackle a major challenge for Vision-Language Models (VLMs): understanding infinite video streams in real time without latency blowing up or running out of memory. Paper: https://t.co/G0bfwKCdZm Code: https://t.co/HqBoLMcrJF
31
163
1K
@zkwthu
Kaiwen Zheng
29 days
🚀Try out rCM—the most advanced diffusion distillation! ✅First to scale up sCM/MeanFlow to 10B+ video models ✅Open-sourced FlashAttention-2 JVP kernel & FSDP/CP support ✅High-quality, diverse videos in 2~4 steps Paper: https://t.co/xZZK25oIrJ Code: https://t.co/aPAo1MO0JQ
1
31
179
@Chenfeng_X
Chenfeng_X
1 month
🥳We’re releasing StreamDiffusionV2 for the live-stream community—from individual creators with one GPU to enterprise platforms with many. StreamDiffusionV2 is our follow-up to StreamDiffusion: #StreamDiffusion powered real products, but temporal consistency still bugged us.
12
45
223
@rohanpaul_ai
Rohan Paul
1 month
New @nvidia paper shows how to make text-to-image models render high-resolution images far faster without losing quality. 53x faster 4K on an H100, 3.5 seconds on a 5090 with quantization for a 138x total speedup. The speedup comes from moving generation into a smaller hidden image space.
11
73
420
@hancai_hm
Han Cai
1 month
We release DC-VideoGen, a new post-training framework for accelerating video diffusion models. Key features: 🎬 Supports video generation up to 2160×3840 (4K) resolution on a single H100 GPU ⚡ Delivers 14.8× faster inference than the base model while achieving comparable or
2
28
145
@hancai_hm
Han Cai
1 month
Changing the autoencoder in latent diffusion models is easier than you think. 🚀 Introducing DC-Gen – a post-training acceleration framework that works with any pre-trained diffusion model, boosting efficiency by transferring it into a deeply compressed latent space with
5
38
223
@xieenze_jr
Enze Xie
1 month
🚀 SANA-Video: Linear Attention + Constant-Memory KV Cache = Fast Long Videos 💥 Key Features 🌟 🧠 Linear DiT everywhere → O(N) complexity on video-scale tokens 🧰 Constant-memory Block KV cache → store cumulative states only (no growing KV) 🔄 🎯 Temporal Mix-FFN + 3D RoPE
3
18
119
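For context on the constant-memory claim: linear attention admits a recurrence that keeps a single running state instead of a growing key/value cache. The sketch below is the generic formulation of that recurrence (the feature map, shapes, and function names are assumptions for illustration), not SANA-Video's actual Block KV cache kernel.

```python
import torch

def phi(x):
    # positive feature map used by many linear-attention variants (assumed choice)
    return torch.nn.functional.elu(x) + 1

def linear_attention_step(q_t, k_t, v_t, S, z):
    """Process one token. q_t, k_t: [D], v_t: [Dv]; S: [D, Dv] running state; z: [D] normalizer."""
    S = S + torch.outer(phi(k_t), v_t)            # accumulate key-value statistics
    z = z + phi(k_t)                              # accumulate normalizer
    out = (phi(q_t) @ S) / (phi(q_t) @ z + 1e-6)  # output for this token
    return out, S, z

D, Dv = 64, 64
S, z = torch.zeros(D, Dv), torch.zeros(D)
for _ in range(1000):                             # arbitrarily long stream; memory stays O(D*Dv)
    q_t, k_t, v_t = torch.randn(D), torch.randn(D), torch.randn(Dv)
    out, S, z = linear_attention_step(q_t, k_t, v_t, S, z)
```

Because the state `S` has a fixed size, per-token cost stays constant regardless of video length, which is the property described above as "store cumulative states only (no growing KV)".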
@hancai_hm
Han Cai
1 month
🚀 Jet-Nemotron – Code & pre-trained checkpoints now available! ⚡️ Achieve up to 53.6× higher generation throughput on H100 GPUs with cost-efficient finetuning. 🔗 GitHub: https://t.co/XGX7MTMm7J 🔗 Hugging Face: https://t.co/AMEGIq5zOp 🔗 Paper:
arxiv.org
We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation...
4
39
174
@yukangchen_
Yukang Chen
1 month
🚀 We open-sourced LongLive — interactive, real-time long-video generation. 👥 Generates video in real time as users enter text prompts. ⚡️ 20.7 FPS on a single H100, ⏱️ up to 240s per clip. 🎬 Fine-tunes SOTA short-video models (e.g., Wan) into long-video generators. 🌍 One step
4
18
79
@HaochengXiUCB
Haocheng Xi
1 month
Efficient Kernels: Our algorithm works in a hardware-efficient manner. We further propose Centroids Cache, which reduces the overhead of k-means by exploiting redundancy between timesteps. Our dynamic attention kernel reaches the ideal performance on both Ampere and Hopper
0
0
4
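One way to picture the Centroids Cache idea, sketched under assumptions (the loop structure, shapes, and single-iteration warm start are illustrative, not the released kernel): carry the previous denoising step's centroids forward and refine them with one k-means pass instead of re-clustering from scratch.

```python
import torch

def kmeans_step(x, centroids):
    """One assignment + update pass of k-means. x: [N, D] tokens, centroids: [C, D]."""
    labels = torch.cdist(x, centroids).argmin(dim=-1)      # assign tokens to nearest centroid
    for c in range(centroids.shape[0]):
        members = x[labels == c]
        if len(members) > 0:
            centroids[c] = members.mean(dim=0)             # refine centroid
    return labels, centroids

cache = torch.randn(32, 64)                  # centroids carried over from the previous timestep
for t in range(50):                          # denoising loop (stand-in for the diffusion sampler)
    q_t = torch.randn(4096, 64)              # this step's per-head query tokens (stand-in)
    labels, cache = kmeans_step(q_t, cache)  # warm start: one pass instead of a cold restart
```

Since adjacent diffusion timesteps produce similar Q/K, a warm-started clustering needs far fewer passes, which is where the claimed reduction in k-means overhead comes from.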
@HaochengXiUCB
Haocheng Xi
1 month
Quality Results: We evaluate our method on HunyuanVideo-T2V-13B and Wan2.1-I2V/T2V-14B. SVG2 consistently achieves a superior trade-off between generation quality and efficiency, outperforming all baseline methods.
0
0
3
@HaochengXiUCB
Haocheng Xi
1 month
After creating semantic clusters, SVG2 approximates a cluster's importance by computing attention scores using only the cluster's centroid. This yields an accurate estimate of each cluster's importance with less than 1% overhead. With the estimated scores, SVG2 selects
0
0
4
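One plausible reading of that selection step, sketched with assumed names and ratios (`select_key_clusters` and the 20% keep fraction are illustrative, not SVG2's interface): score key clusters by centroid-level attention, weight by cluster size, and keep only the highest-scoring clusters for the real attention computation.

```python
import torch

def select_key_clusters(q_centroids, k_centroids, k_cluster_sizes, keep_frac=0.2):
    """q_centroids: [Cq, D], k_centroids: [Ck, D], k_cluster_sizes: [Ck]. Returns kept cluster ids."""
    scores = (q_centroids @ k_centroids.T) / (q_centroids.shape[-1] ** 0.5)
    probs = scores.softmax(dim=-1) * k_cluster_sizes   # approximate mass each key cluster receives
    importance = probs.sum(dim=0)                      # aggregate over query clusters
    keep = max(1, int(keep_frac * k_centroids.shape[0]))
    return importance.topk(keep).indices               # clusters whose tokens get full attention

q_c, k_c = torch.randn(32, 64), torch.randn(32, 64)
sizes = torch.full((32,), 128.0)                       # tokens per key cluster
print(select_key_clusters(q_c, k_c, sizes))
```

Because only centroid-to-centroid scores are computed (32×32 here instead of 4096×4096 token pairs), the estimation cost is a tiny fraction of full attention, consistent with the "<1% overhead" figure above.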
@HaochengXiUCB
Haocheng Xi
1 month
The k-means clustering algorithm effectively identifies which tokens are most similar to each other. This algorithm is applied to the Query (Q) and Key (K) vectors along the token dimension, sorting them into clusters of semantically similar tokens based on their hidden
0
0
4
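A minimal sketch of that clustering step, assuming a plain few-iteration k-means over per-head token vectors (cluster counts, shapes, and the init scheme are illustrative; the released implementation may differ):

```python
import torch

def kmeans_tokens(x, num_clusters, iters=10):
    """x: [num_tokens, head_dim]. Returns (labels [N], centroids [C, D])."""
    centroids = x[torch.randperm(x.shape[0])[:num_clusters]].clone()  # random init from tokens
    for _ in range(iters):
        labels = torch.cdist(x, centroids).argmin(dim=-1)   # nearest-centroid assignment
        for c in range(num_clusters):
            members = x[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)           # recompute cluster means
    return labels, centroids

# Q and K are clustered separately along the token dimension, as described above.
q, k = torch.randn(4096, 64), torch.randn(4096, 64)
q_labels, q_centroids = kmeans_tokens(q, num_clusters=32)
k_labels, k_centroids = kmeans_tokens(k, num_clusters=32)
```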
@HaochengXiUCB
Haocheng Xi
1 month
Semantic-Aware Permutation directly solves these two main flaws by grouping tokens by their semantic meaning. This turns a scattered, inefficient problem into a dense one that is ideal for GPUs. Query and Key/Value adopt different permutations for better performance. (4/6)
0
0
6
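A sketch of what such a permutation can look like once cluster labels exist; the helper name and shapes are illustrative assumptions, not SVG2's code:

```python
import torch

def permute_by_cluster(x, labels):
    """x: [num_tokens, dim], labels: [num_tokens] cluster ids. Returns permuted x and the permutation."""
    perm = labels.argsort()        # tokens of the same cluster become contiguous in memory
    return x[perm], perm

q = torch.randn(4096, 64)
q_labels = torch.randint(0, 32, (4096,))           # stand-in cluster assignments for queries
q_perm, q_order = permute_by_cluster(q, q_labels)  # keys/values would use their own permutation
# After block-sparse attention on the permuted layout, outputs are restored with
# the inverse permutation q_order.argsort().
```

Grouping same-cluster tokens into contiguous blocks is what lets a block-sparse attention kernel skip whole tiles instead of individual scattered entries, which is the density the tweet refers to.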
@HaochengXiUCB
Haocheng Xi
1 month
Existing sparse methods aren't good enough for two main reasons: 1. Inaccurate identification – they cluster tokens by position instead of meaning, often grouping unrelated ones together. 2. Computation waste – critical tokens end up scattered in memory, forcing GPUs to calculate
0
0
5
@HaochengXiUCB
Haocheng Xi
1 month
Attention computation is naturally sparse. Only a small fraction of tokens (e.g., 13%), which we call "critical tokens", significantly impacts the final output. If we can compute only these critical parts, we can achieve a dramatic speedup without sacrificing much quality.
0
0
6
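To make the sparsity claim concrete, here is a small diagnostic sketch (not SVG2 code; shapes and the 13% figure are only illustrative): for each query, compute the full attention distribution and measure how much probability mass its top fraction of keys carries. On Q/K taken from a real video diffusion model this mass should sit far above the uniform baseline, which is exactly the slack SVG2 exploits.

```python
import torch

def attention_mass_in_topk(q, k, top_frac=0.13):
    """q, k: [num_tokens, head_dim] for one attention head."""
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)         # full attention logits [N, N]
    probs = scores.softmax(dim=-1)                     # per-query attention distribution
    topk = max(1, int(top_frac * k.shape[0]))          # e.g. the top ~13% of keys
    top_mass = probs.topk(topk, dim=-1).values.sum(dim=-1)
    return top_mass.mean().item()                      # average mass covered per query

q, k = torch.randn(4096, 64), torch.randn(4096, 64)    # stand-ins for real per-head Q/K
print(f"attention mass covered by top 13% of keys: {attention_mass_in_topk(q, k):.2f}")
```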