Song Han
@songhan_mit
Followers
9K
Following
155
Media
62
Statuses
279
Joined March 2019
Kicking off the journey to NeurIPS! Our group's papers focus on sparse attention, efficient video generation, small LLMs, and long-video understanding. We push efficiency to the limit and squeeze every last drop of potential out of GPUs.
Come check out our Spotlight Poster @Neurips 2025!
Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
Exhibit Hall C,D,E, #3508
Fri, Dec 5 | 4:30–7:30 PM PST
Sparse VideoGen2 boosts video generation efficiency
arxiv.org
Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse...
Introducing Sparse VideoGen2 (SVG2) – Pareto-frontier video generation acceleration with semantic-aware sparse attention! Spotlight paper accepted by #NeurIPS2025
- Training-free & plug-and-play
- Up to 2.5× faster on HunyuanVideo, 1.9× faster on Wan 2.1
- SOTA quality
We (@lawrence_cjs, @yuyangzhao_, @shanasaimoe) from the SANA team just posted a blog on the core of Linear Attention: how it achieves infinite context lengths with global awareness but constant memory usage! We explore state accumulation mechanics, the evolution from Softmax to
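For readers who want the state-accumulation idea in code, here is a minimal sketch of causal linear attention computed as a running state. The feature map (elu + 1) and all names are illustrative choices, not taken from the SANA blog or codebase.

```python
# Minimal sketch of the linear-attention recurrence discussed above.
import torch

def linear_attention_stream(q, k, v):
    """Causal linear attention as a running state.

    q, k, v: (seq_len, dim). Memory is O(dim^2) regardless of sequence
    length: only the cumulative state S and normalizer z are stored,
    never a growing KV cache.
    """
    phi = lambda x: torch.nn.functional.elu(x) + 1.0  # positive feature map
    q, k = phi(q), phi(k)
    dim = q.shape[-1]
    S = torch.zeros(dim, dim)   # cumulative sum of k_i v_i^T
    z = torch.zeros(dim)        # cumulative sum of k_i
    out = []
    for qi, ki, vi in zip(q, k, v):
        S = S + torch.outer(ki, vi)          # state accumulation
        z = z + ki
        out.append((qi @ S) / (qi @ z + 1e-6))
    return torch.stack(out)

y = linear_attention_stream(torch.randn(128, 64), torch.randn(128, 64), torch.randn(128, 64))
print(y.shape)  # torch.Size([128, 64])
```

Contrast with softmax attention, which must retain every past key and value, so its memory grows linearly with context length.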
SANA-Video inference code has been integrated into diffusers! Thanks to @lawrence_cjs @RisingSayak and the team for making it happen.
huggingface.co
SANA-Video is open-sourced:
github.com
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer - NVlabs/Sana
SANA-Video: Linear Attention + Constant-Memory KV Cache = Fast Long Videos
Key Features:
- Linear DiT everywhere – O(N) complexity on video-scale tokens
- Constant-memory block KV cache – store cumulative states only (no growing KV)
- Temporal Mix-FFN + 3D RoPE
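A hedged sketch of what a constant-memory block KV cache can look like, in the spirit of the bullet above: each finished block is folded into a single cumulative linear-attention state, so memory stays flat no matter how many blocks have been generated. The class and method names are illustrative, not the SANA-Video API.

```python
# Constant-memory block "KV cache": blocks fold into one cumulative state.
import torch

class BlockState:
    def __init__(self, dim):
        self.S = torch.zeros(dim, dim)  # running sum of phi(k) v^T
        self.z = torch.zeros(dim)       # running sum of phi(k)

    def fold_block(self, k, v):
        """Absorb a finished block's keys/values into the cumulative state."""
        phi_k = torch.nn.functional.elu(k) + 1.0
        self.S += phi_k.T @ v           # (dim, dim), independent of block count
        self.z += phi_k.sum(dim=0)

    def attend(self, q):
        """Attend from a new block's queries to everything generated so far."""
        phi_q = torch.nn.functional.elu(q) + 1.0
        return (phi_q @ self.S) / ((phi_q @ self.z + 1e-6).unsqueeze(-1))

state = BlockState(dim=64)
for _ in range(10):                     # 10 blocks, memory never grows
    ctx = state.attend(torch.randn(256, 64))
    state.fold_block(torch.randn(256, 64), torch.randn(256, 64))
print(ctx.shape)  # torch.Size([256, 64])
```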
At Open Source AI Week we can't wait to learn how the community is using #opensource projects to redefine how AI is developed, scaled, and shared across text, image, audio, video, and multimodal tasks. To help accelerate innovation, we are now a top contributor on
Thanks AK for sharing. Code is available at
github.com
QeRL enables RL for 32B LLMs on a single H100 GPU. - NVlabs/QeRL
Explore the critical role of the "attention sink" for both understanding and generation:
The Convergence of "Understanding × Generation" in Long Video – Attention Sink. We recently open-sourced two works related to long videos: long-video understanding StreamingVLM (https://t.co/o5MFULkjdR) and long-video generation LongLive (https://t.co/OAFQSlnlbg). Both
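As a minimal, concrete illustration of the attention-sink idea (in the style of StreamingLLM-type eviction): when the KV cache overflows, keep the first few "sink" tokens plus a recent window instead of evicting strictly oldest-first. Function and parameter names below are illustrative.

```python
# Minimal attention-sink KV eviction sketch.
def evict_kv(cache, num_sink=4, window=1024):
    """cache: list of per-token KV entries, oldest first.

    Dropping the earliest tokens entirely destabilizes generation because
    softmax attention dumps excess probability mass onto them; retaining
    them as 'sinks' keeps streaming inference stable with bounded memory.
    """
    if len(cache) <= num_sink + window:
        return cache
    return cache[:num_sink] + cache[-window:]

cache = list(range(5000))          # stand-in for 5000 cached KV pairs
cache = evict_kv(cache)
print(len(cache), cache[:4], cache[-3:])  # 1028 [0, 1, 2, 3] [4997, 4998, 4999]
```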
An interesting effect of 4-bit RL is that the quantization noise helps exploration and increases the training reward:
We open-sourced QeRL – Quantization-enhanced Reinforcement Learning!
- 4-bit quantized RL training
- Train a 32B LLM on a single H100 GPU
- 1.7× faster overall training
- Accuracy on par with bfloat16
- Supports the NVFP4 quantization format
Moreover, we show
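To make the exploration claim concrete, here is a toy, not the QeRL pipeline: zero-mean noise on a confident policy's logits, standing in for 4-bit quantization error, flattens the effective sampling distribution and raises its entropy, which means more exploration. Scales and shapes are made up.

```python
# Toy illustration: logit noise (a stand-in for quantization error)
# flattens the effective policy, i.e. raises entropy and exploration.
import torch

def entropy(p):
    return -(p * p.clamp_min(1e-12).log()).sum(-1)

torch.manual_seed(0)
logits = torch.tensor([4.0, 1.0, 0.5, 0.2])               # a confident policy
clean = torch.softmax(logits, -1)
noisy = torch.softmax(logits + torch.randn(10_000, 4), -1)  # perturbed policies

print(f"clean policy entropy:              {entropy(clean).item():.3f}")
print(f"effective policy entropy w/ noise: {entropy(noisy.mean(0)).item():.3f}")
```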
Explore StreamingVLM for understanding infinite video streams:
Excited to share our new work: StreamingVLM! We tackle a major challenge for Vision-Language Models (VLMs): understanding infinite video streams in real time without latency blowing up or running out of memory. Paper: https://t.co/G0bfwKCdZm Code: https://t.co/HqBoLMcrJF
Fast-dLLM v2 7B is 3× faster than Qwen2.5-7B, achieving the same performance! Report is available: https://t.co/7FjkYXKLYe
Fast-dLLM v2: Parallel Block-Diffusion Decoding for LLMs
Highlights:
- Blockwise bidirectional context via complementary masks
- Hierarchical caches (block + sub-block)
- Parallel sub-block decoding + token-shift training
Results:
- ~2.5× faster vs. standard AR
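For intuition about parallel block decoding, a toy confidence-based unmasking loop is sketched below: all masked positions in a block are predicted at once, and the most confident predictions are committed each iteration. The stand-in model, block size, and schedule are made up and do not reflect Fast-dLLM v2's actual architecture or caches.

```python
# Toy parallel block decoding via iterative confidence-based unmasking.
import torch

def decode_block(model, block_len=8, steps=4, vocab=100):
    tokens = torch.full((block_len,), -1)             # -1 marks [MASK]
    for _ in range(steps):
        masked = (tokens == -1).nonzero().flatten()
        if masked.numel() == 0:
            break
        logits = model(tokens)                        # (block_len, vocab), one parallel pass
        conf, pred = logits.softmax(-1).max(-1)
        k = max(1, masked.numel() // 2)               # commit top half by confidence
        keep = masked[conf[masked].topk(k).indices]
        tokens[keep] = pred[keep]
    return tokens

toy_model = lambda t: torch.randn(t.shape[0], 100)    # random stand-in model
print(decode_block(toy_model))
```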
Explore the Deep Compression Video Autoencoder for fast training and inference in video generation:
We release DC-VideoGen, a new post-training framework for accelerating video diffusion models. Key features:
- Supports video generation up to 2160×3840 (4K) resolution on a single H100 GPU
- Delivers 14.8× faster inference than the base model while achieving comparable or
Jet-Nemotron – code & pre-trained checkpoints now available! Achieve up to 53.6× higher generation throughput on H100 GPUs with cost-efficient finetuning.
GitHub: https://t.co/XGX7MTMm7J
Hugging Face: https://t.co/AMEGIq5zOp
Paper:
huggingface.co
Explore Deep Compression Generation (DC-Gen), which compresses the number of latent tokens and accelerates FLUX by 53×:
Changing the autoencoder in latent diffusion models is easier than you think. Introducing DC-Gen – a post-training acceleration framework that works with any pre-trained diffusion model, boosting efficiency by transferring it into a deeply compressed latent space with
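Rough arithmetic on why a deeper-compression latent space pays off (the numbers below are illustrative, not DC-Gen's exact configuration): attention cost is quadratic in token count, so shrinking the latent grid by a factor r per spatial dimension cuts tokens by r^2 and attention FLOPs by roughly r^4.

```python
# Back-of-envelope token/FLOPs arithmetic for deeper latent compression.
def latent_tokens(h, w, f, patch=1):
    """Tokens for an h x w image with autoencoder downsample factor f."""
    return (h // (f * patch)) * (w // (f * patch))

base = latent_tokens(1024, 1024, f=8)    # typical f8 autoencoder
deep = latent_tokens(1024, 1024, f=32)   # deeply compressed f32 latent space
print(base, deep, base / deep)           # 16384 1024 16.0 -> 16x fewer tokens
print((base / deep) ** 2)                # ~256x lower attention FLOPs
```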
Explore our new work, SANA-Video, which generates videos at low cost:
SANA-Video: Linear Attention + Constant-Memory KV Cache = Fast Long Videos
Key Features:
- Linear DiT everywhere – O(N) complexity on video-scale tokens
- Constant-memory block KV cache – store cumulative states only (no growing KV)
- Temporal Mix-FFN + 3D RoPE
Explore LongLive for interactive, real-time long-video generation:
We open-sourced LongLive – interactive, real-time long-video generation.
- Generates video in real time as users enter text prompts.
- 20.7 FPS on a single H100, up to 240 s per clip.
- Fine-tunes SOTA short-video models (e.g., Wan) into long-video generators.
- One step
Explore our second iteration of Sparse VideoGen (a toy sketch of the idea follows the quoted post below): we don't need full attention, only sparse attention. Unlike v1, where we applied a rule-based (spatial and temporal) sparsity pattern, in v2 we apply k-means to cluster similar tokens together, formulate block-sparsity patterns, then
Introducing Sparse VideoGen2 (SVG2) – Pareto-frontier video generation acceleration with semantic-aware sparse attention! Spotlight paper accepted by #NeurIPS2025
- Training-free & plug-and-play
- Up to 2.5× faster on HunyuanVideo, 1.9× faster on Wan 2.1
- SOTA quality
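Here is the promised toy sketch of the v2 pipeline: cluster token features with k-means, permute tokens so each cluster is contiguous, run dense attention only within cluster blocks, then undo the permutation. This is an illustrative reconstruction of the idea, not the SVG2 kernel.

```python
# Semantic-aware permutation sketch: k-means -> permute -> block attention.
import torch

def kmeans(x, k=4, iters=10):
    centers = x[torch.randperm(x.shape[0])[:k]]
    for _ in range(iters):
        assign = torch.cdist(x, centers).argmin(-1)
        for c in range(k):
            if (assign == c).any():
                centers[c] = x[assign == c].mean(0)
    return assign

def semantic_permuted_attention(q, keys, v, clusters=4):
    assign = kmeans(keys, k=clusters)
    perm = assign.argsort()                    # similar tokens become contiguous
    inv = perm.argsort()                       # inverse permutation
    q, keys, v, a = q[perm], keys[perm], v[perm], assign[perm]
    out = torch.zeros_like(v)
    for c in range(clusters):                  # dense attention per cluster block
        m = a == c
        attn = torch.softmax(q[m] @ keys[m].T / q.shape[-1] ** 0.5, -1)
        out[m] = attn @ v[m]
    return out[inv]                            # undo the permutation

x = torch.randn(256, 64)
print(semantic_permuted_attention(x, x, x).shape)  # torch.Size([256, 64])
```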