Song Han
@songhan_mit
9K Followers · 154 Following · 61 Media · 274 Statuses
Joined March 2019
SANA-Video is open sourced:
github.com
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer - NVlabs/Sana
SANA-Video: Linear Attention + Constant-Memory KV Cache = Fast Long Videos
Key Features:
- Linear DiT everywhere → O(N) complexity on video-scale tokens
- Constant-memory Block KV cache → store cumulative states only (no growing KV)
- Temporal Mix-FFN + 3D RoPE
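As a rough illustration of the "constant-memory" point (not the released SANA-Video code; the function name, feature map, and shapes below are assumptions), linear attention admits a recurrent form that keeps only a cumulative state instead of a KV cache that grows with sequence length:

```python
import torch

def linear_attention_streaming(q, k, v, eps=1e-6):
    # q, k, v: (T, d). Only the cumulative state S (d x d) and the
    # normalizer z (d,) are stored, so memory stays constant in T,
    # unlike a softmax KV cache that grows with the sequence.
    T, d = q.shape
    phi = lambda x: torch.nn.functional.elu(x) + 1.0  # positive feature map
    S = torch.zeros(d, d)
    z = torch.zeros(d)
    out = torch.empty_like(v)
    for t in range(T):
        kt, vt, qt = phi(k[t]), v[t], phi(q[t])
        S = S + torch.outer(kt, vt)   # accumulate phi(k_t) v_t^T
        z = z + kt                    # accumulate phi(k_t)
        out[t] = (qt @ S) / (qt @ z + eps)
    return out

x = torch.randn(16, 8)
print(linear_attention_streaming(x, x, x).shape)  # torch.Size([16, 8])
```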
At Open Source AI Week we can't wait to learn how the community is using #opensource projects to redefine how AI is developed, scaled, and shared across text, image, audio, video, and multimodal tasks. To help accelerate innovation, we are now a top contributor on
Thanks AK for sharing. Code is available at
github.com
QeRL enables RL for 32B LLMs on a single H100 GPU. - NVlabs/QeRL
Explore the critical role of the "attention sink" for both understanding and generation:
The Convergence of "Understanding × Generation" in Long Video: Attention Sink. We recently open-sourced two works related to long videos: long-video understanding StreamingVLM (https://t.co/o5MFULkjdR) and long-video generation LongLive (https://t.co/OAFQSlnlbg). Both
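For readers unfamiliar with the attention-sink idea, here is a minimal sketch (not the StreamingVLM or LongLive code; `evict`, its arguments, and the shapes are assumptions) of a bounded KV cache that always keeps a few initial "sink" tokens plus a recent window:

```python
import torch

def evict(keys, values, num_sink=4, window=512):
    # keys, values: (T, d). Keep the first `num_sink` tokens (the attention
    # sink) plus the most recent `window` tokens, so the cache stays bounded
    # no matter how long the stream runs.
    T = keys.shape[0]
    if T <= num_sink + window:
        return keys, values
    keep = torch.cat([torch.arange(num_sink), torch.arange(T - window, T)])
    return keys[keep], values[keep]

k = torch.randn(10_000, 64)
v = torch.randn(10_000, 64)
k2, v2 = evict(k, v)
print(k2.shape)  # torch.Size([516, 64]) regardless of stream length
```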
An interesting effect of 4-bit RL is that the quantization noise helps exploration and increases the training reward:
We open-sourced QeRL: Quantization-enhanced Reinforcement Learning!
- 4-bit quantized RL training
- Train a 32B LLM on a single H100 GPU
- 1.7× faster overall training
- Accuracy on par with bfloat16
- Supports the NVFP4 quantization format
Moreover, we show
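A toy sketch of the underlying effect (not the QeRL implementation; the per-tensor 4-bit rounding below is a simplification of low-bit formats such as NVFP4): fake-quantizing the weights perturbs the policy's logits, and that perturbation can act like exploration noise:

```python
import torch

def fake_quant_4bit(w):
    # Symmetric per-tensor 4-bit fake quantization (illustrative only).
    scale = w.abs().max() / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q * scale

torch.manual_seed(0)
w = torch.randn(256, 256)            # toy full-precision "policy" weights
w_q = fake_quant_4bit(w)             # what the 4-bit policy actually uses
x = torch.randn(4, 256)
perturbation = x @ w_q.T - x @ w.T   # logit shift caused by quantization noise
print("mean |logit perturbation|:", perturbation.abs().mean().item())
```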
Explore StreamingVLM for understanding infinite video streams:
Excited to share our new work: StreamingVLM! We tackle a major challenge for Vision-Language Models (VLMs): understanding infinite video streams in real time without latency blowing up or running out of memory. Paper: https://t.co/G0bfwKCdZm Code: https://t.co/HqBoLMcrJF
Fast-dLLM v2 7B is 3× faster than Qwen2.5-7B while achieving the same performance! Report: https://t.co/7FjkYXKLYe
Fast-dLLM v2: Parallel Block-Diffusion Decoding for LLMs
Highlights:
- Blockwise bidirectional context via complementary masks
- Hierarchical caches (block + sub-block)
- Parallel sub-block decoding + token-shift training
Results:
- ~2.5× faster vs. standard AR
Explore the Deep Compression Video Autoencoder for fast training and inference in video generation:
We release DC-VideoGen, a new post-training framework for accelerating video diffusion models. Key features:
- Supports video generation up to 2160×3840 (4K) resolution on a single H100 GPU
- Delivers 14.8× faster inference than the base model while achieving comparable or
Jet-Nemotron: code & pre-trained checkpoints now available! Achieve up to 53.6× higher generation throughput on H100 GPUs with cost-efficient finetuning. GitHub: https://t.co/XGX7MTMm7J Hugging Face: https://t.co/AMEGIq5zOp Paper:
arxiv.org
We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation...
Explore Deep Compression Generation (DC-Gen), which compresses the number of tokens and accelerates FLUX by 53×:
Changing the autoencoder in latent diffusion models is easier than you think. Introducing DC-Gen, a post-training acceleration framework that works with any pre-trained diffusion model, boosting efficiency by transferring it into a deeply compressed latent space with
Explore our new work, SANA-Video, which generates videos at low cost:
SANA-Video: Linear Attention + Constant-Memory KV Cache = Fast Long Videos
Key Features:
- Linear DiT everywhere → O(N) complexity on video-scale tokens
- Constant-memory Block KV cache → store cumulative states only (no growing KV)
- Temporal Mix-FFN + 3D RoPE
Explore LongLive for interactive, real-time long-video generation:
We open-sourced LongLive: interactive, real-time long-video generation.
- Generates video in real time as users enter text prompts.
- 20.7 FPS on a single H100, up to 240s per clip.
- Fine-tunes SOTA short-video models (e.g., Wan) into long-video generators.
One step
Explore our second iteration of Sparse VideoGen: we don't need full attention, only sparse attention. Unlike v1, where we applied rule-based (spatial and temporal) sparsity patterns, in v2 we apply k-means to cluster similar tokens together, form block sparsity patterns, then
Introducing Sparse VideoGen2 (SVG2): Pareto-frontier video generation acceleration with semantic-aware sparse attention! Spotlight paper accepted at #NeurIPS2025.
- Training-free & plug-and-play
- Up to 2.5× faster on HunyuanVideo, 1.9× faster on Wan 2.1
- SOTA quality
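A rough sketch of the semantic-aware sparsity idea (not the SVG2 kernels; the function names, cluster counts, and shapes below are assumptions): cluster the key tokens with k-means, then let each query attend only to the keys in its most similar clusters:

```python
import torch

def kmeans(x, k, iters=10):
    # Plain k-means on token features; returns assignments and centroids.
    centroids = x[torch.randperm(x.shape[0])[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(x, centroids).argmin(dim=1)
        for c in range(k):
            members = x[assign == c]
            if members.shape[0] > 0:
                centroids[c] = members.mean(dim=0)
    return assign, centroids

def sparse_attention_mask(q, keys, num_clusters=8, top_clusters=2):
    # Boolean (Tq, Tk) mask; True = attend. Each query attends only to keys
    # belonging to the clusters whose centroids it is most similar to.
    assign, centroids = kmeans(keys, num_clusters)
    sim = q @ centroids.T                           # (Tq, num_clusters)
    chosen = sim.topk(top_clusters, dim=1).indices  # (Tq, top_clusters)
    return (assign[None, :, None] == chosen[:, None, :]).any(dim=-1)

q, keys = torch.randn(64, 32), torch.randn(256, 32)
mask = sparse_attention_mask(q, keys)
print(mask.shape, mask.float().mean().item())  # density well below 1.0
```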
Explore our second iteration of fast diffusion LLM research:
- Between blocks: autoregressive.
- Within a block: parallel decoding with a KV cache.
- A post-training technique converts an LLM into a dLLM.
- Same accuracy, but faster.
Fast-dLLM v2: Parallel Block-Diffusion Decoding for LLMs
Highlights:
- Blockwise bidirectional context via complementary masks
- Hierarchical caches (block + sub-block)
- Parallel sub-block decoding + token-shift training
Results:
- ~2.5× faster vs. standard AR
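A simplified sketch of this decoding pattern (not the Fast-dLLM v2 code; the `model` interface, block size, and refinement schedule are assumptions): autoregressive across blocks, with a few parallel refinement passes over all tokens inside each block:

```python
import torch

def decode(model, prompt_ids, num_blocks=4, block_size=8, refine_steps=2,
           mask_id=0):
    # Autoregressive across blocks; inside each block, every token is
    # (re)predicted in parallel for a few refinement steps.
    seq = prompt_ids
    for _ in range(num_blocks):
        block = torch.full((block_size,), mask_id, dtype=torch.long)
        for _ in range(refine_steps):
            logits = model(torch.cat([seq, block]))      # (len, vocab)
            block = logits[-block_size:].argmax(dim=-1)  # parallel update
        seq = torch.cat([seq, block])                    # commit the block
    return seq

# Toy stand-in "model": random logits over a 100-token vocabulary.
toy_model = lambda ids: torch.randn(ids.shape[0], 100)
print(decode(toy_model, torch.tensor([1, 2, 3])).shape)  # torch.Size([35])
```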
Explore Deep Compression Autoencoder (DC-AE) 1.5, with a higher token compression ratio (64×) for faster visual generation:
Excited to announce DC-AE 1.5! With a spatial compression ratio boosted to f64, it accelerates high-res diffusion models while preserving text-to-image quality. Key innovation: channel-wise latent structure for faster convergence with many latent channels. Catch us at
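Back-of-the-envelope arithmetic on why a higher spatial compression ratio matters (illustrative only; the 1024×1024 resolution and patch size are assumptions): the latent token count shrinks quadratically with the downsampling factor:

```python
def latent_tokens(resolution, downsample, patch=1):
    # Tokens per image after spatial downsampling by `downsample`
    # and optional patchification by `patch`.
    side = resolution // (downsample * patch)
    return side * side

for f in (8, 32, 64):
    print(f"f{f}: {latent_tokens(1024, f)} latent tokens for a 1024x1024 image")
# f8: 16384, f32: 1024, f64: 256 -> fewer tokens means faster diffusion
```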
Explore Jet-Nemotron, a small and fast LLM:
Developing new LLM architectures is both costly and risky. Our latest project (https://t.co/WhOcpWBozX) offers an effective strategy to address this challenge. Our first result is Jet-Nemotron, a new family of hybrid-architecture language models that outperform state-of-the-art