Han Cai
@hancai_hm
507 Followers · 19 Following · 6 Media · 17 Statuses
Research Scientist, NVIDIA
Cambridge, USA
Joined December 2019
We release DC-VideoGen, a new post-training framework for accelerating video diffusion models. Key features:
- Supports video generation at up to 2160×3840 (4K) resolution on a single H100 GPU
- Delivers 14.8× faster inference than the base model while achieving comparable or …
Changing the autoencoder in latent diffusion models is easier than you think. Introducing DC-Gen, a post-training acceleration framework that works with any pre-trained diffusion model, boosting efficiency by transferring it into a deeply compressed latent space with …
Decoding is often the speed bottleneck in few-step latent diffusion models. Meet DC-AE-Lite:
- 1.8× faster decoding than DC-AE
- Similar reconstruction quality
Code: https://t.co/c4HhcdhZVV
Pre-trained model: https://t.co/0ktlnfQ7uG
Contributors: Dongyun Zou, …
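To see why decoding dominates in the few-step regime, here is a back-of-the-envelope sketch; the per-step and decode latencies below are made-up placeholders, not measured numbers:

```python
# Illustrative latency budget for a latent diffusion sampler.
# All numbers are placeholders, not measurements.
denoise_step_ms = 25.0   # one denoising step in latent space
decode_ms = 120.0        # one decoder pass from latents to pixels

for n_steps in (1, 4, 50):
    total = n_steps * denoise_step_ms + decode_ms
    print(f"{n_steps:>2} steps: decode is {decode_ms / total:.0%} of {total:.0f} ms total")
# At 50 steps the decoder is a small slice of the budget; at 1-4 steps it
# dominates, which is why a 1.8x faster decoder pays off in few-step models.
```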
Jet-Nemotron: code & pre-trained checkpoints now available! Achieve up to 53.6× higher generation throughput on H100 GPUs with cost-efficient finetuning.
GitHub: https://t.co/XGX7MTMm7J
Hugging Face: https://t.co/AMEGIq5zOp
Paper: arxiv.org
We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation...
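A minimal loading sketch with Hugging Face transformers, assuming the checkpoints follow the usual AutoModel pattern; the repo id below is a placeholder, so check the Hugging Face link above for the real one:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "nvidia/Jet-Nemotron"  # placeholder id; see the Hugging Face link in the post
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, device_map="auto")

inputs = tok("Hybrid-attention models are fast because", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```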
Introducing Sparse VideoGen2 (SVG2): Pareto-frontier video generation acceleration with semantic-aware sparse attention! Spotlight paper accepted at #NeurIPS2025.
- Training-free & plug-and-play
- Up to 2.5× faster on HunyuanVideo, 1.9× faster on Wan 2.1
- SOTA quality
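As rough intuition for what "semantic-aware sparse attention" can mean (a toy sketch of the general idea, not SVG2's actual implementation): group keys into semantic clusters, then let each query attend only to its most relevant clusters.

```python
import torch

def semantic_sparse_attention(q, k, v, n_clusters=8, top_clusters=2):
    """Toy sketch: each query attends only to keys in its top-scoring clusters."""
    # Stand-in for k-means: use randomly chosen keys as cluster centroids.
    centroids = k[torch.randperm(k.size(0))[:n_clusters]]
    assign = torch.cdist(k, centroids).argmin(dim=1)             # cluster id per key
    # Rank clusters per query by centroid similarity; keep the top few.
    keep = (q @ centroids.T).topk(top_clusters, dim=1).indices   # (N_q, top)
    mask = (assign[None, :, None] == keep[:, None, :]).any(-1)   # (N_q, N_k)
    attn = (q @ k.T) / k.size(-1) ** 0.5
    attn = attn.masked_fill(~mask, float("-inf")).softmax(dim=-1)
    return attn @ v

q, k, v = (torch.randn(16, 64) for _ in range(3))
print(semantic_sparse_attention(q, k, v).shape)  # torch.Size([16, 64])
```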
Excited to announce DC-AE 1.5! With the spatial compression ratio boosted to f64, it accelerates high-res diffusion models while preserving text-to-image quality. Key innovation: a channel-wise latent structure for faster convergence with many latent channels. Catch us at …
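To make the f64 figure concrete, here is the latent-shape arithmetic; the channel counts are illustrative, chosen only to show how higher spatial compression shifts capacity into channels:

```python
# Latent shape for a 2048x2048 image at different compression settings.
# Channel counts are illustrative: higher f typically needs more latent
# channels, which is where a channel-wise latent structure helps convergence.
H = W = 2048
for f, c in ((8, 4), (32, 32), (64, 128)):
    h, w = H // f, W // f
    print(f"f{f}c{c}: latent {c}x{h}x{w} -> {h * w} spatial positions")
# f8c4:    4x256x256  -> 65536 positions
# f32c32:  32x64x64   -> 4096 positions
# f64c128: 128x32x32  -> 1024 positions (64x fewer than f8)
```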
Developing new LLM architectures is both costly and risky. Our latest project, https://t.co/WhOcpWBozX, offers an effective strategy to address this challenge. Our first result is Jet-Nemotron, a new family of hybrid-architecture language models that outperform state-of-the-art …
We just dropped a few new PS3 models with SOTA performance compared to existing vision encoders such as SigLIP2, C-RADIOv2, AIMv2, InternViT2.5, and Perception Encoder! They come along with several new VILA-HD models. Check them out!
Models: https://t.co/UwjpBWpFBj
Code: …
A new kind of efficiency challenge: "Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs." We explore a new frontier: what if the reward doesn't come from being right, but from being fast and right?
Paper: https://t.co/sxozRPpHJA …
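One simple way to formalize "fast and right" (my own toy formulation, not necessarily the paper's reward): pay out only for correct answers, discounted by latency.

```python
import math

def latency_sensitive_reward(correct: bool, latency_s: float, half_life_s: float = 10.0) -> float:
    """Toy reward: correctness-gated, exponentially discounted by response time."""
    if not correct:
        return 0.0  # being fast but wrong earns nothing
    return math.exp(-math.log(2) * latency_s / half_life_s)

print(latency_sensitive_reward(True, 0.0))    # 1.0
print(latency_sensitive_reward(True, 10.0))   # 0.5 (one half-life elapsed)
print(latency_sensitive_reward(False, 1.0))   # 0.0
```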
Deep compression in VAEs is hard, but this paper beautifully explains how to achieve it.
SANA 1.5 Update: Inference Scaling Now Open-Source!
Breakthrough on the GenEval benchmark:
- SANA 1.5 + Inference Scaling: 0.81 → 0.96
- SD 1.5 + Inference Scaling: 0.42 → 0.87
The secret sauce:
1. Generate n candidates
2. Pick top k with NVILA …
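Steps 1-2 are essentially best-of-n sampling with a verifier. A minimal sketch, where generate() and score() are stand-ins for the SANA generator and the NVILA judge:

```python
def best_of_n(prompt, generate, score, n=16, k=4):
    """Generate n candidates, return the k the verifier scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:k]

# Toy stand-ins: a "generator" emitting random floats, a "verifier" preferring large ones.
import random
top = best_of_n("a cat", lambda p: random.random(), lambda x: x, n=8, k=2)
print(top)
```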
Next-gen vision pre-trained models shouldn't be short-sighted. Humans can easily perceive 10K x 10K resolution, but today's top vision models, like SigLIP and DINOv2, are still pre-trained at merely hundreds by hundreds of pixels, bottlenecking their real-world usage. Today, we …
Achieving 2x faster video generation by exploiting spatial-temporal sparsity. It is training-free and open-source!
Introducing #SparseVideoGen: 2× speedup in video generation with HunyuanVideo, with high pixel-level fidelity (PSNR = 29)! No training required, no perceptible difference to the human eye!
Blog: https://t.co/3IXnz7PTaC
Paper: https://t.co/oEHsk4lnaK
Code: …
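For reference, PSNR measures pixel-level fidelity between the sparse and dense outputs. A quick sketch of the computation, assuming frames scaled to [0, 1]:

```python
import numpy as np

def psnr(x: np.ndarray, y: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    mse = np.mean((x - y) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

ref = np.random.rand(8, 64, 64, 3)                   # toy reference frames
approx = ref + np.random.normal(0, 0.03, ref.shape)  # small perturbation
print(f"{psnr(ref, np.clip(approx, 0, 1)):.1f} dB")  # roughly 30 dB; ~29 dB is visually near-lossless
```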
We're excited to open-source an FP8 training technique, COAT: Compressing Optimizer States and Activations for Memory-Efficient FP8 Training. COAT was accepted at ICLR 2025! FP8 training effectively improves training efficiency; DeepSeek-V3 is a successful example of …
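A minimal sketch of the underlying storage idea, per-tensor scaled FP8 quantization of an optimizer state; this shows generic FP8 storage, not COAT's specific method, and requires a recent PyTorch with float8 dtypes:

```python
import torch

FP8 = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8).max  # 448.0 for e4m3

def to_fp8(t: torch.Tensor):
    """Store a tensor as FP8 plus one fp32 scale (per-tensor scaling)."""
    scale = t.abs().max().clamp(min=1e-12) / FP8_MAX
    return (t / scale).to(FP8), scale

def from_fp8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

m = torch.randn(1024) * 0.01          # e.g. an Adam first-moment state
q, s = to_fp8(m)
err = (from_fp8(q, s) - m).abs().max()
print(q.element_size(), err.item())   # 1 byte per element, small round-off error
```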
Hi everyone, I'm thrilled to announce that you can now try #SANA models in #ComfyUI. We show video generation using SANA+CogVideoX. SANA now also supports Chinese and emoji prompts. If you find SANA useful, we'd be grateful if you could give us a star at https://t.co/Yu57ikMEQT
SANA's code is released; enjoy it, and stars are welcome! SANA is an efficient linear DiT that can generate images from 1K to 4K.
Code: https://t.co/wl9dfP5qIe
Demo: https://t.co/wZdGw2hDwp
Highlights: 20× smaller & 100× faster than FLUX; deployable on a laptop GPU
We are excited to introduce the Deep Compression Autoencoder (DC-AE). It dramatically reduces the number of tokens in the latent space, delivering significant training and inference speedups for latent diffusion models.
Paper: https://t.co/MiQpP5uFlD
Code: https://t.co/qpQ57CjBSI
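To quantify the token reduction, here is the sequence-length arithmetic a diffusion transformer sees; the patch size p=2 is the common DiT choice, assumed here for illustration:

```python
# DiT sequence length: (H / f / p) * (W / f / p), where f is the autoencoder's
# spatial compression ratio and p the transformer patch size (p=2 assumed).
H = W = 1024
p = 2
for f in (8, 64):  # f8 ~ a conventional VAE, f64 ~ deep compression
    n = (H // f // p) * (W // f // p)
    print(f"f{f}: {n} tokens")
# f8:  4096 tokens
# f64: 64 tokens -> 64x fewer tokens; quadratic attention cost drops ~4096x
```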