
Dan Alistarh
@DAlistarh
Followers: 2K
Following: 78
Media: 41
Statuses: 142
Professor at IST Austria
Vienna
Joined May 2022
Paper: https://t.co/9jwxD1tU92 We release vLLM & HuggingFace integrations. Code: https://t.co/pvkXTyDlcO Kernels: https://t.co/KjO04HAHyq Credit goes to the team: @RobertoL_Castro, Vage, Denis, @black_samorez and @AshkboosSaleh, with support from @RedHat_AI and @thoefler
github.com
IST-DASLab/qutlass: QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
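Hypothetical sketch, not the official example from the release: loading an FP4-quantized checkpoint through the vLLM integration. The model name is a placeholder, and vLLM normally infers the quantization scheme from the checkpoint's own config, so no extra flag is passed here.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="IST-DASLab/some-mxfp4-model")   # placeholder model ID
outputs = llm.generate(
    ["Microscaling FP4 in one sentence:"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```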
QuTLASS kernels natively support any block-orthogonal rotations: - Weight rotations are directly applied before quantization - Activation micro-rotations applied efficiently at runtime - Works for both NVFP4 and MXFP4 on Blackwell - Currently, MXFP4 has a speed advantage (~10%)
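A NumPy sketch of the rotation trick, not the QuTLASS kernels themselves: because the block rotation R is orthogonal, rotating weights offline and micro-rotating activations at runtime leaves the matmul output unchanged, while spreading outliers across each block before FP4 quantization. The block size and the Hadamard construction below are illustrative.

```python
import numpy as np
from scipy.linalg import hadamard

B = 32                                   # rotation block size (16/32/64/128 supported)
R = hadamard(B) / np.sqrt(B)             # orthonormal block-Hadamard rotation

def block_rotate(x, R):
    # apply R independently to every contiguous block of size B on the last axis
    shape = x.shape
    return (x.reshape(-1, B) @ R.T).reshape(shape)

W = np.random.randn(64, 128)             # toy weight matrix
a = np.random.randn(128)                 # toy activation vector

W_rot = block_rotate(W, R)               # applied once, before weight quantization
a_rot = block_rotate(a, R)               # applied at runtime (fused into the kernel)

# the rotated matmul matches the original one up to floating-point error
assert np.allclose(W @ a, W_rot @ a_rot, atol=1e-6)
```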
Key findings (W&A quantization): - NVFP4 outperforms MXFP4 for round-to-nearest (RTN); - For RTN, rotations provably help MXFP4 but not NVFP4; - Micro-rotated (MR) GPTQ helps MXFP4 recover to within 1-2% of NVFP4, boosting NVFP4 too. - Large models achieve 98-99% recovery.
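For context, a deliberately simplified round-to-nearest (RTN) sketch of the two formats being compared: MXFP4 uses 32-element blocks with power-of-two (E8M0) scales, while NVFP4 uses 16-element blocks with FP8 (E4M3) scales plus a global FP32 scale; here the E4M3 scale is approximated by an unrestricted float and the global scale is omitted.

```python
import numpy as np

FP4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # E2M1 magnitudes
FP4 = np.concatenate([-FP4[::-1], FP4])

def rtn_fp4(block, scale):
    # round each scaled value to the nearest representable FP4 value
    q = block / scale
    return FP4[np.abs(q[:, None] - FP4[None, :]).argmin(axis=1)] * scale

def quantize(x, block_size, pow2_scale):
    out = np.empty_like(x)
    for i in range(0, x.size, block_size):
        b = x[i:i + block_size]
        s = np.abs(b).max() / 6.0 + 1e-12        # abs-max scaling onto the FP4 range
        if pow2_scale:                           # MXFP4-style E8M0 (power-of-two) scale
            s = 2.0 ** np.ceil(np.log2(s))
        out[i:i + block_size] = rtn_fp4(b, s)
    return out

x = np.random.randn(4096)
mse_mx = np.mean((x - quantize(x, 32, True)) ** 2)    # MXFP4-like blocks
mse_nv = np.mean((x - quantize(x, 16, False)) ** 2)   # NVFP4-like blocks
print(f"RTN MSE  MXFP4-like: {mse_mx:.5f}   NVFP4-like: {mse_nv:.5f}")
```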
We are releasing state-of-the-art post-training quantization (PTQ) algorithms for Microscaling FP4, together with kernels: - First study focused on MXFP4/NVFP4 PTQ for LLMs - New Micro-Rotated (MR) format and GPTQ algorithm - QuTLASS GPU kernels with up to 3.6x speedups.
Happy to share our new study on the interaction between #optimizers and #quantization! We show how optimizer choice affects quantized model quality and why outlier-based metrics (like Kurtosis and MMR) often fail to predict performance. Paper: https://t.co/Lm6wkExF92 [1/5]
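A toy illustration of the outlier metrics in question; reading "MMR" as the max-to-mean absolute ratio is my assumption, not the paper's definition.

```python
import numpy as np
from scipy.stats import kurtosis

def outlier_metrics(w):
    w = w.ravel()
    return {
        "kurtosis": float(kurtosis(w)),                    # heavy tails -> large value
        "mmr": float(np.abs(w).max() / np.abs(w).mean()),  # sensitivity to a single outlier
    }

w = np.random.randn(1024, 1024).astype(np.float32)
print(outlier_metrics(w))   # per the study, these numbers alone don't predict PTQ quality
```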
Credit goes to the main author, Erik Schultheis, and to @karpathy for the original llm.c that inspired us. FP4 training is next. The code is open source and we are happy to take contributions. Code: https://t.co/LSuLL5GaKR Discord:
github.com
IST-DASLab/llmq: Quantized LLM training in pure CUDA/C++.
Real example: - We trained a TinyLlama-quality 1.5B model on 10B ClimbMix tokens using 4x RTX 4090s. - Total time: 40 hours. - Total cost on vast.ai: under $50! - Matches TinyLlama on many benchmarks despite 100x less training data - Up to 2x faster than BF16 training.
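Back-of-the-envelope check of the cost figure, assuming roughly $0.30 per RTX 4090 hour on vast.ai (an assumption; rental prices fluctuate):

```python
gpus, hours, usd_per_gpu_hour = 4, 40, 0.30                # assumed hourly rate
print(f"~${gpus * hours * usd_per_gpu_hour:.0f} total")    # ~$48, i.e. under $50
```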
Technical highlights in this first version: - Tensor-wise FP8 matmuls throughout training (E4M3 format) - Optimized PCIe communication for consumer cards - Offloading fits 14B models on 4x RTX4090s - Fine-grained activation checkpointing - 60-70% of speed-of-light on modern GPUs
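A minimal PyTorch sketch of the tensor-wise E4M3 recipe, assuming a recent PyTorch with float8 dtypes; LLM.Q itself runs this as fused CUDA/C++ kernels on FP8 tensor cores, whereas here the GEMM is merely emulated by upcasting.

```python
import torch

E4M3_MAX = 448.0   # largest finite value representable in float8_e4m3fn

def quantize_e4m3(x):
    scale = x.abs().amax() / E4M3_MAX            # one scale for the whole tensor
    return (x / scale).to(torch.float8_e4m3fn), scale

def fp8_matmul(a, b):
    a_q, sa = quantize_e4m3(a)
    b_q, sb = quantize_e4m3(b)
    # emulate the FP8 GEMM by upcasting; real kernels keep the inputs in FP8
    return (a_q.to(torch.bfloat16) @ b_q.to(torch.bfloat16)) * (sa * sb)

a = torch.randn(128, 256, dtype=torch.bfloat16)
b = torch.randn(256, 64, dtype=torch.bfloat16)
print((fp8_matmul(a, b) - a @ b).abs().mean())   # small quantization error
```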
Introducing LLM.Q: Quantized LLM training in pure CUDA/C++! With LLM.Q, you can train your own LLM with natively quantized matmuls on consumer GPUs, in a single workstation. No datacenter required. Inspired by @karpathy's llm.c, but natively quantized.
Created by our interns Max Kleinegger & Michael Helcig, supported by the entire DASLab team. Get started: Code: https://t.co/lI8kNdJqU4 Based on: GPTQ (Frantar et al., ICLR23) & EvoPress (Sieberling et al., ICML25). Contributors welcome!
github.com
IST-DASLab/gptq-gguf-toolkit: GPTQ and efficient search for GGUF.
Academic compression research meets practical deployment:
- Deploy larger models on consumer hardware leveraging llama.cpp
- First GPTQ→GGUF implementation
- Works with llama.cpp & all GGUF tools (usage sketch below)
- Accuracy evaluation via eval suite
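A hypothetical usage sketch for a toolkit-produced GGUF file via the llama-cpp-python bindings; the file name below is a placeholder.

```python
from llama_cpp import Llama

llm = Llama(model_path="llama-3.1-8b-gptq-evopress.gguf", n_ctx=4096)  # placeholder file
out = llm("Explain GPTQ in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```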
Benchmarked against popular Unsloth Dynamic 2.0 GGUFs on Llama 3.1 8B: - 3-bit: 13.17 vs 13.75 perplexity (lower = better) - 4.88-bit: 11.18 vs 11.23 perplexity - Zero-shot tasks: comparable or superior. Matching or improving on SOTA at every bitwidth tested.
How it works: We use EvoPress (ICML25) evolutionary search to discover optimal per-layer configurations, together with GPTQ for quantization. E.g., attention layers might need 6 bits while FFN layers compress to 3 bits - all automatically optimized.
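A toy sketch of that search loop, heavily simplified: the score function below is a stand-in (EvoPress evaluates candidates on real calibration data), and the layer names, bit choices, and sensitivities are illustrative assumptions.

```python
import random

LAYERS = [f"blk.{i}.{kind}" for i in range(4) for kind in ("attn", "ffn")]
CHOICES = [3, 4, 6, 8]          # candidate bitwidths per layer
BUDGET = 4.5                    # target average bits per weight

def avg_bits(cfg):
    return sum(cfg.values()) / len(cfg)

def score(cfg):
    # stand-in for measured model quality: attention layers assumed more sensitive
    err = sum((8 - b) ** 2 * (2.0 if "attn" in name else 1.0) for name, b in cfg.items())
    penalty = 1e3 * max(0.0, avg_bits(cfg) - BUDGET)   # enforce the size budget
    return -(err + penalty)

def mutate(cfg):
    child = dict(cfg)
    child[random.choice(LAYERS)] = random.choice(CHOICES)
    return child

pop = [{name: random.choice(CHOICES) for name in LAYERS} for _ in range(16)]
for _ in range(300):
    pop += [mutate(random.choice(pop)) for _ in range(16)]
    pop = sorted(pop, key=score, reverse=True)[:16]     # keep the fittest configurations

best = pop[0]
print(f"avg bits = {avg_bits(best):.2f}", sorted(best.items()))
# typically attention layers end up with more bits than FFN layers
```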
We're releasing the DASLab GGUF Quantization Toolkit! First open-source toolkit bringing GPTQ + EvoPress to @ggerganov's GGUF format, enabling heterogeneous quantization based on importance. Result: Better models at the same file size. [1/5]
In short, QuTLASS v0.1 makes microscaling practical for NVFP4 inference on Blackwell! Try it here: https://t.co/KjO04HAHyq or directly in vLLM here: https://t.co/vlRlRErAG0 QuTLASS is driven by @RobertoL_Castro with help from @black_samorez and the entire DASLab team!
github.com
Purpose This pull request brings in the QuTLASS library: https://github.com/iST-DASLab/qutlass QuTLASS is a high-performance library designed for low-precision kernel support in deep learning quant...
Performance: - Benchmarks on Qwen3-32B (RTX 5090) & Llama-3.1-70B (B200) showing end-to-end speedup - MXFP4 & NVFP4 kernels deliver near-optimal throughput - Up to 98% recovery vs FP16 on standard tasks (e.g. MMLU) - Experimental models at https://t.co/5KAG6VfDmC
Usability and quantization options: - Abs-Max Scaling - MSE / Quartet / Quest-like scaling - Multi-size rotations (16/32/64/128) for MXFP4 & NVFP4 - Seamless integration with vLLM (PR #24440), see below.
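A rough sketch contrasting the first two scaling options for a single FP4 block: plain abs-max versus a small grid search over shrunken scales that minimizes MSE (the actual Quartet/Quest-style criteria are more involved than this).

```python
import numpy as np

FP4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # E2M1 magnitudes
FP4 = np.concatenate([-FP4[::-1], FP4])

def dequant(block, scale):
    q = block / scale
    return FP4[np.abs(q[:, None] - FP4[None, :]).argmin(axis=1)] * scale

def absmax_scale(block):
    return np.abs(block).max() / 6.0

def mse_scale(block, shrink=np.linspace(0.5, 1.0, 32)):
    # search for the scale (a shrunken abs-max) that minimizes reconstruction MSE
    base = absmax_scale(block)
    errs = [np.mean((block - dequant(block, c * base)) ** 2) for c in shrink]
    return shrink[int(np.argmin(errs))] * base

block = np.random.randn(32)
for name, s in [("abs-max", absmax_scale(block)), ("MSE search", mse_scale(block))]:
    print(f"{name:>10}: MSE = {np.mean((block - dequant(block, s)) ** 2):.5f}")
```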
What's new in QuTLASS v0.1.0: - Support for NVIDIA B200 - NVFP4 microscaling with full W4A4 quantization - Online rotations: fused transform + quantization + scaling - Runtime-loaded rotation matrices (flexible transforms!)
Excited to announce QuTLASS v0.1.0! QuTLASS is a high-performance library for low-precision deep learning kernels, following NVIDIA CUTLASS. The new release brings 4-bit NVFP4 microscaling and fast transforms to NVIDIA Blackwell GPUs (including the B200!) [1/N]
The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm
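A minimal NumPy sketch of the GPTQ loop that this paper reinterprets as Babai's nearest-plane algorithm: quantize one column at a time and feed the error back onto the not-yet-quantized columns through the Cholesky factor of the inverse Hessian. Heavily simplified (no blocking, no grouping, one global scale).

```python
import numpy as np

def quant_rtn(w, scale):
    # symmetric 4-bit round-to-nearest
    return np.clip(np.round(w / scale), -8, 7) * scale

def gptq(W, X, scale):
    # W: (rows, cols) weights; X: (samples, cols) calibration inputs
    H = X.T @ X + 1e-2 * np.eye(W.shape[1])          # damped Hessian
    U = np.linalg.cholesky(np.linalg.inv(H)).T       # upper Cholesky factor of H^-1
    W = W.copy()
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = quant_rtn(W[:, j], scale)
        err = (W[:, j] - Q[:, j]) / U[j, j]
        # error feedback onto the remaining columns (the nearest-plane step)
        W[:, j + 1:] -= np.outer(err, U[j, j + 1:])
    return Q

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64))
W = rng.standard_normal((32, 64))
scale = np.abs(W).max() / 7
proxy = lambda Q: np.trace((W - Q) @ (X.T @ X) @ (W - Q).T)   # layer-wise proxy loss
print("RTN :", proxy(quant_rtn(W, scale)))
print("GPTQ:", proxy(gptq(W, X, scale)))   # GPTQ should come out noticeably lower
```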