Dan Alistarh Profile
Dan Alistarh

@DAlistarh

Followers: 2K
Following: 78
Media: 41
Statuses: 142

Professor at IST Austria

Vienna
Joined May 2022
@DAlistarh
Dan Alistarh
14 days
QuTLASS kernels natively support any block-orthogonal rotations:
- Weight rotations are directly applied before quantization
- Activation micro-rotations applied efficiently at runtime
- Works for both NVFP4 and MXFP4 on Blackwell
- Currently, MXFP4 has a speed advantage (~10%)
1
0
4
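A way to picture the rotation scheme above (a minimal numpy sketch of the idea, not the QuTLASS kernels; the block size of 32 is just an example): each block of weight columns is rotated once offline before quantization, the matching micro-rotation is applied to activations at runtime, and the layer output is preserved because the rotation is orthogonal.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix via the Sylvester construction; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rotate_blocks(M, block=32):
    """Apply the same orthogonal rotation to every contiguous group of `block` columns."""
    H = hadamard(block)
    out = np.empty_like(M, dtype=float)
    for s in range(0, M.shape[1], block):
        out[:, s:s + block] = M[:, s:s + block] @ H
    return out

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 128))   # weights, [out_features, in_features]
x = rng.standard_normal((4, 128))    # activations, [batch, in_features]

W_rot = rotate_blocks(W)   # rotated once offline, then quantized (quantizer omitted here)
x_rot = rotate_blocks(x)   # matching micro-rotation applied to activations at runtime

# The block rotation is orthogonal, so the layer output is unchanged,
# up to the quantization error that the rotation is meant to reduce.
print(np.allclose(x_rot @ W_rot.T, x @ W.T))   # True
```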
@DAlistarh
Dan Alistarh
14 days
Key findings (W&A quantization):
- NVFP4 outperforms MXFP4 for round-to-nearest (RTN)
- For RTN, rotations provably help MXFP4 but not NVFP4
- Micro-rotated (MR) GPTQ helps MXFP4 recover to within 1-2% of NVFP4, boosting NVFP4 too
- Large models achieve 98-99% recovery
1
0
3
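For readers less familiar with the two formats compared above: MXFP4 shares a power-of-two (E8M0) scale across 32 values, while NVFP4 shares an FP8 (E4M3) scale across 16 values, so NVFP4's scales are finer-grained. Below is a rough numpy sketch of round-to-nearest (RTN) into each format; it is my own simplification, and in particular the FP8 rounding of NVFP4's scale is omitted.

```python
import numpy as np

FP4 = np.array([0, .5, 1, 1.5, 2, 3, 4, 6])
FP4 = np.concatenate([-FP4[::-1], FP4])          # signed E2M1 representable values

def rtn(x, scale):
    """Round each scaled value to the nearest FP4 grid point."""
    return FP4[np.abs(x[:, None] / scale - FP4).argmin(axis=1)] * scale

def quant_mxfp4(block32):
    # MXFP4: power-of-two shared scale per 32-element block
    scale = 2.0 ** np.ceil(np.log2(np.abs(block32).max() / 6.0 + 1e-12))
    return rtn(block32, scale)

def quant_nvfp4(block16):
    # NVFP4: finer-grained FP8 (E4M3) shared scale per 16-element block
    scale = np.abs(block16).max() / 6.0 + 1e-12   # FP8 rounding of the scale omitted
    return rtn(block16, scale)

x = np.random.default_rng(0).standard_normal(32)
err_mx = np.abs(quant_mxfp4(x) - x).mean()
err_nv = np.abs(np.concatenate([quant_nvfp4(x[:16]), quant_nvfp4(x[16:])]) - x).mean()
print(f"MXFP4 RTN error {err_mx:.4f} vs NVFP4 RTN error {err_nv:.4f}")
```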
@DAlistarh
Dan Alistarh
14 days
🚀 We are releasing state-of-the-art post-training quantization (PTQ) algorithms for Microscaling FP4, together with kernels:
- First study focused on MXFP4/NVFP4 PTQ for LLMs
- New Micro-Rotated (MR) format and GPTQ algorithm
- QuTLASS GPU kernels with up to 3.6x speedups.
1
28
150
@AshkboosSaleh
Saleh Ashkboos
20 days
Happy to share our new study on the interaction between #optimizers and #quantization! We show how optimizer choice affects quantized model quality and why outlier-based metrics (like Kurtosis and MMR) often fail to predict performance. Paper: https://t.co/Lm6wkExF92 [1/5]
3
8
29
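For context on the outlier metrics mentioned above: kurtosis is the standard fourth-moment statistic, which grows when a tensor is heavy-tailed. A quick illustration of how such a metric is computed (my own sketch, not the paper's code; MMR is left out since its exact definition is paper-specific):

```python
import numpy as np

def kurtosis(x: np.ndarray) -> float:
    """Fisher kurtosis: heavy-tailed (outlier-prone) tensors score > 0, Gaussian-like tensors ~ 0."""
    x = x.ravel()
    mu, sigma = x.mean(), x.std()
    return float(((x - mu) ** 4).mean() / sigma ** 4 - 3.0)

rng = np.random.default_rng(0)
w_gauss = rng.standard_normal(10_000)          # well-behaved weights
w_heavy = rng.standard_t(df=3, size=10_000)    # heavy-tailed weights with outliers
print(kurtosis(w_gauss), kurtosis(w_heavy))    # ~0 vs clearly positive
```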
@DAlistarh
Dan Alistarh
21 days
Credit goes to the main author, Erik Schultheis, and to @karpathy for the original llm.c that inspired us. FP4 training is next. The code is open source and we are happy to take contributions. Code: https://t.co/LSuLL5GaKR Discord:
github.com
Quantized LLM training in pure CUDA/C++. Contribute to IST-DASLab/llmq development by creating an account on GitHub.
0
3
15
@DAlistarh
Dan Alistarh
21 days
Real example:
- We trained a TinyLlama-quality 1.5B model on 10B ClimbMix tokens using 4x RTX 4090s.
- Total time: 40 hours.
- Total cost on vast.ai: under $50!
- Matches TinyLlama on many benchmarks despite 100x less training data
- Up to 2x faster than BF16 training.
1
1
7
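A quick sanity check of the cost figure, assuming a hypothetical rental price of roughly $0.30 per RTX 4090 per hour (actual vast.ai prices vary):

```python
gpus = 4
hours = 40
price_per_gpu_hour = 0.30   # assumed rental rate in USD; real marketplace prices fluctuate
total = gpus * hours * price_per_gpu_hour
print(f"~${total:.0f}")     # ~$48, consistent with "under $50"
```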
@DAlistarh
Dan Alistarh
21 days
Technical highlights in this first version:
- Tensor-wise FP8 matmuls throughout training (E4M3 format)
- Optimized PCIe communication for consumer cards
- Offloading fits 14B models on 4x RTX4090s
- Fine-grained activation checkpointing
- 60-70% of speed-of-light on modern GPUs
1
0
3
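What "tensor-wise FP8 matmuls (E4M3)" means in practice can be sketched in a few lines: one scale per tensor maps values into the E4M3 range (maximum magnitude 448), the matmul runs on the quantized values, and the output is rescaled. This is only a numpy illustration of the idea, not LLM.Q's CUDA implementation, and the E4M3 rounding is approximated.

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite magnitude in FP8 E4M3

def fake_e4m3(x):
    """Approximate E4M3 rounding: keep 3 mantissa bits (subnormal/NaN handling omitted)."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    exp = np.floor(np.log2(np.maximum(np.abs(x), 1e-30)))
    step = 2.0 ** (exp - 3)                      # spacing of a 3-bit mantissa at this exponent
    return np.round(x / step) * step

def fp8_matmul(A, B):
    """Tensor-wise scaled FP8 matmul: quantize each input once per tensor, rescale the output."""
    sa = np.abs(A).max() / E4M3_MAX
    sb = np.abs(B).max() / E4M3_MAX
    Aq = fake_e4m3(A / sa)
    Bq = fake_e4m3(B / sb)
    return (Aq @ Bq) * (sa * sb)

rng = np.random.default_rng(0)
A, B = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
print(np.abs(fp8_matmul(A, B) - A @ B).max())    # small error vs. the exact product
```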
@DAlistarh
Dan Alistarh
21 days
Introducing LLM.Q: Quantized LLM training in pure CUDA/C++! With LLM.Q, you can train your own LLM on consumer GPUs with natively quantized matmuls, on single workstations. No datacenter required. Inspired by @karpathy's llm.c, but natively quantized.
3
16
140
@DAlistarh
Dan Alistarh
1 month
Created by our interns Max Kleinegger & Michael Helcig, supported by the entire DASLab team. Get started: Code: https://t.co/lI8kNdJqU4 Based on: GPTQ (Frantar et al., ICLR23) & EvoPress (Sieberling et al., ICML25). Contributors welcome!
github.com
GPTQ and efficient search for GGUF. Contribute to IST-DASLab/gptq-gguf-toolkit development by creating an account on GitHub.
1
4
24
@DAlistarh
Dan Alistarh
1 month
Academic compression research meets practical deployment:
✅ Deploy larger models on consumer hardware leveraging llama.cpp
✅ First GPTQ→GGUF implementation
✅ Works with llama.cpp & all GGUF tools
✅ Accuracy evaluation via eval suite
1
0
13
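One way to run a resulting GGUF file from Python is the llama-cpp-python binding to llama.cpp. A hedged sketch: the file name below is a placeholder, and the arguments shown are the commonly used ones, so check the binding's documentation for your version.

```python
# pip install llama-cpp-python   (CPU build; GPU builds need extra install flags)
from llama_cpp import Llama

# "model.gguf" is a placeholder for a file produced with the GPTQ->GGUF toolkit.
llm = Llama(model_path="model.gguf", n_ctx=2048)
out = llm("Explain GPTQ in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```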
@DAlistarh
Dan Alistarh
1 month
📊 Benchmarked against popular Unsloth Dynamic 2.0 GGUFs on Llama 3.1 8B:
- 3-bit: 13.17 vs 13.75 perplexity (lower = better)
- 4.88-bit: 11.18 vs 11.23 perplexity
- Zero-shot tasks: comparable or superior
Matching or improving on SOTA at every bitwidth tested.
1
0
11
@DAlistarh
Dan Alistarh
1 month
How it works: We use EvoPress (ICML25) evolutionary search to discover optimal per-layer configurations, together with GPTQ for quantization. E.g., attention layers might need 6 bits while FFN layers compress to 3 bits - all automatically optimized.
1
0
21
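The search described above can be pictured with a toy (1+1) evolutionary loop over per-layer bit-widths: mutate an assignment, keep it if it fits the size budget and improves a fitness proxy. Everything below is a made-up stand-in (layer names, budget, and especially the fitness function, which in real EvoPress comes from evaluating the compressed model); only the search structure is illustrative.

```python
import random

LAYERS = ["attn.0", "ffn.0", "attn.1", "ffn.1"]   # toy model
CHOICES = [3, 4, 6, 8]                             # candidate bit-widths per layer
BUDGET = 4.5                                       # target average bits per layer

def fitness(cfg):
    """Stand-in for measuring quality loss of the quantized model (lower is better).
    Here we simply pretend attention layers are more sensitive than FFN layers."""
    return sum((8 - bits) * (2.0 if name.startswith("attn") else 1.0)
               for name, bits in cfg.items())

def avg_bits(cfg):
    return sum(cfg.values()) / len(cfg)

def mutate(cfg):
    """Change the bit-width of one randomly chosen layer."""
    child = dict(cfg)
    child[random.choice(LAYERS)] = random.choice(CHOICES)
    return child

random.seed(0)
best = {name: 4 for name in LAYERS}                # uniform 4-bit starting point
for _ in range(200):
    child = mutate(best)
    if avg_bits(child) <= BUDGET and fitness(child) < fitness(best):
        best = child
print(best, avg_bits(best))   # attention layers drift to more bits, FFN layers to fewer
```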
@DAlistarh
Dan Alistarh
1 month
We're releasing the DASLab GGUF Quantization Toolkit! 🚀 First open-source toolkit bringing GPTQ + EvoPress to @ggerganov's GGUF format, enabling heterogeneous quantization based on importance. Result: Better models at the same file size. [1/5]
4
50
270
@DAlistarh
Dan Alistarh
1 month
In short, QuTLASS v0.1 makes microscaling practical for NVFP4 inference on Blackwell! 👉 Try it here: https://t.co/KjO04HAHyq or directly in vLLM here: https://t.co/vlRlRErAG0 QuTLASS is driven by @RobertoL_Castro with help from @black_samorez and the entire DASLab team!
github.com
Purpose: This pull request brings in the QuTLASS library: https://github.com/iST-DASLab/qutlass QuTLASS is a high-performance library designed for low-precision kernel support in deep learning quant...
1
1
10
@DAlistarh
Dan Alistarh
1 month
📊 Performance
- Benchmarks on Qwen3-32B (RTX5090) & Llama-3.1-70B (B200) showing end-to-end speedup
- MXFP4 & NVFP4 kernels deliver near-optimal throughput
- Up to 98% recovery vs FP16 on standard tasks (e.g. MMLU)
- Experimental models at https://t.co/5KAG6VfDmC
1
0
5
@DAlistarh
Dan Alistarh
1 month
🧩 Usability and quantization options
- Abs-Max Scaling
- MSE / Quartet / Quest-like scaling
- Multi-size rotations (16/32/64/128) for MXFP4 & NVFP4
- Seamless integration with vLLM (PR #24440), see below.
1
0
5
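The difference between the first two scaling options listed above can be shown in a few lines: abs-max picks the scale that exactly covers the largest value, while an MSE-style search sweeps clipping factors and keeps the one with the lowest reconstruction error. A numpy sketch on a plain FP4 (E2M1) grid, not the QuTLASS kernels themselves:

```python
import numpy as np

FP4 = np.array([0, .5, 1, 1.5, 2, 3, 4, 6])
FP4 = np.concatenate([-FP4[::-1], FP4])                # signed E2M1 values

def quantize(x, scale):
    """Round-to-nearest onto the FP4 grid under a given scale."""
    return FP4[np.abs(x[:, None] / scale - FP4).argmin(axis=1)] * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(128)
x[0] = 12.0                                            # a single outlier

# Abs-max scaling: the largest magnitude maps exactly to the FP4 maximum (6).
s_absmax = np.abs(x).max() / 6.0
err_absmax = np.mean((quantize(x, s_absmax) - x) ** 2)

# MSE-style scaling: sweep clipping factors and keep the one with the lowest error.
candidates = [s_absmax * c for c in np.linspace(0.3, 1.0, 15)]
errs = [np.mean((quantize(x, s) - x) ** 2) for s in candidates]
s_mse = candidates[int(np.argmin(errs))]

print(f"abs-max MSE {err_absmax:.4f}  vs  searched MSE {min(errs):.4f}")
```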
@DAlistarh
Dan Alistarh
1 month
✨ What’s new in QuTLASS v0.1.0:
🔹 Support for NVIDIA B200
🔹 NVFP4 microscaling with full W4A4 quantization
🔹 Online rotations: fused transform + quantization + scaling
🔹 Runtime-loaded rotation matrices (flexible transforms!)
1
0
6
@DAlistarh
Dan Alistarh
1 month
🚀 Excited to announce QuTLASS v0.1.0 🎉 QuTLASS is a high-performance library for low-precision deep learning kernels, following NVIDIA CUTLASS. The new release brings 4-bit NVFP4 microscaling and fast transforms to NVIDIA Blackwell GPUs (including the B200!) [1/N]
3
34
221
@_akhaliq
AK
3 months
The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm
2
9
29
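For context on the title above: Babai's nearest plane algorithm finds an approximate closest lattice point by walking the Gram-Schmidt basis from the last vector to the first, rounding one integer coefficient at a time; the paper reads GPTQ's column-by-column rounding through that lens. A small numpy sketch of the classical algorithm on a toy lattice (my own example, not the paper's formulation):

```python
import numpy as np

def gram_schmidt(B):
    """Unnormalized Gram-Schmidt orthogonalization of the columns of B."""
    Bstar = B.astype(float)
    for j in range(B.shape[1]):
        for k in range(j):
            bk = Bstar[:, k]
            Bstar[:, j] -= (B[:, j] @ bk) / (bk @ bk) * bk
    return Bstar

def babai_nearest_plane(B, t):
    """Approximate the lattice point (integer combination of B's columns) closest to t."""
    Bstar = gram_schmidt(B)
    residual = t.astype(float)
    coeffs = np.zeros(B.shape[1])
    for j in reversed(range(B.shape[1])):
        # Round the coefficient along the j-th Gram-Schmidt direction, then peel it off.
        c = round((residual @ Bstar[:, j]) / (Bstar[:, j] @ Bstar[:, j]))
        coeffs[j] = c
        residual = residual - c * B[:, j]
    return B @ coeffs, coeffs

B = np.array([[2.0, 1.0],
              [0.0, 3.0]])      # lattice basis (columns)
t = np.array([4.3, 2.8])        # target point
v, z = babai_nearest_plane(B, t)
print(v, z)                     # lattice point near t and its integer coordinates
```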