
Dan Alistarh
@DAlistarh
Followers: 2K
Following: 78
Media: 41
Statuses: 142
Professor at IST Austria
Vienna
Joined May 2022
Paper: https://t.co/9jwxD1tU92 We release vLLM & HuggingFace integrations. Code: https://t.co/pvkXTyDlcO Kernels: https://t.co/KjO04HAHyq Credit goes to the team: @RobertoL_Castro, Vage, Denis, @black_samorez and @AshkboosSaleh, with support from @RedHat_AI and @thoefler
github.com
IST-DASLab/qutlass: QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
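Hypothetical sketch, not the official example from the release: loading an FP4-quantized checkpoint through the vLLM integration. The model name is a placeholder, and vLLM normally infers the quantization scheme from the checkpoint's own config, so no extra flag is passed here.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="IST-DASLab/some-mxfp4-model")   # placeholder model ID
outputs = llm.generate(
    ["Microscaling FP4 in one sentence:"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```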
QuTLASS kernels natively support any block-orthogonal rotations: - Weight rotations are directly applied before quantization - Activation micro-rotations applied efficiently at runtime - Works for both NVFP4 and MXFP4 on Blackwell - Currently, MXFP4 has a speed advantage (~10%)
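A NumPy sketch of the rotation trick, not the QuTLASS kernels themselves: because the block rotation R is orthogonal, rotating weights offline and micro-rotating activations at runtime leaves the matmul output unchanged, while spreading outliers across each block before FP4 quantization. The block size and the Hadamard construction below are illustrative.

```python
import numpy as np
from scipy.linalg import hadamard

B = 32                                   # rotation block size (16/32/64/128 supported)
R = hadamard(B) / np.sqrt(B)             # orthonormal block-Hadamard rotation

def block_rotate(x, R):
    # apply R independently to every contiguous block of size B on the last axis
    shape = x.shape
    return (x.reshape(-1, B) @ R.T).reshape(shape)

W = np.random.randn(64, 128)             # toy weight matrix
a = np.random.randn(128)                 # toy activation vector

W_rot = block_rotate(W, R)               # applied once, before weight quantization
a_rot = block_rotate(a, R)               # applied at runtime (fused into the kernel)

# the rotated matmul matches the original one up to floating-point error
assert np.allclose(W @ a, W_rot @ a_rot, atol=1e-6)
```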
Key findings (W&A quantization): - NVFP4 outperforms MXFP4 for round-to-nearest (RTN); - For RTN, rotations provably help MXFP4 but not NVFP4; - Micro-rotated (MR) GPTQ helps MXFP4 recover to within 1-2% of NVFP4, boosting NVFP4 too. - Large models achieve 98-99% recovery.
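For context, a deliberately simplified round-to-nearest (RTN) sketch of the two formats being compared: MXFP4 uses 32-element blocks with power-of-two (E8M0) scales, while NVFP4 uses 16-element blocks with FP8 (E4M3) scales plus a global FP32 scale; here the E4M3 scale is approximated by an unrestricted float and the global scale is omitted.

```python
import numpy as np

FP4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # E2M1 magnitudes
FP4 = np.concatenate([-FP4[::-1], FP4])

def rtn_fp4(block, scale):
    # round each scaled value to the nearest representable FP4 value
    q = block / scale
    return FP4[np.abs(q[:, None] - FP4[None, :]).argmin(axis=1)] * scale

def quantize(x, block_size, pow2_scale):
    out = np.empty_like(x)
    for i in range(0, x.size, block_size):
        b = x[i:i + block_size]
        s = np.abs(b).max() / 6.0 + 1e-12        # abs-max scaling onto the FP4 range
        if pow2_scale:                           # MXFP4-style E8M0 (power-of-two) scale
            s = 2.0 ** np.ceil(np.log2(s))
        out[i:i + block_size] = rtn_fp4(b, s)
    return out

x = np.random.randn(4096)
mse_mx = np.mean((x - quantize(x, 32, True)) ** 2)    # MXFP4-like blocks
mse_nv = np.mean((x - quantize(x, 16, False)) ** 2)   # NVFP4-like blocks
print(f"RTN MSE  MXFP4-like: {mse_mx:.5f}   NVFP4-like: {mse_nv:.5f}")
```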
We are releasing state-of-the-art post-training quantization (PTQ) algorithms for Microscaling FP4, together with kernels: - First study focused on MXFP4/NVFP4 PTQ for LLMs - New Micro-Rotated (MR) format and GPTQ algorithm - QuTLASS GPU kernels with up to 3.6x speedups.
Happy to share our new study on the interaction between #optimizers and #quantization! We show how optimizer choice affects quantized model quality and why outlier-based metrics (like Kurtosis and MMR) often fail to predict performance. Paper: https://t.co/Lm6wkExF92 [1/5]
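A toy illustration of the outlier metrics in question; reading "MMR" as the max-to-mean absolute ratio is my assumption, not the paper's definition.

```python
import numpy as np
from scipy.stats import kurtosis

def outlier_metrics(w):
    w = w.ravel()
    return {
        "kurtosis": float(kurtosis(w)),                    # heavy tails -> large value
        "mmr": float(np.abs(w).max() / np.abs(w).mean()),  # sensitivity to a single outlier
    }

w = np.random.randn(1024, 1024).astype(np.float32)
print(outlier_metrics(w))   # per the study, these numbers alone don't predict PTQ quality
```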
Credit goes to the main author, Erik Schultheis, and to @karpathy for the original llm.c that inspired us. FP4 training is next. The code is open source and we are happy to take contributions. Code: https://t.co/LSuLL5GaKR Discord:
github.com
IST-DASLab/llmq: Quantized LLM training in pure CUDA/C++.
Real example: - We trained a TinyLlama-quality 1.5B model on 10B ClimbMix tokens using 4x RTX 4090s. - Total time: 40 hours. - Total cost on vast.ai: under $50! - Matches TinyLlama on many benchmarks despite 100x less training data - Up to 2x faster than BF16 training.
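Back-of-the-envelope check of the cost figure, assuming roughly $0.30 per RTX 4090 hour on vast.ai (an assumption; rental prices fluctuate):

```python
gpus, hours, usd_per_gpu_hour = 4, 40, 0.30                # assumed hourly rate
print(f"~${gpus * hours * usd_per_gpu_hour:.0f} total")    # ~$48, i.e. under $50
```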
Technical highlights in this first version: - Tensor-wise FP8 matmuls throughout training (E4M3 format) - Optimized PCIe communication for consumer cards - Offloading fits 14B models on 4x RTX4090s - Fine-grained activation checkpointing - 60-70% of speed-of-light on modern GPUs
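A minimal PyTorch sketch of the tensor-wise E4M3 recipe, assuming a recent PyTorch with float8 dtypes; LLM.Q itself runs this as fused CUDA/C++ kernels on FP8 tensor cores, whereas here the GEMM is merely emulated by upcasting.

```python
import torch

E4M3_MAX = 448.0   # largest finite value representable in float8_e4m3fn

def quantize_e4m3(x):
    scale = x.abs().amax() / E4M3_MAX            # one scale for the whole tensor
    return (x / scale).to(torch.float8_e4m3fn), scale

def fp8_matmul(a, b):
    a_q, sa = quantize_e4m3(a)
    b_q, sb = quantize_e4m3(b)
    # emulate the FP8 GEMM by upcasting; real kernels keep the inputs in FP8
    return (a_q.to(torch.bfloat16) @ b_q.to(torch.bfloat16)) * (sa * sb)

a = torch.randn(128, 256, dtype=torch.bfloat16)
b = torch.randn(256, 64, dtype=torch.bfloat16)
print((fp8_matmul(a, b) - a @ b).abs().mean())   # small quantization error
```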
Introducing LLM.Q: Quantized LLM training in pure CUDA/C++! With LLM.Q, you can train your own LLM with natively quantized matmuls on consumer GPUs, in a single workstation. No datacenter required. Inspired by @karpathy's llm.c, but natively quantized.
Created by our interns Max Kleinegger & Michael Helcig, supported by the entire DASLab team. Get started: Code: https://t.co/lI8kNdJqU4 Based on: GPTQ (Frantar et al., ICLR23) & EvoPress (Sieberling et al., ICML25). Contributors welcome!
github.com
IST-DASLab/gptq-gguf-toolkit: GPTQ and efficient search for GGUF.
Academic compression research meets practical deployment:
- Deploy larger models on consumer hardware leveraging llama.cpp
- First GPTQ→GGUF implementation
- Works with llama.cpp & all GGUF tools (usage sketch below)
- Accuracy evaluation via eval suite
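A hypothetical usage sketch for a toolkit-produced GGUF file via the llama-cpp-python bindings; the file name below is a placeholder.

```python
from llama_cpp import Llama

llm = Llama(model_path="llama-3.1-8b-gptq-evopress.gguf", n_ctx=4096)  # placeholder file
out = llm("Explain GPTQ in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```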
Benchmarked against popular Unsloth Dynamic 2.0 GGUFs on Llama 3.1 8B: - 3-bit: 13.17 vs 13.75 perplexity (lower = better) - 4.88-bit: 11.18 vs 11.23 perplexity - Zero-shot tasks: comparable or superior. Matching or improving on SOTA at every bitwidth tested.
How it works: We use EvoPress (ICML25) evolutionary search to discover optimal per-layer configurations, together with GPTQ for quantization. E.g., attention layers might need 6 bits while FFN layers compress to 3 bits - all automatically optimized.
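A toy sketch of that search loop, heavily simplified: the score function below is a stand-in (EvoPress evaluates candidates on real calibration data), and the layer names, bit choices, and sensitivities are illustrative assumptions.

```python
import random

LAYERS = [f"blk.{i}.{kind}" for i in range(4) for kind in ("attn", "ffn")]
CHOICES = [3, 4, 6, 8]          # candidate bitwidths per layer
BUDGET = 4.5                    # target average bits per weight

def avg_bits(cfg):
    return sum(cfg.values()) / len(cfg)

def score(cfg):
    # stand-in for measured model quality: attention layers assumed more sensitive
    err = sum((8 - b) ** 2 * (2.0 if "attn" in name else 1.0) for name, b in cfg.items())
    penalty = 1e3 * max(0.0, avg_bits(cfg) - BUDGET)   # enforce the size budget
    return -(err + penalty)

def mutate(cfg):
    child = dict(cfg)
    child[random.choice(LAYERS)] = random.choice(CHOICES)
    return child

pop = [{name: random.choice(CHOICES) for name in LAYERS} for _ in range(16)]
for _ in range(300):
    pop += [mutate(random.choice(pop)) for _ in range(16)]
    pop = sorted(pop, key=score, reverse=True)[:16]     # keep the fittest configurations

best = pop[0]
print(f"avg bits = {avg_bits(best):.2f}", sorted(best.items()))
# typically attention layers end up with more bits than FFN layers
```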
We're releasing the DASLab GGUF Quantization Toolkit! First open-source toolkit bringing GPTQ + EvoPress to @ggerganov's GGUF format, enabling heterogeneous quantization based on importance. Result: Better models at the same file size. [1/5]
In short, QuTLASS v0.1 makes microscaling practical for NVFP4 inference on Blackwell! Try it here: https://t.co/KjO04HAHyq or directly in vLLM here: https://t.co/vlRlRErAG0 QuTLASS is driven by @RobertoL_Castro with help from @black_samorez and the entire DASLab team!
github.com
Purpose This pull request brings in the QuTLASS library: https://github.com/iST-DASLab/qutlass QuTLASS is a high-performance library designed for low-precision kernel support in deep learning quant...
Performance: - Benchmarks on Qwen3-32B (RTX 5090) & Llama-3.1-70B (B200) showing end-to-end speedup - MXFP4 & NVFP4 kernels deliver near-optimal throughput - Up to 98% recovery vs FP16 on standard tasks (e.g. MMLU) - Experimental models at https://t.co/5KAG6VfDmC
Usability and quantization options: - Abs-Max Scaling - MSE / Quartet / Quest-like scaling - Multi-size rotations (16/32/64/128) for MXFP4 & NVFP4 - Seamless integration with vLLM (PR #24440), see below.
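A rough sketch contrasting the first two scaling options for a single FP4 block: plain abs-max versus a small grid search over shrunken scales that minimizes MSE (the actual Quartet/Quest-style criteria are more involved than this).

```python
import numpy as np

FP4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # E2M1 magnitudes
FP4 = np.concatenate([-FP4[::-1], FP4])

def dequant(block, scale):
    q = block / scale
    return FP4[np.abs(q[:, None] - FP4[None, :]).argmin(axis=1)] * scale

def absmax_scale(block):
    return np.abs(block).max() / 6.0

def mse_scale(block, shrink=np.linspace(0.5, 1.0, 32)):
    # search for the scale (a shrunken abs-max) that minimizes reconstruction MSE
    base = absmax_scale(block)
    errs = [np.mean((block - dequant(block, c * base)) ** 2) for c in shrink]
    return shrink[int(np.argmin(errs))] * base

block = np.random.randn(32)
for name, s in [("abs-max", absmax_scale(block)), ("MSE search", mse_scale(block))]:
    print(f"{name:>10}: MSE = {np.mean((block - dequant(block, s)) ** 2):.5f}")
```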
What's new in QuTLASS v0.1.0: - Support for NVIDIA B200 - NVFP4 microscaling with full W4A4 quantization - Online rotations: fused transform + quantization + scaling - Runtime-loaded rotation matrices (flexible transforms!)
Excited to announce QuTLASS v0.1.0! QuTLASS is a high-performance library for low-precision deep learning kernels, following NVIDIA CUTLASS. The new release brings 4-bit NVFP4 microscaling and fast transforms to NVIDIA Blackwell GPUs (including the B200!) [1/N]
The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm
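A minimal NumPy sketch of the GPTQ loop that this paper reinterprets as Babai's nearest-plane algorithm: quantize one column at a time and feed the error back onto the not-yet-quantized columns through the Cholesky factor of the inverse Hessian. Heavily simplified (no blocking, no grouping, one global scale).

```python
import numpy as np

def quant_rtn(w, scale):
    # symmetric 4-bit round-to-nearest
    return np.clip(np.round(w / scale), -8, 7) * scale

def gptq(W, X, scale):
    # W: (rows, cols) weights; X: (samples, cols) calibration inputs
    H = X.T @ X + 1e-2 * np.eye(W.shape[1])          # damped Hessian
    U = np.linalg.cholesky(np.linalg.inv(H)).T       # upper Cholesky factor of H^-1
    W = W.copy()
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = quant_rtn(W[:, j], scale)
        err = (W[:, j] - Q[:, j]) / U[j, j]
        # error feedback onto the remaining columns (the nearest-plane step)
        W[:, j + 1:] -= np.outer(err, U[j, j + 1:])
    return Q

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64))
W = rng.standard_normal((32, 64))
scale = np.abs(W).max() / 7
proxy = lambda Q: np.trace((W - Q) @ (X.T @ X) @ (W - Q).T)   # layer-wise proxy loss
print("RTN :", proxy(quant_rtn(W, scale)))
print("GPTQ:", proxy(gptq(W, X, scale)))   # GPTQ should come out noticeably lower
```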