Dan Alistarh Profile
Dan Alistarh

@DAlistarh

Followers: 1K · Following: 70 · Media: 33 · Statuses: 122

Professor at IST Austria

Vienna
Joined May 2022
Dan Alistarh @DAlistarh · 11 days
RT @_akhaliq: The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm
Dan Alistarh @DAlistarh · 20 days
RT @ESFoMo: @MatharyCharles @_albertgu And presenting a Best Poster award to "Unified Scaling Laws for Compressed Representations" by Andre….
Dan Alistarh @DAlistarh · 25 days
Contributors:
- QuTLASS is led by @RobertoL_Castro
- FP-Quant contributors: @black_samorez, @AshkboosSaleh, @_EldarKurtic, @mgoin_, as well as Denis Kuznedelev and Vage Egiazarian
Code and models:
github.com · IST-DASLab/qutlass: QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
Dan Alistarh @DAlistarh · 25 days
To create quantized models, we provide FP-Quant:
- A quantization harness supporting FP4, NVFP4, and MXFP formats with various tricks
- Key finding: GPTQ+Had (GPTQ with block-wise Hadamard matching microscaling) preserves accuracy best! (sketched below)
Sample results below:
[image]
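Not the FP-Quant code, just a minimal numpy sketch of the GPTQ+Had ingredient above: rotate each microscaling group with a Hadamard matrix to spread outliers, then quantize the group to an FP4 (E2M1) grid under a shared power-of-two scale, MXFP4-style. The group size of 32, the value grid, and plain round-to-nearest are assumptions for illustration; the real harness pairs the rotation with GPTQ rather than simple rounding.

```python
import numpy as np

# FP4 (E2M1) magnitudes; sign is handled separately. Grid assumed for illustration.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_group_mxfp4(x, grid=FP4_GRID):
    """Round one group to a shared power-of-two scale times the FP4 grid."""
    scale = 2.0 ** np.ceil(np.log2(np.max(np.abs(x)) / grid[-1] + 1e-12))
    idx = np.argmin(np.abs(np.abs(x)[:, None] / scale - grid[None, :]), axis=1)
    return np.sign(x) * grid[idx] * scale

def hadamard_quantize(w, group=32):
    """Rotate each contiguous group, quantize it, rotate back (H is orthonormal)."""
    H = hadamard(group)
    rot = w.reshape(-1, group) @ H.T               # block-wise Hadamard rotation
    q = np.apply_along_axis(quantize_group_mxfp4, 1, rot)
    return (q @ H).reshape(-1)                     # undo the rotation

# Toy comparison on heavy-tailed weights: rounding error with vs. without rotation
rng = np.random.default_rng(0)
w = rng.standard_t(df=3, size=4096)
rtn = np.apply_along_axis(quantize_group_mxfp4, 1, w.reshape(-1, 32)).reshape(-1)
print("plain group rounding error:", np.linalg.norm(w - rtn))
print("Hadamard + rounding error :", np.linalg.norm(w - hadamard_quantize(w)))
```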
Dan Alistarh @DAlistarh · 25 days
QuTLASS performance on RTX 5090:
📈 Consistent speedups across all batch sizes
📈 Peak ~4x faster than BF16 end-to-end (prefill)
📈 Optimized for both small (bs=1-32) and large batch sizes
Prefill results on Qwen3-8B:
[image]
Dan Alistarh @DAlistarh · 25 days
QuTLASS provides support for Blackwell's native microscaling formats for both inference and training.
Key features (more coming):
✅ W4A4 inference (reference sketch below)
✅ Fused kernels for Hadamard transforms + quantization
✅ Multiple scaling formats
✅ @huggingface Transformers integration
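The actual kernels live in QuTLASS (CUDA/CUTLASS). As a plain-Python reference for what a W4A4 microscaled matmul computes, here is a hedged sketch that quantizes both activations and weights group-wise along the reduction dimension and multiplies the dequantized values; symmetric INT4 stands in for the FP4/MXFP formats, and the group size of 32 is an assumption.

```python
import numpy as np

def quant_dequant_int4(x, group=32):
    """Symmetric 4-bit quantize/dequantize along the last axis, one scale per
    group of `group` values (a simplification of microscaling formats)."""
    g = x.reshape(*x.shape[:-1], -1, group)
    scale = np.max(np.abs(g), axis=-1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(g / scale), -7, 7)
    return (q * scale).reshape(x.shape)

def w4a4_matmul_reference(a, w, group=32):
    """Reference for a W4A4 GEMM: activations and weights are both quantized
    along the shared reduction dimension before the matrix multiply."""
    return quant_dequant_int4(a, group) @ quant_dequant_int4(w, group).T

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 256))     # activations: [tokens, hidden]
w = rng.standard_normal((128, 256))   # weights: [out_features, hidden]
err = np.linalg.norm(a @ w.T - w4a4_matmul_reference(a, w)) / np.linalg.norm(a @ w.T)
print("relative error of the W4A4 reference:", err)
```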
Dan Alistarh @DAlistarh · 25 days
Announcing our early work on FP4 inference for LLMs!
- QuTLASS: low-precision kernel support for Blackwell GPUs
- FP-Quant: a flexible quantization harness for Llama/Qwen
We reach 4x speedup vs BF16, with good accuracy through MXFP4 microscaling + fused Hadamard rotations.
Dan Alistarh @DAlistarh · 1 month
RT @_EldarKurtic: Our flagship paper on how far careful quantization can really go in practice got accepted as an oral at ACL 2025 (top 8%)….
Dan Alistarh @DAlistarh · 2 months
Kudos go to the authors: @RobertoL_Castro, @black_samorez, @JialeChenEdu, @rush_tabesh, @mmnnn76, @AshkboosSaleh
ArXiv: · Public code [WIP]: · HF Papers:
huggingface.co
Dan Alistarh @DAlistarh · 2 months
⚡ On RTX 5090, Quartet's highly optimized GPU kernels achieve up to:
- 2.4× speedup over FP8 and 4× over BF16 (forward)
- 1.6× speedup (backward)
Our new efficient MXFP4 implementation makes large-scale FP4 training practical!
Dan Alistarh @DAlistarh · 2 months
🛠 Quartet leverages:
- QuEST-based quantization-aware training (QAT) for minimal forward-pass error
- Unbiased stochastic rounding for stable backward-pass propagation (illustrated below)
Our analysis reveals novel low-precision scaling laws to predict optimal accuracy-speed tradeoffs.
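A minimal numpy illustration of the second bullet (not the Quartet kernels): unbiased stochastic rounding picks the upper or lower grid point with probability proportional to proximity, so the rounded value equals the input in expectation, which is what keeps quantized gradients unbiased on average.

```python
import numpy as np

def stochastic_round(x, step, rng):
    """Round x to multiples of `step`, choosing up/down at random so that
    E[stochastic_round(x)] == x, unlike deterministic round-to-nearest."""
    scaled = x / step
    low = np.floor(scaled)
    p_up = scaled - low                      # distance to the lower grid point
    return (low + (rng.random(x.shape) < p_up)) * step

rng = np.random.default_rng(0)
x = np.full(100_000, 0.3)                    # sits between grid points 0.0 and 0.5
print("round-to-nearest mean:", (np.round(x / 0.5) * 0.5).mean())       # 0.5, biased
print("stochastic round mean:", stochastic_round(x, 0.5, rng).mean())   # ~0.3, unbiased
```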
Dan Alistarh @DAlistarh · 2 months
We are introducing Quartet, a fully FP4-native training method for Large Language Models, achieving optimal accuracy-efficiency trade-offs on NVIDIA Blackwell GPUs! Quartet can be used to train billion-scale models in FP4 faster than FP8 or FP16, at matching accuracy. [1/4]
Dan Alistarh @DAlistarh · 3 months
RT @utkuevci: This is something I've been working on with some amazing collaborators for a while. Model-software-hardware co-design. Making….
Dan Alistarh @DAlistarh · 4 months
RT @tjingrant: 📣 The Journey Matters: Our #ICLR2025 paper shows how to pretrain sparse LLMs with half the size of dense LLMs while maintain….
Dan Alistarh @DAlistarh · 4 months
This can be implemented efficiently by tweaking Rotary Positional Embeddings (RoPE); it leads to better hardware utilization and interesting synchronization trade-offs!
Paper draft: · Code: · Example:
[image]
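The Hogwild-specific cache layout is in the paper draft; the RoPE property it builds on can be checked in a few lines: because RoPE encodes position as a rotation, a cached key computed at position p can be reused at position p + delta by applying one extra rotation by delta, with no recomputation. The head dimension and base frequency below are common defaults, assumed here.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to a single head vector at position `pos`."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    ang = pos * inv_freq
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

rng = np.random.default_rng(0)
k = rng.standard_normal(64)               # one key head, dim 64
p, delta = 10, 7
shifted = rope(rope(k, p), delta)         # re-position an already-rotated cached key
direct = rope(k, p + delta)               # rotate the raw key at the new position
print("max difference:", np.max(np.abs(shifted - direct)))   # ~1e-15: rotations compose
```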
Dan Alistarh @DAlistarh · 4 months
Can LLMs "reason" to solve a problem together? Yes!
We introduce Hogwild! Inference, enabling parallel LLM generation via concurrent attention: multiple LLMs work on the same attention cache, seeing each other's progress in real time and solving tasks collaboratively. [1/3]
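Not the Hogwild! implementation, just a toy numpy sketch of the concurrent-attention idea: two workers interleave decode steps and read from and append to a single shared KV cache, so each sees the other's tokens immediately. The state update and the toy keys/values are placeholders; a real run would use a transformer and the cache layout from the paper.

```python
import numpy as np

D = 16                                    # toy head dimension
shared_keys, shared_values = [], []       # one KV cache shared by all workers

def attend(q, keys, values):
    """Softmax attention of one query over the whole shared cache."""
    scores = keys @ q / np.sqrt(D)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ values

def worker_step(worker_id, state):
    """One toy decode step: attend over everything generated so far by *all*
    workers, then append this worker's new key/value to the shared cache."""
    if shared_keys:
        ctx = attend(state, np.stack(shared_keys), np.stack(shared_values))
        state = 0.5 * state + 0.5 * ctx            # stand-in for a transformer block
    shared_keys.append(state + 0.01 * worker_id)   # toy key for the new token
    shared_values.append(state.copy())             # toy value for the new token
    return state

rng = np.random.default_rng(0)
states = [rng.standard_normal(D), rng.standard_normal(D)]   # two collaborating workers
for step in range(8):                      # round-robin, i.e. interleaved generation
    for wid in range(2):
        states[wid] = worker_step(wid, states[wid])
print("tokens in the shared cache:", len(shared_keys))       # 16, from both workers
```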
Dan Alistarh @DAlistarh · 4 months
Thanks to all the contributors: @_EldarKurtic, @JialeChenEdu, @mgoin_, and Denis Kuznedelev. And a sneak preview of ongoing work:
[image]
Dan Alistarh @DAlistarh · 4 months
Code:
Models @huggingface:
vLLM inference results:
[image]
Dan Alistarh @DAlistarh · 4 months
Introducing MoE-Quant, a fast version of GPTQ for MoEs (core update sketched below), with:
* Optimized Triton kernels and expert & data parallelism
* Quantizes the 671B DeepSeek-V3/R1 models in 2 hours on 8xH100
* ~99% accuracy recovery for 4-bit R1 on *reasoning* tasks, and 100% recovery on leaderboards
[1/3]
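The Triton kernels and expert/data parallelism are MoE-Quant's contribution; the GPTQ update underneath can be sketched on a dense layer in numpy: quantize one weight column at a time and push its rounding error onto the not-yet-quantized columns through the Cholesky factor of the inverse Hessian of the calibration activations. The 4-bit symmetric grid, the damping value, and the sizes are assumptions.

```python
import numpy as np

def gptq_quantize(W, X, damp=0.01):
    """Minimal GPTQ-style pass. W: [out, in] weights, X: [samples, in] calibration
    activations. Columns are quantized left to right; each column's rounding error
    is compensated on the remaining columns via the inverse-Hessian factor."""
    W = W.copy()
    d = W.shape[1]
    H = X.T @ X                                    # proxy Hessian of the layer loss
    H += damp * np.mean(np.diag(H)) * np.eye(d)    # damping for numerical stability
    U = np.linalg.cholesky(np.linalg.inv(H)).T     # upper factor, inv(H) = U.T @ U
    scale = np.max(np.abs(W), axis=1, keepdims=True) / 7.0   # per-row 4-bit grid
    Q = np.zeros_like(W)
    for i in range(d):
        q = np.clip(np.round(W[:, i:i+1] / scale), -7, 7) * scale
        Q[:, i:i+1] = q
        err = (W[:, i:i+1] - q) / U[i, i]
        W[:, i:] -= err @ U[i:i+1, i:]             # spread the error onto later columns
    return Q

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 64)) @ rng.standard_normal((64, 64)) / 8.0  # correlated features
W = rng.standard_normal((32, 64))                  # layer weights
s = np.max(np.abs(W), axis=1, keepdims=True) / 7.0
rtn = np.clip(np.round(W / s), -7, 7) * s          # plain round-to-nearest baseline
print("RTN  output error:", np.linalg.norm(X @ (W - rtn).T))
print("GPTQ output error:", np.linalg.norm(X @ (W - gptq_quantize(W, X)).T))
```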
Dan Alistarh @DAlistarh · 4 months
Credit goes to the authors: @black_samorez, @JialeChenEdu, @rush_tabesh, @mmnnn76, @RobertoL_Castro. Trained model samples available on HuggingFace:
huggingface.co