Elias Frantar

@elias_frantar

Followers: 509 · Following: 1K · Media: 26 · Statuses: 77

Researcher @OpenAI | prev. PhD @ISTAustria and intern @GoogleDeepmind | I also build super fast Lego Rubik's Cube robots.

San Francisco, CA
Joined February 2015
@elias_frantar
Elias Frantar
2 years
Excited to share our work "Scaling Laws for Sparsely-Connected Foundation Models" ( https://t.co/KGJ3kGpL1n) where we develop the first scaling laws for (fine-grained) parameter-sparsity in the context of modern Transformers trained on massive datasets. 1/10
arxiv.org
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains. In this setting,...
3
25
124
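For context on the form such a law takes: the loss is modeled jointly as a function of weight sparsity S, the number of non-zero parameters N, and the amount of training data D. A schematic of that kind of joint parametric fit (the symbols and exact parameterization below are illustrative; see the paper for the fitted law and coefficients):

% Schematic sparsity scaling law: validation loss vs. sparsity S, non-zero
% parameters N, and training data D; a_S, b_S, c_S, b_N, a_D, b_D, c are fitted constants.
L(S, N, D) = \left( a_S (1 - S)^{b_S} + c_S \right) \cdot \left( \frac{1}{N} \right)^{b_N} + \left( \frac{a_D}{D} \right)^{b_D} + c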
@ELLISforEurope
ELLIS
4 days
๐Ÿ‘ Give a big round of applause to our 2025 PhD Award Winners! The two main winners are: @ZhijingJin & @maksym_andr. Two runners-up were selected additionally: @SiweiZhang13 & @elias_frantar Learn even more about each outstanding scientist: https://t.co/3Pry7NnAYn
2
5
43
@DAlistarh
Dan Alistarh @ NeurIPS
1 year
Happy to release the write-up on the MARLIN kernel for fast LLM inference, now supporting 2:4 sparsity! Led by @elias_frantar & @RobertoL_Castro Paper: https://t.co/lT6EtMoyEY Code: https://t.co/r58fIm8zWB MARLIN is integrated with @vllm_project thanks to @neuralmagic!
github.com
Boosting 4-bit inference kernels with 2:4 Sparsity - IST-DASLab/Sparse-Marlin
3
22
73
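For readers unfamiliar with 2:4 ("two-out-of-four") sparsity: in every aligned group of four weights, at most two may be non-zero, a pattern NVIDIA sparse tensor cores can exploit. A minimal NumPy sketch of magnitude-based 2:4 pruning (function name and code are illustrative only; this is not Sparse-Marlin's pruning or storage format):

import numpy as np

def prune_2_of_4(w: np.ndarray) -> np.ndarray:
    """Zero out the 2 smallest-magnitude weights in every aligned group of 4 (last dim)."""
    assert w.shape[-1] % 4 == 0
    groups = w.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # 2 smallest |w| per group
    mask = np.ones_like(groups)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.random.randn(4, 8)
print(prune_2_of_4(w))  # each aligned group of 4 now has at most 2 non-zeros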
@efxmarty
efxmarty
2 years
AutoGPTQ 0.7.0 is released and includes @elias_frantar's Marlin kernel for int4*fp16 matrix multiplication on Ampere GPUs. Check out https://t.co/UAwJCfaGp8 - This is usable with any int4 quantized Transformers model (symmetric quantization, no act-order) directly from the Hub! 🧵
github.com
Marlin: efficient int4*fp16 kernel on Ampere GPUs, AWQ checkpoint loading. @efrantar, GPTQ author, released Marlin, an optimized CUDA kernel for Ampere GPUs for int4*fp16 matrix multiplication, with...
1
8
24
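A rough usage sketch of what loading a 4-bit GPTQ checkpoint with the Marlin backend might look like. The use_marlin flag and the example checkpoint name are assumptions inferred from this announcement, not verified against the 0.7.0 API, so treat this as a sketch and check the AutoGPTQ release notes:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Example checkpoint: any int4, symmetric, no-act-order GPTQ model from the Hub.
model_id = "TheBloke/Llama-2-7B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# use_marlin=True is assumed from the release announcement; requires an Ampere or newer GPU.
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0", use_marlin=True)

inputs = tokenizer("Marlin makes int4 inference", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs)[0]))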
@elias_frantar
Elias Frantar
2 years
@DAlistarh and I hope that Marlin will help to unlock the full potential of 4-bit inference for open-source models, now also in settings that require batch sizes significantly larger than 1!
1
0
7
@elias_frantar
Elias Frantar
2 years
Marlin achieves its speed by simultaneously saturating global/L2/shared/tensor/vector performance via careful organization of computations, several levels of software-pipelining, ideal memory access patterns and generally extremely optimized (down to assembly level) code. 5/N
1
0
10
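To give a flavor of the software pipelining mentioned above: the kernel overlaps fetching the next tile of data with computing on the current one (double buffering). A toy NumPy analogue of that pattern, purely illustrative and far removed from the actual CUDA implementation:

import numpy as np

def tiled_matmul_double_buffered(a, b, tile=64):
    """Toy double-buffering analogue: 'prefetch' the next K-tile while computing
    with the current one. On the GPU the prefetch is an asynchronous
    global->shared-memory copy overlapped with tensor-core math."""
    m, k = a.shape
    _, n = b.shape
    out = np.zeros((m, n), dtype=a.dtype)
    cur = (a[:, :tile].copy(), b[:tile, :].copy())      # stage 0: prefetch first tile
    for t in range(0, k, tile):
        nxt = None
        if t + tile < k:                                # issue prefetch of the next tile
            nxt = (a[:, t + tile:t + 2 * tile].copy(), b[t + tile:t + 2 * tile, :].copy())
        out += cur[0] @ cur[1]                          # compute while the copy is "in flight"
        cur = nxt
    return out

a, b = np.random.randn(128, 256), np.random.randn(256, 64)
assert np.allclose(tiled_matmul_double_buffered(a, b), a @ b)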
@elias_frantar
Elias Frantar
2 years
Further, we demonstrate how to create accurate Marlin-compatible models using an improved version of GPTQ, with better grid clipping and non-uniform calibration sample lengths (WikiText2 and RedPajama perplexity, and 0-shot MMLU, shown below). 4/N
1
0
8
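Grid clipping here refers to searching over clipping factors for the quantization grid's range, rather than always spanning the full weight range, and keeping whichever minimizes reconstruction error. A minimal per-tensor sketch of that idea (illustrative; not the improved GPTQ procedure itself):

import numpy as np

def best_clipped_scale(w, bits=4, factors=np.linspace(0.5, 1.0, 21)):
    """Try clip factors c*max|w| for a symmetric `bits`-bit grid and keep the
    scale with the smallest squared round-to-nearest error."""
    qmax = 2 ** (bits - 1) - 1
    best_err, best_scale = np.inf, None
    for c in factors:
        scale = c * np.abs(w).max() / qmax
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        err = np.sum((q * scale - w) ** 2)
        if err < best_err:
            best_err, best_scale = err, scale
    return best_scale

w = np.random.randn(4096)
print(best_clipped_scale(w))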
@elias_frantar
Elias Frantar
2 years
We also test performance sustained over longer periods of time, which forces the GPU down to base clocks. Interestingly, we find that existing kernels lose significant performance when clocks must drop from boost levels under continuous load, while Marlin remains almost optimal. 3/N
1
0
8
@elias_frantar
Elias Frantar
2 years
Due to its "striped" partitioning scheme, which evenly balances (not necessarily consecutive) tiles across all SMs, Marlin delivers strong performance across most matrix shapes of popular models and GPUs. 2/N
1
0
8
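A small illustration of the scheduling idea: rather than handing each SM one contiguous chunk of tiles (where a naive split can leave one SM with far more work), tiles are dealt out in a striped, round-robin fashion so per-SM work differs by at most one tile. This is a hypothetical sketch of the balancing effect, not Marlin's actual partitioning:

def naive_partition(num_tiles, num_sms):
    """Naive contiguous split: each SM takes floor(T/S) tiles, the last SM the remainder."""
    base = num_tiles // num_sms
    parts = [list(range(sm * base, (sm + 1) * base)) for sm in range(num_sms)]
    parts[-1].extend(range(num_sms * base, num_tiles))
    return parts

def striped_partition(num_tiles, num_sms):
    """Striped split: deal tiles round-robin, so loads differ by at most one tile."""
    return [list(range(sm, num_tiles, num_sms)) for sm in range(num_sms)]

tiles, sms = 1000, 108                                  # e.g. an A100 has 108 SMs
print(max(map(len, naive_partition(tiles, sms))))       # 37 tiles on the most-loaded SM
print(max(map(len, striped_partition(tiles, sms))))     # 10 tiles on the most-loaded SM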
@elias_frantar
Elias Frantar
2 years
Happy to release Marlin, a 4-bit×16-bit linear kernel for LLM inference with near-ideal (4x) speedup up to batch sizes of 16-32 tokens (4-8x larger than prior work), aimed at larger-scale serving, speculative decoding, and multi-way inference schemes. 1/N https://t.co/Ro4oBWThZB
8
56
254
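Back-of-the-envelope reasoning behind the near-ideal 4x figure: at small batch sizes a linear layer is memory-bound, so runtime scales with the bytes of weights read, and int4 weights are 4x smaller than fp16; once the batch grows large enough, compute dominates and the advantage shrinks. A rough roofline-style estimate with illustrative (A100-ballpark, not measured) numbers:

def est_time_ms(batch, d_in=8192, d_out=8192, w_bits=16,
                mem_bw=1.5e12, peak_flops=150e12):
    """Crude roofline model for one linear layer: time is the max of
    weight-load time (bytes / bandwidth) and math time (FLOPs / peak)."""
    mem_time = d_in * d_out * w_bits / 8 / mem_bw
    compute_time = 2 * batch * d_in * d_out / peak_flops
    return 1e3 * max(mem_time, compute_time)

for batch in (1, 16, 64, 256):
    speedup = est_time_ms(batch, w_bits=16) / est_time_ms(batch, w_bits=4)
    print(f"batch {batch:4d}: estimated int4-weight speedup ~{speedup:.1f}x")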
@DAlistarh
Dan Alistarh @ NeurIPS
2 years
Happy to release QUIK, a new accurate post-training quantization method which processes the majority of weights and activations using 4-bit precision. [1/N] With @AshkboosSaleh @elias_frantar @thoefler Paper: https://t.co/ErRgJ9Wnqw Code: https://t.co/WuAzZn3ugX Snapshot:
7
37
158
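For intuition, a minimal symmetric round-to-nearest 4-bit quantize/dequantize sketch; QUIK's actual method (outlier handling, GPTQ-style error correction, fused kernels) is considerably more involved than this:

import numpy as np

def quantize_sym(x, bits=4):
    """Per-row symmetric round-to-nearest quantization to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                          # 7 for int4
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(4, 4096).astype(np.float32)
q, s = quantize_sym(x)
print("mean abs error:", np.abs(dequantize(q, s) - x).mean())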
@elias_frantar
Elias Frantar
2 years
We hope that QMoE will make deployment of and research with massive MoEs cheaper and more accessible. Work done together with @DAlistarh at @ISTAustria! 8/8
0
0
1
@elias_frantar
Elias Frantar
2 years
With QMoE compression and kernels we can perform full end-to-end inference of the 1.6 trillion parameter SwitchTransformer-c2048 on 4x A6000 or 8x 3090 GPUs, at < 5% overhead relative to (ideal) uncompressed execution, which would require ~20x more GPUs. 7/8
1
0
1
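The GPU counts follow from simple weight-memory arithmetic; a quick sanity check (48 GB per A6000, 24 GB per 3090; activations and runtime buffers ignored):

PARAMS = 1.6e12                                   # SwitchTransformer-c2048
GB = 1e9

compressed_gb   = PARAMS * 0.8 / 8 / GB           # ~0.8 bits/param  -> ~160 GB
uncompressed_gb = PARAMS * 16  / 8 / GB           # bf16 baseline    -> ~3200 GB

print(f"compressed:   {compressed_gb:.0f} GB -> fits in 4x A6000 (192 GB) or 8x 3090 (192 GB)")
print(f"uncompressed: {uncompressed_gb:.0f} GB -> roughly {uncompressed_gb / 48:.0f} A6000s for the weights alone")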
@elias_frantar
Elias Frantar
2 years
Doing this in a way that allows fast on-the-fly decoding during GPU inference requires very careful co-design of a custom compression format and corresponding bespoke GPU kernels, to efficiently handle various issues introduced by variable-length codes. 6/8
1
0
1
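To illustrate why variable-length codes complicate GPU decoding, here is a toy encoder/decoder for sparse ternary weights (run of zeros plus a sign). This is purely illustrative and not QMoE's actual format, but it shows the core problem: an element's position in the encoded stream depends on everything before it, so there is no fixed mapping from weight index to bit offset, which the bespoke format and kernels must work around:

def encode(ternary_row):
    """Encode a ternary (+1/0/-1) row as (zeros_before, sign) pairs plus a trailing-zero count."""
    pairs, run = [], 0
    for v in ternary_row:
        if v == 0:
            run += 1
        else:
            pairs.append((run, 1 if v > 0 else -1))
            run = 0
    return pairs, run

def decode(pairs, trailing_zeros):
    row = []
    for zeros, sign in pairs:
        row.extend([0] * zeros + [sign])
    return row + [0] * trailing_zeros

row = [0, 0, 1, 0, 0, 0, -1, 0]
assert decode(*encode(row)) == row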
@elias_frantar
Elias Frantar
2 years
Such highly compressed models also exhibit high natural sparsity, and correspondingly low entropy, which we can exploit to push compression rates even further, to < 1 bit per parameter. 5/8
1
0
1
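A quick worked example of why high sparsity implies sub-1-bit entropy: if, say, roughly 90% of quantized weights are zero and the remainder splits evenly between +1 and -1 (numbers chosen for illustration, not taken from the paper), the per-weight entropy is well below one bit:

from math import log2

p = {0: 0.90, +1: 0.05, -1: 0.05}                 # illustrative ternary distribution
entropy = -sum(q * log2(q) for q in p.values())
print(f"{entropy:.2f} bits per weight")           # ~0.57 bits -> sub-1-bit coding is possible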
@elias_frantar
Elias Frantar
2 years
With the help of this framework, we notice that such massive models can actually be compressed significantly further than standard dense models, to 2-bit or even ternary precision, at only a small accuracy loss. 4/8
1
0
1
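For intuition about what ternary precision means (each weight becomes -1, 0, or +1 times a per-row scale), here is a generic threshold-based ternarization sketch; the paper's actual compression uses a data-dependent GPTQ variant, not this simple rounding:

import numpy as np

def ternarize(w, threshold=0.7):
    """Map each weight to {-s, 0, +s} with a per-row scale s; `threshold` is a
    fraction of the mean |w| controlling how many weights become zero."""
    t = threshold * np.abs(w).mean(axis=-1, keepdims=True)
    q = np.sign(w) * (np.abs(w) > t)                         # values in {-1, 0, +1}
    nonzero = np.maximum((np.abs(q) > 0).sum(axis=-1, keepdims=True), 1)
    scale = (np.abs(w) * np.abs(q)).sum(axis=-1, keepdims=True) / nonzero
    return q, scale

w = np.random.randn(2, 4096)
q, s = ternarize(w)
print("mean reconstruction error:", np.abs(q * s - w).mean())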
@elias_frantar
Elias Frantar
2 years
Applying accurate compression methods like GPTQ to trillion-parameter MoEs is challenging and requires a variety of systems optimizations around memory management, compute utilization and (numerical) robustness. 3/8
1
0
1
@elias_frantar
Elias Frantar
2 years
Mixture-of-Experts (MoE) models are significantly faster than dense models of the same accuracy. However, they are also much larger, which limits their practicality. We address this challenge via QMoE with sub-1-bit compression. 2/8
1
0
2
@elias_frantar
Elias Frantar
2 years
Excited to announce QMoE, the first framework able to accurately compress a 1.6 trillion parameter model, by 20x, to 0.8 bits per parameter, to run on 4 GPUs at close to no overhead over uncompressed inference. Paper: https://t.co/moTdmYsQDz GitHub: https://t.co/7WeAwDHHSy 🧵 1/8
github.com
Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models". - IST-DASLab/qmoe
3
25
111
@mgoin_
Michael Goin
2 years
Exciting news from our latest LLM compression research! 🚀 Together with @ISTAustria and @neuralmagic, we've been exploring sparse finetuning for LLMs and achieved 7.7 tokens/second on a single core and 26.7 tokens/second on 4 cores of an AMD Ryzen CPU! (1/n)
5
40
146
@elias_frantar
Elias Frantar
2 years
This paper is a result of my internship at Google DeepMind and is joint work with @rikelhood @neilhoulsby @DAlistarh and @utkuevci.
0
0
7