Elias Frantar

@elias_frantar

Followers 488 · Following 845 · Media 26 · Statuses 76

Researcher @OpenAI | prev. PhD @ISTAustria and intern @GoogleDeepmind | I also build super fast Lego Rubik's Cube robots.

San Francisco, CA
Joined February 2015
@elias_frantar
Elias Frantar
2 years
Excited to share our work "Scaling Laws for Sparsely-Connected Foundation Models", where we develop the first scaling laws for (fine-grained) parameter sparsity in the context of modern Transformers trained on massive datasets. 1/10.
3
26
124
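A minimal sketch of what fitting such a joint sparsity/size/data scaling law can look like. The functional form, coefficient values, and synthetic data below are illustrative placeholders, not the paper's actual law or results:

```python
# Illustrative only: a hypothetical joint loss law over sparsity S, non-zero
# parameters N, and training tokens D, fit to synthetic "training run" results.
# Neither the functional form nor the numbers are taken from the paper.
import numpy as np
from scipy.optimize import curve_fit

def loss_model(X, a_s, b_s, c_s, b_n, a_d, b_d, c):
    S, N, D = X
    capacity = (a_s * (1.0 - S) ** b_s + c_s) * N ** (-b_n)  # sparsity-modulated size term
    data = a_d * D ** (-b_d)                                 # data-limited term
    return capacity + data + c                               # plus an irreducible offset

# Synthetic stand-ins for (sparsity, non-zero params, tokens) -> loss measurements.
S, N, D = (g.ravel() for g in np.meshgrid([0.0, 0.5, 0.75, 0.875],
                                          [1e8, 3e8, 1e9], [1e10, 3e10]))
true_params = (20.0, 0.5, 5.0, 0.2, 500.0, 0.3, 1.0)
y = loss_model((S, N, D), *true_params)

fit, _ = curve_fit(loss_model, (S, N, D), y,
                   p0=[10.0, 1.0, 1.0, 0.25, 100.0, 0.25, 0.5], maxfev=50000)
print(np.round(fit, 3))  # fitted coefficients; ideally close to true_params on this noiseless toy data
```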
@elias_frantar
Elias Frantar
11 months
RT @DAlistarh: Happy to release the write-up on the MARLIN kernel for fast LLM inference, now supporting 2:4 sparsity! Led by @elias_fran…
0
22
0
@elias_frantar
Elias Frantar
1 year
RT @efxmarty: AutoGPTQ 0.7.0 is released and includes @elias_frantar's Marlin kernel for int4*fp16 matrix multiplication on Ampere GPUs. Ch….
0
8
0
@elias_frantar
Elias Frantar
2 years
@DAlistarh and I hope that Marlin will help unlock the full potential of 4-bit inference for open-source models, now also in settings that require batch sizes significantly larger than 1!
1
0
7
@elias_frantar
Elias Frantar
2 years
Marlin achieves its speed by simultaneously saturating global/L2/shared/tensor/vector performance via careful organization of computations, several levels of software pipelining, ideal memory access patterns, and generally extremely optimized (down to the assembly level) code. 5/N
1
0
10
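To make the "software pipelining" point above concrete, here is a toy double-buffering loop. It is plain Python/NumPy and only mirrors the structure (prefetch the next tile while the current one is consumed); the real kernel does this with asynchronous global-to-shared copies overlapped with tensor-core math, which this sketch does not model:

```python
import numpy as np

def load_tile(weights, t, tile):        # stands in for an (asynchronous) global-memory load
    return weights[t * tile:(t + 1) * tile, :]

def compute_tile(acc, x_tile, w_tile):  # stands in for the tensor-core MMA stage
    return acc + x_tile @ w_tile

x = np.random.randn(16, 256).astype(np.float32)   # small activation batch
w = np.random.randn(256, 256).astype(np.float32)  # weight matrix, tiled along K
tile, num_tiles = 64, 256 // 64
acc = np.zeros((16, 256), dtype=np.float32)

buffers = [load_tile(w, 0, tile), None]           # prologue: fetch the first tile
for t in range(num_tiles):
    if t + 1 < num_tiles:
        buffers[(t + 1) % 2] = load_tile(w, t + 1, tile)  # prefetch the next tile...
    acc = compute_tile(acc, x[:, t * tile:(t + 1) * tile], buffers[t % 2])  # ...while using the current one

assert np.allclose(acc, x @ w, atol=1e-2)         # tiled result matches the full matmul
```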
@elias_frantar
Elias Frantar
2 years
Further, we demonstrate how to create accurate Marlin-compatible models using an improved version of GPTQ, with better grid clipping and non-uniform calibration sample lengths (WikiText2 and RedPajama perplexity plus 0-shot MMLU shown below). 4/N
Tweet media one
1
0
8
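A rough sketch of the "grid clipping" idea mentioned above: pick the clipping ratio for each weight group by searching a small grid and keeping the one with the lowest reconstruction error. This is a simplified round-to-nearest stand-in, not the actual improved GPTQ procedure (which interleaves clipping with its error-compensation updates):

```python
# Minimal grid-searched clipping for 4-bit quantization of one weight group.
import numpy as np

def quantize_group(w, bits=4, grid=np.linspace(0.5, 1.0, 21)):
    qmax = 2 ** (bits - 1) - 1                    # symmetric int4 range: [-8, 7]
    best, best_err = None, np.inf
    for clip in grid:                             # try progressively tighter clipping ratios
        scale = clip * np.abs(w).max() / qmax
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        err = np.sum((q * scale - w) ** 2)        # reconstruction error for this clip ratio
        if err < best_err:
            best, best_err = (q.astype(np.int8), scale), err
    return best

w = np.random.randn(128).astype(np.float32)       # one quantization group
q, scale = quantize_group(w)
print("rel. error:", np.linalg.norm(q * scale - w) / np.linalg.norm(w))
```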
@elias_frantar
Elias Frantar
2 years
We also test performance that is sustainable over longer periods of time, thus forcing base clocks. Interestingly, we find that existing kernels lose significant performance if clocks must be lowered from boost due to continuous load; meanwhile, Marlin remains almost optimal. 3/N
Tweet media one
1
0
8
@elias_frantar
Elias Frantar
2 years
Due to its "striped" partitioning scheme, which evenly balances (not necessarily consecutive) tiles across all SMs, Marlin delivers strong performance across most matrix shapes of popular models and GPUs. 2/N
Tweet media one
1
0
8
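A toy illustration of the load-balancing intuition behind a "striped" partitioning: handing out output tiles round-robin means every SM ends up with nearly the same number of tiles even for awkward matrix shapes. This shows only the general idea, not Marlin's actual tile-assignment logic:

```python
# Round-robin ("striped") assignment of output tiles to SMs.
num_sms = 108                      # e.g. an A100-class GPU
tiles_m, tiles_n = 3, 50           # output tile grid for some matrix shape
tiles = [(i, j) for i in range(tiles_m) for j in range(tiles_n)]

assignment = {sm: [] for sm in range(num_sms)}
for t, tile in enumerate(tiles):
    assignment[t % num_sms].append(tile)   # consecutive tiles land on different SMs

loads = [len(v) for v in assignment.values()]
print("tiles per SM: min", min(loads), "max", max(loads))  # differs by at most 1
```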
@elias_frantar
Elias Frantar
2 years
Happy to release Marlin, a 4-bit × 16-bit linear kernel for LLM inference with near-ideal (4x) speedup up to batch sizes of 16-32 tokens (4-8x larger than prior work), aimed at larger-scale serving, speculative decoding, and multi-way inference schemes. 1/N.
Tweet media one
8
56
255
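A back-of-the-envelope model of why ~4x is the ideal speedup and why it can persist up to batch sizes around 16-32: at small batch the layer is dominated by reading the weights, so shrinking them from 16 to 4 bits cuts traffic roughly 4x, until the batch grows large enough for tensor-core math to become the bottleneck. The peak-throughput numbers below are rough, illustrative values for a large data-center GPU, not measurements:

```python
k, n = 8192, 8192                 # example weight matrix shape (K x N)
peak_flops = 300e12               # rough fp16 tensor-core peak, FLOP/s (illustrative)
peak_bw = 2e12                    # rough HBM bandwidth, bytes/s (illustrative)

for batch in (1, 16, 32, 64, 128):
    flops = 2 * batch * k * n                                # matmul FLOPs
    for bits, name in ((16, "fp16"), (4, "int4")):
        traffic = k * n * bits / 8 + 2 * batch * (k + n)     # weight bytes + fp16 activations
        t_mem, t_math = traffic / peak_bw, flops / peak_flops
        bound = "memory" if t_mem > t_math else "compute"
        print(f"batch {batch:3d} {name}: {bound}-bound, ~{max(t_mem, t_math) * 1e6:.1f} us")
```

In this toy model the int4 layer stays memory-bound, and hence roughly 4x faster than fp16, up to about batch 32, after which the gap shrinks.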
@elias_frantar
Elias Frantar
2 years
RT @DAlistarh: Happy to release QUIK, a new accurate post-training quantization method which processes the majority of weights and activati….
0
37
0
@elias_frantar
Elias Frantar
2 years
We hope that QMoE will make deployment of, and research with, massive MoEs cheaper and more accessible. Work done together with @DAlistarh at @ISTAustria! 8/8
0
0
1
@elias_frantar
Elias Frantar
2 years
With QMoE compression and kernels, we can perform full end-to-end inference of the 1.6-trillion-parameter SwitchTransformer-c2048 on 4x A6000 or 8x 3090 GPUs, at < 5% overhead relative to (ideal) uncompressed execution, which would require ~20x more GPUs. 7/8
Tweet media one
1
0
1
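The GPU counts above follow from simple memory arithmetic (assuming 16-bit uncompressed weights and the standard 48 GB / 24 GB card capacities; runtime buffers and activations are ignored here):

```python
params = 1.6e12                      # SwitchTransformer-c2048 parameter count
fp16_bytes = params * 2              # ~3.2 TB of weights uncompressed (16-bit)
qmoe_bytes = params * 0.8 / 8        # ~160 GB at 0.8 bits per parameter

print(f"uncompressed: {fp16_bytes / 1e12:.1f} TB")
print(f"QMoE:         {qmoe_bytes / 1e9:.0f} GB (~{fp16_bytes / qmoe_bytes:.0f}x smaller)")
print(f"available:    4 x 48 GB = {4 * 48} GB, or 8 x 24 GB = {8 * 24} GB")
```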
@elias_frantar
Elias Frantar
2 years
Doing this in a way that allows fast on-the-fly decoding during GPU inference requires very careful co-design of a custom compression format and corresponding bespoke GPU kernels, to efficiently handle various issues introduced by variable-length codes. 6/8
Tweet media one
1
0
1
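A toy example of the core difficulty with variable-length codes that the tweet alludes to: symbol boundaries depend on everything decoded before them, so a GPU-friendly format needs fixed entry points (plus fast in-block decoding) to parallelize. The tiny hand-rolled prefix code below is only for illustration, not QMoE's actual format:

```python
# Sequential decoding of a variable-length (prefix) code over ternary weights.
codes = {0: "0", 1: "10", -1: "11"}            # shortest code for the most common value
decode = {v: k for k, v in codes.items()}

weights = [0, 0, 1, 0, -1, 0, 0, 0, 1]
bitstream = "".join(codes[w] for w in weights)

out, buf = [], ""
for bit in bitstream:                          # each symbol's start position depends on
    buf += bit                                 # the lengths of all previous symbols
    if buf in decode:
        out.append(decode[buf])
        buf = ""
assert out == weights
```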
@elias_frantar
Elias Frantar
2 years
Such highly compressed models also exhibit high natural sparsity, and correspondingly low entropy, which we can exploit to push compression rates even further, to < 1 bit per parameter. 5/8
Tweet media one
1
0
1
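To see why sub-1-bit rates are possible in principle: a mostly-zero ternary weight distribution carries well under one bit of entropy per parameter. The 90% zero fraction below is an assumed example, not the paper's measured sparsity:

```python
# Entropy of an assumed mostly-zero ternary weight distribution.
import math

p = {0: 0.90, 1: 0.05, -1: 0.05}
entropy = -sum(q * math.log2(q) for q in p.values())
print(f"{entropy:.2f} bits/parameter")   # ~0.57 bits for this distribution
```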
@elias_frantar
Elias Frantar
2 years
With the help of this framework, we notice that such massive models can actually be compressed significantly further than standard dense models, to 2-bit or even ternary precision, at only a small accuracy loss. 4/8
Tweet media one
1
0
1
@elias_frantar
Elias Frantar
2 years
Applying accurate compression methods like GPTQ to trillion-parameter MoEs is challenging and requires a variety of systems optimizations around memory management, compute utilization and (numerical) robustness. 3/8
Tweet media one
1
0
1
@elias_frantar
Elias Frantar
2 years
Mixture-of-Experts (MoE) models are significantly faster than dense models of the same accuracy. However, they are also much larger, which limits their practicality. We address this challenge via QMoE with sub-1-bit compression. 2/8.
1
0
2
@elias_frantar
Elias Frantar
2 years
Excited to announce QMoE, the first framework able to accurately compress a 1.6-trillion-parameter model by 20x, to 0.8 bits per parameter, so it can run on 4 GPUs at close to no overhead over uncompressed inference. Paper: GitHub: 🧵1/8.
3
25
111
@elias_frantar
Elias Frantar
2 years
RT @mgoin_: Exciting news from our latest LLM compression research! 🚀 Together with @ISTAustria and @neuralmagic, we’ve been exploring spar….
0
40
0
@elias_frantar
Elias Frantar
2 years
This paper is a result of my internship at Google DeepMind and is joint work with @rikelhood @neilhoulsby @DAlistarh and @utkuevci.
0
0
7
@elias_frantar
Elias Frantar
2 years
If a highly accurate pretrained model is available, then using it to bootstrap sparsification is significantly more efficient than starting from scratch; otherwise, this strategy is a lot slower. 10/10
Tweet media one
1
0
5