Daniel Haziza Profile
Daniel Haziza

@d_haziza

Followers: 1K
Following: 188
Media: 49
Statuses: 128

Research Engineer at Facebook AI Research. Working on xFormers, to make GPUs go Brrrr. https://t.co/RMpoQat9Wj

Paris
Joined October 2022
@d_haziza
Daniel Haziza
6 months
We provide more details and ablations in the paper, with performance measurements demonstrating the feasibility of efficient kernels for both training and inference with 2:4 sparsity! https://t.co/eTbZFSbFOU @CaiJesse will be presenting this paper at the SLLM workshop at ICLR
@d_haziza
Daniel Haziza
6 months
And this is hard because (2) consecutive tokens are more likely to activate the same features. We can have 10 consecutive tokens with the same feature activated. We work around this issue by shuffling/unshuffling tokens before/after the FFN
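A minimal sketch of the shuffle/unshuffle idea in PyTorch (the helper name and the use of a random permutation are illustrative assumptions, not the actual xFormers implementation):

```python
import torch

def ffn_with_token_shuffle(ffn, x):
    # x: (num_tokens, hidden_dim). Consecutive tokens tend to activate the same features,
    # which hurts the 2:4 pattern; shuffling spreads them out, and we restore the order after.
    perm = torch.randperm(x.shape[0], device=x.device)
    inv_perm = torch.argsort(perm)
    y = ffn(x[perm])      # the FFN (and its 2:4 sparsification) sees shuffled tokens
    return y[inv_perm]    # unshuffle so outputs line up with the original token order
```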
@d_haziza
Daniel Haziza
6 months
This is hard because (1) some individual features are not sparse (even though the entire tensor is globally sparse). It's less than 5% of the features, so we disable sparsity for these very few features
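A rough sketch of how those few non-sparse features could be detected and kept dense (the helper and the 50% threshold are assumptions for illustration):

```python
import torch

def find_dense_features(acts, min_sparsity=0.5):
    # acts: (num_tokens, num_features) post-activation tensor.
    # Features with too few zeros would lose real values if forced into the 2:4 pattern,
    # so we flag them to stay dense (reportedly <5% of features in practice).
    zero_fraction = (acts == 0).float().mean(dim=0)
    return zero_fraction < min_sparsity   # boolean mask over the feature dimension
```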
@d_haziza
Daniel Haziza
6 months
However, we hit some issues in the BW pass: we need to 2:4-sparsify activations along the *features* dimension
@d_haziza
Daniel Haziza
6 months
And it turns out we *can* put the activations in that format during training: 1 GEMM out of 2 is sparse in the FW pass, and 3 out of 4 in the BW pass!
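One way to account for these counts, assuming a standard two-linear FFN where `a = srelu(x @ W1.T)` is the 2:4-sparse tensor, and noting that the gradient flowing back through srelu shares a's zero pattern:

```python
# Forward (2 GEMMs):
#   h = x @ W1.T                dense
#   y = a @ W2.T                sparse              -> 1 out of 2
#
# Backward (4 GEMMs); grad_h = grad_a * srelu'(h) is zero wherever a is zero:
#   grad_a  = grad_y @ W2       dense (no sparse operand)
#   grad_W2 = grad_y.T @ a      sparse
#   grad_x  = grad_h @ W1       sparse
#   grad_W1 = grad_h.T @ x      sparse              -> 3 out of 4
```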
@d_haziza
Daniel Haziza
6 months
In other words, we spend a loooot of flops multiplying zeros! Can't we avoid that? Recent GPUs have hardware support for sparsity in matrix multiplications! As long as your tensor is in the 2:4 sparsity format: 2 zero values for every 4 consecutive values
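A tiny illustration of the format using PyTorch's semi-structured sparse support (requires a recent PyTorch build and an Ampere-or-newer GPU; the conversion expects the 2:4 zero pattern to already be present):

```python
import torch
from torch.sparse import to_sparse_semi_structured

# 2:4 sparsity: in every group of 4 consecutive values, at least 2 are zero.
A = torch.tensor([1, 2, 0, 0], dtype=torch.half, device="cuda").tile(128, 32)  # (128, 128)
A_sparse = to_sparse_semi_structured(A)  # stores only the non-zeros plus compact position metadata
print(A_sparse)
```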
@d_haziza
Daniel Haziza
6 months
If you train a 7B model with srelu, you expect the activations post-relu to be 50% sparse. That's the case at initialization, when activations are roughly normal/centered, but then the sparsity level skyrockets to 90%+ depending on the layer!
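A quick way to see the initialization regime (layer sizes here are illustrative, not the 7B config):

```python
import torch

def squared_relu(x):
    return torch.relu(x) ** 2                 # "srelu": exact zeros wherever the input is <= 0

# At initialization, pre-activations are roughly centered, so ~half of them are <= 0:
x = torch.randn(4096, 11008)                  # (tokens, FFN hidden dim)
print((squared_relu(x) == 0).float().mean())  # ~0.5 here; in a trained LLM this climbs to 0.9+
```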
@d_haziza
Daniel Haziza
6 months
We just released a method to accelerate transformers with 2:4 sparsity ... for activations! 🚀 The main trick is to use Squared-ReLU ("srelu") as the activation function in your FFN (instead of e.g. SwiGLU). This change does not affect your LLM loss/benchmarks at 1B/7B scales.
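A minimal sketch of an FFN block with Squared-ReLU in place of SwiGLU (sizes, the lack of bias and the lack of gating are assumptions, not the exact recipe from the paper):

```python
import torch
import torch.nn as nn

class SquaredReLUFFN(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim, bias=False)
        self.fc2 = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        a = torch.relu(self.fc1(x)) ** 2   # these activations become highly sparse during training
        return self.fc2(a)                 # ...and this is the GEMM that 2:4 sparsity can accelerate
```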
@d_haziza
Daniel Haziza
8 months
In the approach we tested and scaled (Memory Layers), 3 FFNs are replaced with a giant embedding lookup. A single token will only activate a small part of this embedding table (<1MB). In MoE, the first token to activate an expert adds ~GBs of activated parameters!
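A heavily simplified toy version of that lookup (the real Memory Layers use product keys instead of the brute-force scoring below; names and sizes are illustrative only):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMemoryLayer(nn.Module):
    def __init__(self, dim: int, num_slots: int = 1_000_000, topk: int = 32):
        super().__init__()
        self.keys = nn.Embedding(num_slots, dim)
        self.values = nn.Embedding(num_slots, dim)
        self.topk = topk

    def forward(self, x):                          # x: (num_tokens, dim)
        scores = x @ self.keys.weight.T            # brute-force scan -- what product keys avoid
        w, idx = scores.topk(self.topk, dim=-1)    # each token selects only `topk` slots...
        v = self.values(idx)                       # ...so only a tiny slice of the value table is read
        return (F.softmax(w, dim=-1).unsqueeze(-1) * v).sum(dim=-2)
```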
@d_haziza
Daniel Haziza
8 months
With MoE, if 8 experts are activated by a token in your batch, decoding becomes ~5x slower than a dense model, despite having ~5x fewer flops. MoEs are good in 2 scenarios: batch size 1 (without speculative decoding), or very high batch size (training or prefilling)
@d_haziza
Daniel Haziza
8 months
Autoregressive decoding is almost never compute-bound. It is memory bound: the time it takes to decode a token is the time it takes to read all the activated parameters.
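A back-of-the-envelope version of that statement, with assumed numbers (bf16 weights, ~2 TB/s of HBM bandwidth, hypothetical expert sizes):

```python
# time per decoded token ≈ bytes of activated parameters / memory bandwidth
bandwidth = 2_000e9        # bytes/s (assumed HBM bandwidth)
bytes_per_param = 2        # bf16

configs = [("dense 7B", 7e9), ("MoE, batch activating 8 experts of 4.5B each", 8 * 4.5e9)]
for name, active_params in configs:
    t_ms = active_params * bytes_per_param / bandwidth * 1e3
    print(f"{name}: ~{t_ms:.1f} ms/token (memory-bound lower bound)")
```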
@d_haziza
Daniel Haziza
8 months
I've been working on an alternative to Mixture of Experts (MoE) for LLMs. In principle it's the same idea: same flops as dense, more parameters, not all of them activated. The difference? Much more efficient decoding, as MoEs are a nightmare for decoding workloads.
@gargighosh
Gargi Ghosh
8 months
Efficient scaling with Memory Layers, replacing FFNs with memory units: https://t.co/6DkkO0KvDK
@d_haziza
Daniel Haziza
9 months
I'm at NeurIPS! Reach out if you want to discuss xFormers, training/inference efficiency, CUDA kernels and more :) I'll spend some time at the Meta booth this week
@d_haziza
Daniel Haziza
1 year
I'll be at the #PyTorchConf in SF this Wednesday and Thursday - presenting the latest work from the xFormers team (poster)
@d_haziza
Daniel Haziza
1 year
You can find more details, from how we implemented this kernel, to the training recipe we used in this blogpost: https://t.co/MOcyAqDKwt
@d_haziza
Daniel Haziza
1 year
So we developed our own kernel for fast 2:4 sparsification, and it is an order of magnitude faster than alternatives. When sparsifying the weights, it makes linear layers 30% faster when considering the FW+BW passes!
@d_haziza
Daniel Haziza
1 year
For 2:4 sparse training to be efficient, the sparsification process + the sparse matrix multiplication must be faster than the dense matrix multiplication. This is *not* the case when using available primitives (e.g. from cuSparseLt).
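A rough timing harness for that end-to-end comparison using the stock PyTorch primitives (assumes an Ampere-or-newer GPU, fp16, and a PyTorch build with semi-structured 2:4 support; the custom xFormers sparsification kernel is not shown here):

```python
import torch
import torch.utils.benchmark as benchmark
from torch.sparse import to_sparse_semi_structured

device, dtype = "cuda", torch.half
A = torch.randn(8192, 8192, device=device, dtype=dtype)
B = torch.randn(8192, 8192, device=device, dtype=dtype)

# Give A a valid 2:4 pattern: 2 zeros in every 4 consecutive values along each row.
mask = torch.tensor([1, 1, 0, 0], device=device, dtype=dtype).tile(A.shape[0], A.shape[1] // 4)
A24 = A * mask
A24_sparse = to_sparse_semi_structured(A24)

def ms(stmt, g):
    return benchmark.Timer(stmt, globals=g).blocked_autorange().median * 1e3

print("dense mm  :", ms("torch.mm(A24, B)", {"torch": torch, "A24": A24, "B": B}), "ms")
print("sparsify  :", ms("f(A24)", {"f": to_sparse_semi_structured, "A24": A24}), "ms")
print("sparse mm :", ms("torch.mm(A24s, B)", {"torch": torch, "A24s": A24_sparse, "B": B}), "ms")
```

For training to win, `sparsify + sparse mm` has to come in under `dense mm`, which is the bar the custom kernel is built to clear.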
@d_haziza
Daniel Haziza
1 year
2:4 sparsity is essentially meant for inference: the weight matrix no longer changes, and can be sparsified ahead of time. For training, it's more complicated.
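For inference it looks roughly like this: prune the weight to a 2:4 pattern once, convert it, and let nn.Linear run on the sparse path (the magnitude-pruning rule and shapes are assumptions; a real setup would typically fine-tune afterwards to recover accuracy):

```python
import torch
from torch.sparse import to_sparse_semi_structured

linear = torch.nn.Linear(4096, 4096, bias=False).half().cuda()

# Naive magnitude pruning to 2:4: keep the 2 largest of every 4 consecutive weights.
w = linear.weight.detach()
groups = w.reshape(-1, 4)
keep = groups.abs().topk(2, dim=-1).indices
mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0).reshape_as(w)

linear.weight = torch.nn.Parameter(to_sparse_semi_structured(w * mask))
y = linear(torch.randn(128, 4096, device="cuda", dtype=torch.half))  # runs as a sparse GEMM
```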
@d_haziza
Daniel Haziza
1 year
These TensorCores can *only* do matrix multiplication, nothing more. But they have a special mode which is 2x faster: when the first operand is "2:4 sparse". A matrix is 2:4 sparse if, out of every 4 consecutive values, at least 2 are zeros.
@d_haziza
Daniel Haziza
1 year
When training NNs, 99%+ of the compute comes from matrix multiplication (linear layers, attention, etc.). NVIDIA GPUs have specialized hardware for matmuls (TensorCores), and they provide 95% of the flops of the device. So if you want to be fast, you need to use these TensorCores.