@neilhoulsby
Neil Houlsby
9 months
Great work from Elias figuring out optimal scaling for large sparse models. I find it particularly intriguing that sparsity unlocks a (half-)plane of optimal models on the compute/size axes. While regular dense Scaling Laws define only a single line of optimal models (for a given…
@elias_frantar
Elias Frantar
9 months
Excited to share our work "Scaling Laws for Sparsely-Connected Foundation Models", where we develop the first scaling laws for (fine-grained) parameter-sparsity in the context of modern Transformers trained on massive datasets. 1/10

Replies

@Francsbi
Francesco Bisignano
9 months
@neilhoulsby If we include parameters = 0 + dY, Y being the FLOPs axis, given any "reduction", e or so?