@sohamde_
Soham De
3 months
Just got back from vacation, and super excited to finally release Griffin - a new hybrid LLM mixing RNN layers with Local Attention - scaled up to 14B params! My co-authors have already posted about our amazing results, so here's a 🧵on how we got there!

Replies

@sohamde_
Soham De
3 months
Our work built on S4/S4D from @_albertgu et al., as well as our own work on the LRU led by @orvieto_antonio , which simplified S4/S4D without sacrificing performance. These models are blazingly fast at inference, but no one had scaled them up on language yet.
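For illustration only (my sketch, not code from the paper): these linear recurrences decode fast because the recurrent matrix is diagonal and the state has a fixed size, so each new token costs the same small amount of compute and memory. (The actual LRU uses a complex-valued diagonal and extra normalization; this is simplified.)

```python
# Minimal sketch of per-token decoding with a diagonal linear recurrence.
# The state h has fixed size d, independent of how many tokens came before.
import jax.numpy as jnp

def lru_decode_step(h, x_t, a, b):
    # h: (d,) state, x_t: (d,) input, a/b: (d,) diagonal recurrence/input weights
    h = a * h + b * x_t          # elementwise, since the transition matrix is diagonal
    return h, h                  # new state, output

d = 8
h = jnp.zeros(d)
a = jnp.full(d, 0.9)             # toy decay values
b = jnp.ones(d)
for x_t in jnp.ones((5, d)):     # pretend stream of 5 tokens
    h, y_t = lru_decode_step(h, x_t, a, b)
```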
@sohamde_
Soham De
3 months
Initially, at small scale, we saw that S4D/LRU layers actually performed worse than even well-tuned vanilla LSTMs! Why not just use LSTMs then? LSTMs have the same inference-speed benefits as S4D/LRU, but are too slow to train and therefore cannot be scaled up!
@sohamde_
Soham De
3 months
S4D/LRU is much easier to scale since it uses a diagonal recurrent matrix. But how do we improve performance? Inspired by LSTMs, we added a Recurrent Gate to the LRU, allowing it to discard the input at time t and preserve info from the history (“RG-LRU”). This matched LSTM performance!
Tweet media one
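A minimal sketch of a gated diagonal recurrence in the spirit of the RG-LRU (my simplification with made-up parameter names; see the paper for the exact parameterization): a gate computed from the input modulates the diagonal decay, so the layer can push the decay towards 1, discard the current input, and carry its history forward.

```python
# Sketch only: a gated diagonal recurrence in the RG-LRU spirit,
# not the paper's exact equations.
import jax
import jax.numpy as jnp

def rg_lru_step(h, x_t, params):
    Wa, ba, Wx, bx, log_a = params               # hypothetical parameter names
    r_t = jax.nn.sigmoid(x_t @ Wa + ba)          # recurrence gate
    i_t = jax.nn.sigmoid(x_t @ Wx + bx)          # input gate
    a_t = jnp.exp(log_a * r_t)                   # gated diagonal decay in (0, 1)
    # r_t -> 0 drives a_t -> 1: the input term vanishes and history is preserved.
    h = a_t * h + jnp.sqrt(1.0 - a_t ** 2) * (i_t * x_t)
    return h, h

def rg_lru(x, params, d):
    h0 = jnp.zeros(d)
    _, y = jax.lax.scan(lambda h, x_t: rg_lru_step(h, x_t, params), h0, x)
    return y

d = 16
ka, kx = jax.random.split(jax.random.PRNGKey(0))
params = (jax.random.normal(ka, (d, d)) / jnp.sqrt(d), jnp.zeros(d),
          jax.random.normal(kx, (d, d)) / jnp.sqrt(d), jnp.zeros(d),
          -0.1 * jnp.ones(d))                    # log of the diagonal decay (< 0)
y = rg_lru(jnp.ones((32, d)), params, d)         # (seq_len=32, d)
```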
@sohamde_
Soham De
3 months
What about the overall architecture? 1. The MLP block makes the model expressive! We found Gated MLP blocks to be better than vanilla MLPs. 2. The recurrent block can be simple. A temporal Conv1D helps capture local functions that RNNs struggle to express.
Tweet media one
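A rough sketch of the two block types mentioned above (my own illustration, not the paper's code): a gated MLP, where one branch gates the other elementwise, and a depthwise causal Conv1D over time that mixes only a short local window of inputs.

```python
# Illustration only: gated MLP and causal temporal Conv1D building blocks.
import jax
import jax.numpy as jnp

def gated_mlp(x, W_gate, W_up, W_down):
    # One branch is passed through a nonlinearity and gates the other branch
    # elementwise before the down-projection.
    return (jax.nn.gelu(x @ W_gate) * (x @ W_up)) @ W_down

def temporal_conv1d(x, kernel):
    # Depthwise causal Conv1D: position t mixes only inputs t-k+1 .. t.
    k, d = kernel.shape
    x_pad = jnp.pad(x, ((k - 1, 0), (0, 0)))                           # left-pad to stay causal
    windows = jnp.stack([x_pad[i:i + x.shape[0]] for i in range(k)])   # (k, T, d)
    return jnp.einsum('ktd,kd->td', windows, kernel)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (10, 4))                                    # (T=10, d=4)
y = temporal_conv1d(x, jnp.ones((3, 4)) / 3.0)                         # width-3 causal smoothing
```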
@sohamde_
Soham De
3 months
But this still doesn’t match the performance of a *well-tuned* transformer! Solution: simply use a Local Attn (LA) block after every 2 recurrent blocks! LA has a fixed-size state, and therefore fast inference! LA captures the recent past, while the RG-LRU models global structure. This is Griffin!
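Schematically, as I read the tweet (block order and residual structure are my assumptions), the stack repeats two recurrent blocks followed by one local-attention block, each paired with a gated-MLP block:

```python
# Schematic layer pattern only; the block functions are placeholders.
def griffin_stack(x, depth, recurrent_block, local_attention_block, gated_mlp_block):
    for i in range(depth):
        if i % 3 == 2:
            x = x + local_attention_block(x)   # fixed window -> fixed-size KV state
        else:
            x = x + recurrent_block(x)         # RG-LRU handles longer-range structure
        x = x + gated_mlp_block(x)             # channel-mixing block
    return x

# Runs with identity placeholders just to show the wiring:
out = griffin_stack(1.0, depth=6,
                    recurrent_block=lambda h: h,
                    local_attention_block=lambda h: h,
                    gated_mlp_block=lambda h: h)
```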
@sohamde_
Soham De
3 months
The RG-LRU was still slow to train, though, as it was memory-bound. Inspired by @tri_dao 's Flash Attention, we used Pallas to optimize HBM accesses → 3x speedup! We also used linear scans instead of parallel scans, as they're faster on TPUs. Griffin now matches/beats Transformer training speed!
Tweet media one
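To illustrate the linear-vs-parallel scan point (my sketch, not the Pallas kernel): the same diagonal recurrence h_t = a_t * h_{t-1} + b_t can be computed sequentially with jax.lax.scan or in O(log T) depth with jax.lax.associative_scan; which one is faster depends on the hardware.

```python
# Both functions compute h_t = a_t * h_{t-1} + b_t for a diagonal recurrence.
import jax
import jax.numpy as jnp

def linear_scan(a, b):
    # Sequential scan: T tiny steps, memory-friendly.
    def step(h, ab):
        a_t, b_t = ab
        h = a_t * h + b_t
        return h, h
    _, h = jax.lax.scan(step, jnp.zeros_like(b[0]), (a, b))
    return h

def parallel_scan(a, b):
    # Associative scan: combines (a, b) pairs pairwise, O(log T) depth.
    def combine(x, y):
        a1, b1 = x
        a2, b2 = y
        return a1 * a2, a2 * b1 + b2
    _, h = jax.lax.associative_scan(combine, (a, b))
    return h

T, d = 128, 16
a = jnp.full((T, d), 0.9)
b = jnp.ones((T, d))
assert jnp.allclose(linear_scan(a, b), parallel_scan(a, b), atol=1e-4)
```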
@sohamde_
Soham De
3 months
There were many more details involved in getting the project to succeed, but if I had to summarize our team's approach, we focused on: 1. Simple ideas that scale 2. Attention to detail in model design & implementation 3. Careful hyperparameter tuning
@sohamde_
Soham De
3 months
As an additional point: one thing we really focused on was running fair comparisons against a well-tuned Transformer baseline. Our baseline performs remarkably well on downstream evals, outperforming many well-known models while being trained on significantly fewer tokens.
@sohamde_
Soham De
3 months
This was a long journey! Huge thanks to a world-class team in model design, optimization and engineering. Learnt tons from all of you! @SamuelMLSmith , @botev_mg , @AnushanFer61200 , @GeorgeMuraru , @_albertgu , Ruba, @LeonardBerrada , @yeewhye , Razvan, @NandoDF , @caglarml and others!
@cwolferesearch
Cameron R. Wolfe, Ph.D.
3 months
@sohamde_ This is awesome. I've always thought transformers lose something by completely eliminating recurrence
@aadityaura
Aaditya Ura ( looking for PhD )
3 months
@sohamde_ This is a really interesting & well-written paper; enjoyed reading it. I have a few questions. From the paper, scaling curves show continued improvements with model size. How do you expect performance to change as Griffin is scaled up even further, to 100B+ parameters or more?…
@aadityaura
Aaditya Ura ( looking for PhD )
3 months
@sohamde_ Dedicating this to the legend @SchmidhuberAI 🤠
Tweet media one
@synthical_ai
Synthical
3 months
@sohamde_ Dark mode for this paper 🌚
@schwarzjn_
Jonathan Richard Schwarz
3 months
@sohamde_ Congrats Soham! Awesome paper
@PandaAshwinee
Ashwinee Panda
3 months
@sohamde_ Awesome work as usual, Soham!
@Mitodru
মিতদ্রু
3 months
@sohamde_ You should have your own startup man, great work though 🎉
@sohamde_
Soham De
3 months
@pengzhangzhi1 Yes, we have. Gated MLPs work better than vanilla MLPs. We have ablations on different window sizes of Local Attention vs. Global Attention in the appendix of the paper.
@Tsingggg
Tsing
3 months
@sohamde_ Paper is great but this ain't no release
@ArunPatala
Arun Patala
3 months
@sohamde_ @NandoDF Code please or weights