Just got back from vacation, and super excited to finally release Griffin - a new hybrid LLM mixing RNN layers with Local Attention - scaled up to 14B params!
My co-authors have already posted about our amazing results, so here's a 🧵on how we got there!
Our work built on S4/S4D from @_albertgu et al., as well as our own work on the LRU led by @orvieto_antonio, which simplified S4/S4D without sacrificing performance:
These models are blazingly fast at inference, but no one had scaled them up on language yet.
Initially, at small scale, we saw that S4D/LRU layers actually performed worse than even well-tuned vanilla LSTMs!
Why not just use LSTMs then? LSTMs have the same inference speed benefits as S4D/LRU, but are too slow to train and therefore cannot be scaled up!
S4D/LRU is much easier to scale since it uses a diagonal recurrent matrix. But how do we improve performance?
Inspired by LSTMs, we added a Recurrent Gate to the LRU, allowing it to discard the input at time t and preserve info from the history (“RG-LRU”). This matched LSTM performance!
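A minimal sketch of the idea in numpy (this is an illustration of a gated diagonal recurrence, not the paper's exact parameterization; `rg_lru_step` and the gate sampling are stand-ins for the learned components):

```python
import numpy as np

# Gated diagonal linear recurrence in the spirit of RG-LRU (illustrative).
# The recurrence gate r_t modulates the per-channel decay a: as r_t -> 0,
# the effective decay a_eff -> 1, so the input is discarded and history
# is preserved, matching the behavior described above.
def rg_lru_step(h, x, a, r):
    """One step: h_t = a_eff * h_{t-1} + sqrt(1 - a_eff**2) * x_t."""
    a_eff = a ** r                       # gated decay, elementwise (diagonal matrix)
    return a_eff * h + np.sqrt(1.0 - a_eff ** 2) * x

d = 4
h = np.zeros(d)
a = np.full(d, 0.9)                      # per-channel decay in (0, 1)
rng = np.random.default_rng(0)
for t in range(10):
    x = rng.standard_normal(d)
    r = rng.uniform(size=d)              # stand-in for a learned sigmoid gate
    h = rg_lru_step(h, x, a, r)
print(h.shape)                           # (4,)
```

The sqrt(1 - a_eff^2) input scaling keeps the state magnitude bounded regardless of the gate value.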
What about the overall architecture?
1. The MLP block makes the model expressive!
We found Gated MLP blocks to be better than vanilla MLPs.
2. The recurrent block can be simple.
Temporal Conv1D helps capture local patterns that RNNs struggle to express.
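A toy sketch of what a causal temporal Conv1D does (the function name and per-channel filter shape are illustrative, not the paper's implementation):

```python
import numpy as np

# Causal temporal convolution: each timestep mixes the current input with a
# short window of the past only (left padding, so no future leakage).
def causal_conv1d(x, w):
    """x: (T, d) sequence; w: (k, d) per-channel filter. Returns (T, d)."""
    T, d = x.shape
    k = w.shape[0]
    pad = np.concatenate([np.zeros((k - 1, d)), x], axis=0)
    return np.stack([(pad[t:t + k] * w).sum(axis=0) for t in range(T)])

x = np.arange(6, dtype=float).reshape(6, 1)
w = np.array([[0.5], [0.5]])             # k=2 moving average, purely illustrative
y = causal_conv1d(x, w)
print(y.ravel())
```

With a tiny kernel (k of 3-4 is typical in such blocks), this is cheap and handles short-range structure, leaving the recurrence to carry long-range information.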
But this still doesn’t match the performance of a *well-tuned* transformer!
Solution: simply use a Local Attn (LA) block every 2 recurrent blocks!
LA has a fixed-size state, and therefore fast inference! LA captures the recent past, while the RG-LRU models global structure. This is Griffin!
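The layer pattern described above (one Local Attention block per 2 recurrent blocks) can be sketched in a few lines; the block names here are just labels for illustration:

```python
# Griffin's repeating layer pattern: two recurrent blocks, then one
# local-attention block (names illustrative).
def griffin_layer_pattern(n_layers):
    pattern = ["recurrent", "recurrent", "local_attention"]
    return [pattern[i % 3] for i in range(n_layers)]

print(griffin_layer_pattern(6))
# -> ['recurrent', 'recurrent', 'local_attention', 'recurrent', 'recurrent', 'local_attention']
```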
RG-LRU was still slow to train though, as it was memory-bound. Inspired by @tri_dao's Flash Attention, we used Pallas to optimize HBM accesses → 3x speedup! We also use a linear scan instead of a parallel scan, as it's faster on TPUs.
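To make the linear vs. parallel scan trade-off concrete, here's a numpy sketch of both ways to evaluate h_t = a_t * h_{t-1} + b_t (the associative version is written sequentially here; on hardware its combine steps run in parallel, in O(log T) rounds):

```python
import numpy as np

# Linear scan: O(T) sequential steps, minimal extra memory traffic.
def linear_scan(a, b):
    h = np.zeros_like(b[0])
    out = []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return np.stack(out)

# Associative (parallel-style) scan over the composition
# (a1, b1) then (a2, b2) -> (a1*a2, a2*b1 + b2).
def associative_scan(a, b):
    A, B = a.copy(), b.copy()
    T, off = len(a), 1
    while off < T:
        A2, B2 = A.copy(), B.copy()
        A2[off:] = A[:-off] * A[off:]
        B2[off:] = A[off:] * B[:-off] + B[off:]
        A, B = A2, B2
        off *= 2
    return B                             # h_t, assuming h_{-1} = 0

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=(8, 4))
b = rng.standard_normal((8, 4))
h = linear_scan(a, b)
print(h.shape)                           # (8, 4)
```

Both compute the same result; the parallel scan does more total work and more memory movement, which is why the simple linear scan can win on TPUs for this memory-bound recurrence.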
Griffin now matches/beats Transformer training speed!
There were many more details involved in getting the project to succeed, but if I had to summarize our team's approach, we focused on:
1. Simple ideas that scale
2. Attention to detail in model design & implementation
3. Careful hyperparameter tuning
As an additional point: one thing we really focused on was running fair comparisons against a well-tuned Transformer baseline. Our baseline performs remarkably well on downstream evals, outperforming many well-known models while being trained on significantly fewer tokens.
@sohamde_ This is a really interesting & well-written paper; enjoyed reading it! I have a few questions:
From the paper, the scaling curves show continued improvements with model size. How do you expect performance to change as Griffin is scaled up even further, to 100B+ parameters or more?…
@pengzhangzhi1 Yes, we have: Gated MLPs work better than vanilla MLPs. We also have ablations on different Local Attention window sizes vs. Global Attention in the appendix of the paper.