Excited to announce RTF is the first release of a new research group at @LiquidAI_, led by @Massastrello and me. We will keep focusing on the foundations of architecture design, scaling, and systems for efficient training and inference, building on our work on deep signal processing.
[1/6] Releasing the Rational Transfer Function (RTF) parametrization for linear time-invariant (LTI) and weakly input-varying (wLIV) sequence models.
New SOTA for SSMs on LRA, and improved perplexity on language modeling with Hyena.
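To make the idea concrete: a rational transfer function defines an LTI system as a ratio of polynomials in z^-1, which unrolls into a simple linear recurrence. This is a minimal, hedged sketch of that textbook relationship (function name and coefficients are ours, not the RTF paper's implementation):

```python
# Illustrative sketch: a rational transfer function
#   H(z) = (b0 + b1 z^-1 + ...) / (1 + a1 z^-1 + ...)
# defines the LTI recurrence y[n] = sum_k b[k] x[n-k] - sum_k a[k] y[n-k].
def tf_filter(b, a, x):
    """Apply the LTI system with numerator coeffs b and denominator (1, *a)."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k - 1] * y[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        y.append(acc)
    return y

# One-pole low-pass H(z) = 0.5 / (1 - 0.5 z^-1); impulse response is 0.5 * 0.5^n
print(tf_filter([0.5], [-0.5], [1.0, 0.0, 0.0, 0.0]))  # → [0.5, 0.25, 0.125, 0.0625]
```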
Attention is great. Are there other operators that scale?
Excited to share our work on Hyena, an alternative to attn that can learn on sequences *10x longer*, up to *100x faster* than optimized attn, by using implicit long convolutions & gating
📜 1/
📢New research on mechanistic architecture design and scaling laws.
- We perform the largest scaling laws analysis (500+ models, up to 7B) of beyond Transformer architectures to date
- For the first time, we show that architecture performance on a set of isolated token manipulation tasks is predictive of scaling performance
[1/4] Excited to share the first experimental release of *torchdyn*, a PyTorch library for all things neural differential equations! torchdyn is developed by the core DiffEqML team (@Massastrello, @Diffeq_ml).
We've been hard at work pushing the frontiers of efficient architecture design and optimization. StripedHyena-7B is the result: the first alternative architecture truly competitive with the best Transformers of its size or larger.
And it's very fast.
Announcing StripedHyena 7B — an open source model using an architecture that goes beyond Transformers achieving faster performance and longer context.
It builds on the lessons learned over the past year designing efficient sequence modeling architectures.
Hungry for more content on efficient long context models after @srush_nlp's awesome keynote? We put together some of our perspectives in a short note:
Do we need Attention? (v0):
Slides for a survey talk summarizing recent Linear RNN models with a focus on NLP. Tries to cover a lot of different S4-related models (as well as RWKV/MEGA) in a digestible way.
Join us Dec 14th (EST time) for the NeurIPS workshop "The Symbiosis of Deep Learning and Differential Equations":
This is also your chance to submit questions to our great lineup of panelists, via:
[1/n] The community has been hard at work to speed up Neural ODEs, e.g. regularization strategies (@DavidDuvenaud, @chuckberryfinn) to keep the ODE easy to solve. We've also been thinking about the same problem, and we propose a different (compatible!) direction.
In case you missed it: a new 7B StripedHyena model is out (the longest context one yet), Evo-1 7B 🧬.
And it now runs in a single notebook (powered by @togethercompute), from DNA generation to protein folding.
I'm going to be at NeurIPS to present work on efficient model architecture and inference (with @exnx, @Massastrello, and others)
HyenaDNA:
Laughing Hyena:
Excited to catch up with old friends and make some new ones - DM if you'd like to chat!
The website for our 'The Symbiosis of Deep Learning and Differential Equations' #NeurIPS2021 workshop is up:
We have a special track for already published papers. Share your work from adjacent fields with the NeurIPS community!
Deadline: Sept. 17 AoE
[1/2] Another year, another NeurIPS!
"Neural Hybrid Automata: Learning Dynamics With Multiple Modes and Stochastic Transitions" accepted at
#NeurIPS2021
Come by to chat about NHA, a method to learn stochastic hybrid systems!
Poster: Dec 07 08:30 AM -- 10:00 AM (PST).
A primer on effortless Neural ODE models through @PyTorchLightnin and torchdyn (@Diffeq_ml).
We cover NDE boilerplate; Lightning does all the rest! This is only the beginning: we are extending the ecosystem with further integration between DiffEqML <-> PyL / @GridAI_
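For readers new to NDEs, the core forward pass being wrapped by all that boilerplate is just numerical integration of a parametrized vector field. A minimal sketch with a fixed-step Euler solver (names are illustrative, not torchdyn's actual API; real libraries use adaptive solvers and adjoint-based gradients):

```python
# Sketch of a Neural ODE forward pass: integrate dy/dt = f(t, y) from t0 to t1.
# Fixed-step Euler for clarity only; f would normally be a neural network.
def odeint_euler(f, y0, t0, t1, steps=100):
    y, t = list(y0), t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        y = [yi + dt * fi for yi, fi in zip(y, f(t, y))]
        t += dt
    return y

# Sanity check with a known system: dy/dt = -y, so y(1) ≈ y(0) * e^-1 ≈ 0.368
y1 = odeint_euler(lambda t, y: [-yi for yi in y], [1.0], 0.0, 1.0, steps=1000)
```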
Introducing Evo: a long-context biological model based on StripedHyena that generalizes across DNA, RNA, and proteins. It is capable of prediction tasks and generative design, from molecular to whole genome scale (over 650k tokens in length).
As our first NeurIPS experience, I have to say the results surpassed even the wildest of expectations. This is the culmination of a team effort with my dear friend @Massastrello, leading to @Diffeq_ml as an open-source effort for neural differential equations.
Neural ODE training can be difficult to get right. We find inspecting the adjoint flows can help in these situations. You can now easily access these quantities in torchdyn's models and log / visualize with your preferred @PyTorchLightnin utils.
Hyena is a convolutional layer for LLMs that can shrink the gap with attention, while scaling *subquadratically* in seq len (eg train a lot faster @ 64k + train 100k+ tokens!) 2/
blogs: ,
code:
Check out our code here. We’d love to hear from you about more applications. Let’s push the limits of context lengths in lang, vision, bio and more! 12/n
code:
More excellent work on modernizing linear attn. / linear RNNs.
Architecture design in 2024 is going to get even more sophisticated: we now have a variety of powerful "modern" primitives to choose from, each with different strengths.
Impressed by the performance of Mamba and believe in RNNs? We provide a simple alternative solution! Excited to share Gated Linear Attention (GLA-Transformer). (1/n)
Join us for the second edition of the
#NeurIPS2022
workshop "The Symbiosis of Deep Learning and Differential Equations"🌀
We're looking for your AI <> DE ideas: neural diff. eqs., neural operators, diffusion models and novel applications!
website:
Armed w/ synthetic tasks, we homed in on what makes attn so special, narrowing down 3 key properties: 1. It’s data-controlled 2. has sublinear parameter scaling (in seq len) 3. global context. Hyena achieves all 3 w/ a combo of long convs & element-wise multiplicative gating. 5/
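The two ingredients named above can be sketched in a few lines: a causal long convolution (the filter can span the whole sequence, giving global context) and element-wise multiplicative gating (making the operator data-controlled). This is an illustrative toy, not the Hyena codebase:

```python
# Toy sketch of a Hyena-style gated long convolution (names/shapes are ours).
def causal_conv(x, h):
    """y[n] = sum_k h[k] * x[n-k]; the filter h can be as long as the sequence."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def gated_long_conv(x, gate, h):
    """Element-wise (data-controlled) gate applied to the convolution output."""
    y = causal_conv(x, h)
    return [g * yi for g, yi in zip(gate, y)]

# Running sum filter, then a gate that zeros the middle position
print(gated_long_conv([1.0, 2.0, 3.0], [1.0, 0.0, 2.0], [1.0, 1.0, 1.0]))  # → [1.0, 0.0, 12.0]
```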
The code implementation is remarkably straightforward. Some exciting news for signal processing fans: we find filter parametrization (& custom input projections!) to be one of the most impactful design choices, & come up w/ some recommendations, see paper :) 6/
We started w/ great work on dense-attention-free models for language, e.g. H3, which paired w/ a few attn layers can match Transformers at 2.7B params. But how far can we get w/o any attn? On autoregressive language tasks (same tokenizer), we observed a gap in quality 3/
Just like attention, Hyena can be used in ViTs on ImageNet, suggesting mechanistic design benchmarks may help on perf beyond language. So excited for the potential of Hyena as a general deep learning operator especially in domains where long-range interactions are critical 10/
Took a while. Get in touch if you're interested in contributing to open-source for neural diff eqs and implicit models! We have a lot of other interesting projects and collaborations underway.
[1/6] Announcing **torchdyn version 1.0**: !
@MichaelPoli6 @Massastrello
We roughly doubled the number of tutorials (optimal control, parallel-in-time solvers, hybrid systems), added new models and developed a numerics suite for diff eqs and root finding
We introduce a framework for fast prototyping and testing of new architecture designs, called mechanistic architecture design (MAD).
MAD includes a collection of token manipulation tasks as unit tests of model capabilities: compression, recall, noisy recall, memorization, and more.
How do I stripe my model? We find optimal hybridization ratios (Hyena - MHA): ~25% of layers should be attention (at sequence length 8k), <25% if trying to balance perplexity and state size of the model.
And the ordering / topology? More on that in the paper
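A toy way to picture "striping" at the ~25% ratio mentioned above: interleave attention (MHA) blocks into a Hyena stack at a fixed stride. The layout rule below is purely illustrative; the paper studies the actual orderings and topologies:

```python
# Illustrative striping: place one MHA block every 1/attn_ratio layers.
def striped_layout(n_layers, attn_ratio=0.25):
    stride = max(1, round(1 / attn_ratio))
    return ["MHA" if (i + 1) % stride == 0 else "Hyena" for i in range(n_layers)]

# 8 layers at 25% attention → MHA at positions 4 and 8
print(striped_layout(8))
```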
Building on the groundwork by H3 using synthetics for model design, we tried to simulate performance gaps on simple grokking tasks that take a few mins to run. Surprisingly, one can recreate gaps by tweaking difficulty of synthetic tasks (seq length & vocab size)! 4/
We use MAD to identify promising architectures, including striped and MoE variants. Then, we perform an extensive compute-optimal (and beyond compute-optimal) scaling laws analysis of emerging architectures.
Fun fact: the optimal allocation of tokens to model size varies across architectures.
DSL for tile-based computation + Hopper arch. feature utilization for great out-of-the-box performance versus PyTorch and Triton. TK is quite fun to read and write, tons of potential applications in efficient model architectures!
(1/7) Happy mother’s day! We think what the mothers of America really want is a Flash Attention implementation that’s just 100 lines of code and 30% faster, and we’re happy to provide.
We're excited to introduce ThunderKittens (TK), a simple DSL embedded within CUDA that makes it easy to write fast GPU kernels.
With a way to measure progress on our small mechanistic design benchmarks, we refine the design of Hyena, and observe that particular parametrizations of the long convolutions scale more favorably in sequence length and vocabulary size 7/
On The Pile, we see the performance gaps with Transformers start to close, given a fixed FLOP budget. (Hyenas are crafty creatures: they don't leave perplexity points on the table!) 8/
The deadline for our #NeurIPS2021 workshop has been moved. More time to refine your submissions at the intersection of learning and differential equations!
New submission deadline: September 24th
Join us today Wed 30 at J #431 for fractals and collage representations () and J #123 for learning in frequency domain, neural operators and long-range dependencies (). Both at 11.00am - 1.00pm, catch me bouncing between posters!
Scaling laws on DNA pretraining? New sparsely gated layers such as Hyena experts?
Check the paper for more! We open-source the MAD pipeline for anyone to test their architectures!
Happy to share our latest research at #NeurIPS2021:
Multiple Shooting Layers (MSL): new parallel-in-time, implicit model that achieves speedups via parallelization and solution reuse.
Neural Hybrid Automata (NHA): learning stochastic, multi-mode hybrid systems.
@Massastrello @Diffeq_ml @samgreydanus
[3/4] While significant progress is being made by Julia devs and SciML (@ChrisRackauckas), we believe a continuous NN library for PyTorch to be of value to our research ecosystem. We leverage PyTorch Lightning's (@_willfalcon) sweet API to handle training loops.
❗New deadline for the #NeurIPS2022 workshop❗
"Symbiosis of Deep Learning and Differential Equations":
October 1st.
Website: . Send us your work on neural differential equations, learnable numerical methods, continuous-time diffusion and more!
Heading to Honolulu for ICML now!
Come talk to us about Hyena (or HyenaDNA) at our poster session :)
Poster: Wed 26th at 2pm HST.
Oral talk: Thurs 3:48pm HST
I’m here all week, feel free to reach out. Looking forward to all the great research chatter!
At ICML soon! Happy to chat about all things LLM training, efficient (alternative) architectures, long context, signal processing and dynamical systems!
[1/2] Robustness work often starts with motivations "is the duck classified as a duck because of duck cues or because it is often paired with water backgrounds, and the model picks up on that instead?"
Another great news: our recent paper on analysis of the shortcut learning problem is accepted at ICLR 2022!
We answer "why is color always preferred by DNNs?"
It is a very interesting paper, and well worth reading :^)
Congrats to all authors @ScimecaLuca @coallaoh @MichaelPoli6
A key feature of emerging architectures is the fact that they have a fixed state size for autoregressive inference.
We study the total state size of recurrent and striped models in a compute-optimal regime, finding interesting trade-offs between homogeneous and striped models.
This is a big deal for the NDE community. JAX, PyTorch and Julia are now all supported to various degrees; looking forward to seeing even more applications!
⭐️Announcing Diffrax!⭐️
Numerical differential equation solvers in
#JAX
.
Very efficient, and with oodles of fun features!
GitHub:
Docs:
Install: `pip install diffrax`
🧵 1/n
Thanks @BlancheMinerva! Compute is the main bottleneck right now. We are working on some lower-level optimizations that will hopefully make scaling easier for everyone.
[2/2] "Differentiable Multiple Shooting Layers"
Implicit, parallel-in-time models. We investigate how to perform fast inference via tracking of solutions across training iterations!
Poster: Thu Dec 09 08:30 AM -- 10:00 AM
@Diffeq_ml @Massastrello
[2/6] Our vision for torchdyn is to become the torchvision/audio for diff eqs and implicit models. There is no better time to get involved! Below: Multiple Shooting Layers as implicit, parallel-in-time Neural ODEs. Speed ups via solution reuse across iterations!
StripedHyena 7B is a hybrid architecture based on our latest work on scaling laws for alternative architectures and fast inference with gated convolutions.
It's a longer context model, with strong performance across a variety of standard language benchmarks, with 50% smaller caches.
How is MAD predictive of scaling laws performance? We study correlation between aggregate task performance and compute-optimal perplexity, and find strong correlation at all scales, particularly in models of a similar base class.
We use this fact to iteratively improve the architecture.
Meet LTM-1: LLM with *5,000,000 prompt tokens*
That's ~500k lines of code or ~5k files, enough to fully cover most repositories.
LTM-1 is a prototype of a neural network architecture we designed for giant context windows.
Stoked about this work! >60% utilization for gated convolutions makes new architectures even more compelling as Transformer replacements, with faster e2e training at shorter AND longer sequences.
Announcing FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores!
We speed up exact FFT convolutions by up to 7.93x over PyTorch, reduce memory footprint, and get 4.4x speedup end-to-end. Read on for more details:
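The identity FlashFFTConv exploits is the convolution theorem: a (circular) convolution equals an inverse transform of the pointwise product of transforms. A hedged sketch using a naive O(n²) DFT purely to demonstrate the identity (real implementations use fast FFTs on tensor cores; all names here are ours):

```python
import cmath

# Naive DFT (sign=-1 forward, sign=+1 for the inverse, up to a 1/n factor).
def dft(x, sign=-1):
    n = len(x)
    return [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
                for k in range(n)) for j in range(n)]

# Convolution theorem: circular_conv(x, h) == IDFT(DFT(x) * DFT(h)).
def fft_circular_conv(x, h):
    X, H = dft(x), dft(h)
    y = dft([a * b for a, b in zip(X, H)], sign=+1)
    return [v.real / len(x) for v in y]

# Direct O(n^2) circular convolution for comparison.
def circular_conv(x, h):
    n = len(x)
    return [sum(x[k] * h[(j - k) % n] for k in range(n)) for j in range(n)]
```

Both routes give the same result, e.g. `[3, 5, 7, 5]` for `x=[1,2,3,4]`, `h=[1,0,0,1]`; the transform route is what becomes subquadratic once the DFT is replaced by an FFT.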
Thanks @arankomatsuzaki and @_akhaliq for sharing!
We're particularly excited about the opportunity to share some of our research as an oral presentation!
@Massastrello
will share more about "Dissecting Neural ODEs" the coming days. I suspect we'll never get another paper at NeurIPS after exhausting our pool of luck this year :D
Try it on the Together API (@togethercompute):
Here is an example notebook:
Prompting and sampling is different from chat models. To start generating, we recommend turning off temperature and top p (temperature 1, top p 1, top k 4).
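The recommended setup (temperature 1, top-p 1, top-k 4) reduces to plain top-k sampling: keep only the 4 highest-probability tokens, renormalize, then sample. A minimal sketch (function and variable names are illustrative, not the Together API):

```python
import random

# Keep the k highest-probability token indices and renormalize their mass.
def top_k_filter(probs, k=4):
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

# Sample one token index from the filtered, renormalized distribution.
def sample(probs, k=4, rng=random):
    filtered = top_k_filter(probs, k)
    r, acc = rng.random(), 0.0
    for i, p in filtered.items():
        acc += p
        if r <= acc:
            return i
    return next(iter(filtered))
```

With temperature 1 and top-p 1 the only truncation comes from k, which is why top-k is the single knob in the recommendation above.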
[4/6] Although we are planning to remain in PyTorch for torchdyn, with further integration with torchcde (@PatrickKidger) and torchsde (@lxuechen), we have long-term plans to extend the DiffEqML ecosystem to JAX and Julia.
This also means we'll have more bandwidth for development (including some of the methods in the papers above). We're still committed to providing a complete neural diff. equations API for PyTorch / @PyTorchLightnin, and we have some new additions coming!
We are excited to release RedPajama-Data-v2: 30 trillion filtered & de-duplicated tokens from 84 CommonCrawl dumps, 25x larger than our first dataset.
It exposes a diverse range of quality annotations so you can slice & weight the data for LLM training.
Our 'The Symbiosis of Deep Learning and Differential Equations' workshop has been accepted for #NeurIPS2021!
Send us your work on data-driven dynamical systems, neural differential equations, solving PDEs with deep learning etc.
Tentative submission deadline Sept. 17.
@shoyer @GoogleAI Cool work! Working on residuals seems to be an effective way to go for these solver-neural net hybrids -- we found similar gains on a different set of tasks with hypersolvers ()
Excited about models that are sub-quadratic in sequence length and model dimension? Our Monarch Mixer paper is now on arXiv -- and super excited to present it as an oral at #NeurIPS2023!
Let's dive into what's new with the paper and the new goodies from this release:
Monarch
@DanielePanozzo @NeurIPSConf Yes! That's also why hybrids are a promising way forward
Nice work. Any reason for not including a benchmark dedicated to high stiffness (e.g. Robertson)? Or alternatively tuning parameters of NS and other systems to gradually increase stiffness.
@unsorsodicorda We've been having fun with the experimental version and it has been very useful for some of our new stuff. Definitely brings PyTorch one step closer to JAX on that front (though there are still limitations, of course)...
[7/n] This work is part of a *vast* literature on neural network differential equation solvers, though our focus is on Neural ODEs and their interplay with the solver. The code will be released soon as part of a research section of the *torchdyn* library:
We finally got around to open-sourcing more Neural ODE variants in the "torchdyn" library , including our latest "stacked neural ODEs" aka continuous-depth models with piece-wise constant parameters.
Scaling laws are different on DNA data at nucleotide resolution (and more broadly, on sequences at byte resolution). Scaling laws seem to hold strong (both on and off the compute-optimal frontier), so I am excited to see what larger models could do.
@danrothenberg Hopefully things will change rapidly now that DeepMind (climate team), Microsoft (AI4Science), NVIDIA & more are shifting some of their focus towards deep learning for climate and weather.
@BlancheMinerva I understand what you're trying to say, but it does matter - these are architectures that are not fully supported on HF, so a lot of additional work needs to be done to ensure people can run inference easily enough to work on them. Hence, interest (we have released checkpoints
@mmbronstein @b_p_chamberlain @migorinova @stefan_webb @emaros96 Cool to see you also got pretty good results on Cora - Citeseer - Pubmed with GDEs! FYI, we're releasing an extended version of the original with SDE + GNN for dynamic graphs and a latent var model which might be of interest to some of you.
Evo represents the culmination of a long line of research on deep signal processing: new layer primitives, architecture topologies, scaling law analysis, initialization schemes, custom inference algorithms.
Here we study why certain cues are intrinsically preferred by ERM, irrespective of their dataset frequencies (we normalize their predictive power by ensuring a task can be solved with any single cue alone). Interesting takeaways, e.g. color cues always dominate in visual tasks.