
Albert Gu
@_albertgu
Followers: 15K · Following: 2K · Media: 42 · Statuses: 374
assistant prof @mldcmu. chief scientist @cartesia_ai. leading the ssm revolution.
Joined December 2018
RT @AI21Labs: Now live. A new update to our Jamba open model family 🎉. Same hybrid SSM-Transformer architecture, 256K context window, effic….
0
21
0
I really like this result: an elegant framing and solution that substantially improves length generalization in recurrent models broadly (RNNs/SSMs/linear attention/etc.). This has significant implications for which problems architecture researchers should focus on, IMO.
Despite theoretically handling long contexts, existing recurrent models still fall short: they can fail to generalize past the training length. We show a simple and general fix that enables length generalization on sequences of up to 256k tokens, with no need to change the architectures!
1
13
112
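The post above doesn't spell out what the fix is, so here is a minimal, generic sketch of one intervention in this space: carrying the recurrent state across chunks ("state passing") instead of resetting it to zero, so a model trained on short windows can be run on much longer sequences with no architectural change. This illustrates the general idea only and is not a claim about the paper's actual method; DiagonalLinearRecurrence and the chunk length are toy stand-ins of my own.

```python
import torch

class DiagonalLinearRecurrence(torch.nn.Module):
    """Toy diagonal linear recurrence: h_t = a * h_{t-1} + b * x_t, per channel."""
    def __init__(self, dim: int):
        super().__init__()
        self.a_logit = torch.nn.Parameter(torch.zeros(dim))  # sigmoid -> decay in (0, 1)
        self.b = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x, h0=None):
        # x: (batch, length, dim); h0: (batch, dim) carried-over state, or None for zeros
        batch, length, dim = x.shape
        a = torch.sigmoid(self.a_logit)
        h = torch.zeros(batch, dim) if h0 is None else h0
        states = []
        for t in range(length):
            h = a * h + self.b * x[:, t]
            states.append(h)
        return torch.stack(states, dim=1), h  # all states, final state

# State passing: run a sequence far longer than the training window by feeding
# each chunk's final state in as the next chunk's initial state.
model = DiagonalLinearRecurrence(dim=16)
x = torch.randn(2, 4096, 16)           # much longer than a 256-step training chunk
h, outs = None, []
for chunk in x.split(256, dim=1):
    y, h = model(chunk, h0=h)
    outs.append(y)
y_full = torch.cat(outs, dim=1)        # matches processing all 4096 steps in one pass
```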
RT @TencentHunyuan: 🚀 Introducing Hunyuan-A13B, our latest open-source LLM. As an MoE model, it leverages 80B total parameters with just 1….
0
267
0
RT @chrisdonahuey: Excited to announce 🎵Magenta RealTime, the first open weights music generation model capable of real-time audio generati….
0
80
0
RT @cartesia_ai: Now live: Sonic TTS in Hinglish for India 🇮🇳. Fluid transitions….
0
4
0
exciting to see that hybrid models maintain reasoning performance with few attention layers. benefits of linear architectures are prominent for long reasoning traces, where efficiency is bottlenecked by decoding - seems like a free win if reasoning ability is preserved as well!
👀 Nemotron-H tackles large-scale reasoning while maintaining speed -- with 4x the throughput of comparable transformer models ⚡ See how #NVIDIAResearch accomplished this using a hybrid Mamba-Transformer architecture and model fine-tuning ➡️
1
12
93
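To make the decoding bottleneck concrete, here is a back-of-envelope comparison of what each architecture must read per generated token: an attention layer's KV cache grows linearly with the number of tokens produced so far, while an SSM layer touches a fixed-size state. The layer counts and dimensions below are illustrative assumptions, not Nemotron-H's actual configuration.

```python
# Per-token decode memory traffic, roughly: attention reads its whole KV cache,
# an SSM layer reads a constant-size state. All sizes are made-up but plausible.
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per=2):
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per  # K and V, bf16

def ssm_state_bytes(n_ssm_layers, d_model, state_dim, bytes_per=2):
    return n_ssm_layers * d_model * state_dim * bytes_per  # independent of sequence length

for T in (4_096, 65_536):  # short vs. long reasoning trace
    attn = kv_cache_bytes(n_attn_layers=32, n_kv_heads=8, head_dim=128, seq_len=T)
    ssm = ssm_state_bytes(n_ssm_layers=32, d_model=4096, state_dim=128)
    print(f"T={T:>6}: KV cache ~{attn / 1e9:.2f} GB vs. SSM state ~{ssm / 1e9:.3f} GB")
```

At 64k-token traces the all-attention cache reaches several gigabytes per sequence and keeps growing, which is where a hybrid with only a few attention layers pays off.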
RT @orvieto_antonio: We have a new SSM theory paper, just accepted to COLT, revisiting recall properties of linear RNNs. It's surprising….
0
40
0
RT @teortaxesTex: got lost in the noise but Falcon has released a series of models H1 and from the blog post it seems they've drastically r….
0
20
0
We dug into in-depth mechanistic differences between Transformers and SSMs:
1. SSMs are very strong at sequence modeling, but worse at certain algorithmic “skills” such as retrieval.
2. The gap appears in only a few heads.
3. This provides insight and improved designs for hybrid models.
The Transformer–SSM retrieval gap is driven by just a few heads! SSMs lag on tasks like MMLU (multiple-choice) and GSM8K (math) due to in-context retrieval challenges. But here’s the twist: just a handful of heads handle retrieval in both architectures. What we found 👇 1/
6
19
161
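For readers unfamiliar with what "in-context retrieval" means here, a minimal synthetic probe of the skill is associative recall: the model must copy the value paired with a queried key that appeared earlier in the context. The prompt format below is my own toy illustration; the thread's actual evidence comes from benchmarks like MMLU and GSM8K and from head-level analysis.

```python
# Toy associative-recall prompt: answering requires retrieving from the context,
# not from parametric knowledge. Format and alphabet are arbitrary choices.
import random

def make_recall_prompt(n_pairs=8, seed=0):
    rng = random.Random(seed)
    keys = rng.sample("ABCDEFGHJKLMNPQRSTUVWXYZ", n_pairs)
    vals = [str(rng.randint(0, 9)) for _ in range(n_pairs)]
    query = rng.choice(keys)
    context = " ".join(f"{k}->{v}" for k, v in zip(keys, vals))
    answer = vals[keys.index(query)]
    return f"{context}\nQ: What value is paired with {query}?\nA:", answer

prompt, answer = make_recall_prompt()
print(prompt)   # e.g. "N->1 D->0 ...\nQ: What value is paired with D?\nA:"
print(answer)   # the digit the model should produce
```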
RT @RaghuGanti: 🚀 Bamba v2 (9B) is here: faster, stronger, and smarter! A leaderboard model in just 3T tokens!! Bamba v1 + 1T tokens of tra….
0
21
0
RT @electronickale: ✨ Love 4o-style image generation but prefer to use Midjourney? Tired of manual prompt crafting from inspo images? PRIS….
0
31
0
We started off investigating applications of SSMs to PDEs but evolved toward a broader question of understanding memory in modeling PDEs: when does combining a sequence model (e.g. S4) with a Markovian neural operator (e.g. FNO) have advantages? Led by CMU students Ricardo and…
What is the role of memory in modeling time-dependent PDEs? I will be at ICLR presenting our paper (Oral), where we study when memory is beneficial for modeling time-dependent PDEs! 🔗 [Oral]: Thu 24 Apr, 10:30 am @ Session 1E. [Poster]: Thu 24 Apr, 3 pm, #617.
1
5
55
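A schematic of the modeling question in the post above: a Markovian neural operator predicts the next state from the current state alone, while the memory-augmented variant also conditions on a recurrent summary of the trajectory so far. The MLP and GRU below are toy stand-ins I chose for a runnable sketch; the paper's actual components are an FNO-style operator and an SSM (e.g. S4).

```python
import torch
from torch import nn

class MarkovianStep(nn.Module):
    """Stand-in for a Markovian neural operator: u_{t+1} = F(u_t)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.GELU(), nn.Linear(128, dim))
    def forward(self, u_t):
        return u_t + self.net(u_t)

class MemoryAugmentedStep(nn.Module):
    """Stand-in for operator + sequence model: u_{t+1}, h_{t+1} = F(u_t, h_t)."""
    def __init__(self, dim, mem=64):
        super().__init__()
        self.memory = nn.GRUCell(dim, mem)   # the paper studies SSMs (e.g. S4) in this role
        self.net = nn.Sequential(nn.Linear(dim + mem, 128), nn.GELU(), nn.Linear(128, dim))
    def forward(self, u_t, h):
        h = self.memory(u_t, h)
        return u_t + self.net(torch.cat([u_t, h], dim=-1)), h

# Roll both models forward: memory matters when the observed state alone
# under-determines the dynamics (e.g. coarse-grained or partially observed PDEs).
dim, steps = 32, 10
markov, with_memory = MarkovianStep(dim), MemoryAugmentedStep(dim)
u_m = u_s = torch.randn(4, dim)
h = torch.zeros(4, 64)
for _ in range(steps):
    u_m = markov(u_m)              # sees only the current state
    u_s, h = with_memory(u_s, h)   # also carries a summary of the history
```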
Lyra shows that biology rewards the right inductive biases! Careful architectural design can significantly improve performance and efficiency for modeling biological sequences.
🧬 Meet Lyra, a new paradigm for accessible, powerful modeling of biological sequences. Lyra is a lightweight SSM achieving SOTA performance across DNA, RNA, and protein tasks—yet up to 120,000x smaller than foundation models (ESM, Evo). Bonus: you can train it on your Mac. read
1
1
33
Announcing Cartesia’s Series A towards our mission of building real-time intelligence. I’m cooking up some new models in the back - looking for researchers who want to develop the next generation of architectures 👀.
We've raised a $64M Series A led by @kleinerperkins to build the platform for real-time voice AI. We'll use this funding to expand our team, and to build the next generation of models, infrastructure, and products for voice, starting with Sonic 2.0, available today. Link below
6
10
124
RT @TXhunyuan: 🚀 Introducing Hunyuan-TurboS – the first ultra-large Hybrid-Transformer-Mamba MoE model!.Traditional pure Transformer models….
0
223
0
an important problem for alternative architectures is figuring out how to leverage the much more established Transformer ecosystem to bootstrap new models. we scaled up Aviv's MOHAWK framework for "architecture distillation" to produce strong Llama -> Mamba (Llamba) models.
🔥 Llama-level performance with <0.1% of the training data 🔥. Together with @cartesia_ai, we introduce Llamba—a family of recurrent language models distilled from Llama-3 into Mamba. ⚡ Sizes: 1B, 3B, 8B. 🚀 Optimized for speed & on-device efficiency. Details here 🧵👇
2
10
82
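For context on what "architecture distillation" involves: roughly, MOHAWK aligns the student (Mamba) with the teacher (Llama) in stages, first matching token-mixing matrices, then matching per-block hidden states, then distilling end to end on the teacher's output distribution. The snippet below sketches only the latter two losses with random placeholder tensors and toy sizes; it is not Llamba's training code, and the exact recipe may differ.

```python
import torch
import torch.nn.functional as F

B, T, D, V = 2, 128, 64, 1000   # batch, sequence length, hidden size, vocab (toy sizes)

# Hidden-state alignment: match each student block's output to the teacher's.
teacher_hidden = torch.randn(B, T, D)
student_hidden = torch.randn(B, T, D, requires_grad=True)
align_loss = F.mse_loss(student_hidden, teacher_hidden)

# End-to-end knowledge distillation: KL between student and teacher next-token
# distributions (the stage where most of the transfer from Llama happens).
teacher_logits = torch.randn(B, T, V)
student_logits = torch.randn(B, T, V, requires_grad=True)
kd_loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.log_softmax(teacher_logits, dim=-1),
    log_target=True,
    reduction="batchmean",
)
(align_loss + kd_loss).backward()   # placeholders stand in for real activations
```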
Isaac has been interested in a general compression-based theory of intelligence. He explored this on ARC-AGI and got interesting results with a very different approach!
Introducing *ARC‑AGI Without Pretraining* – ❌ No pretraining. ❌ No datasets. Just pure inference-time gradient descent on the target ARC-AGI puzzle itself, solving 20% of the evaluation set. 🧵 1/4
2
13
169
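As a generic illustration of "inference-time gradient descent on the target puzzle itself": fit a small network from scratch on one puzzle's demonstration pairs and apply it to that puzzle's test input, with no pretraining and no external data. This is only the skeleton of the idea; the actual work frames the objective as compression, and its architecture and losses differ. All grids and sizes below are random placeholders.

```python
import torch
from torch import nn

C, H, W = 10, 8, 8                            # ARC palette size, toy grid dimensions
demo_in = torch.randint(0, C, (3, H, W))      # one puzzle's demonstration inputs
demo_out = torch.randint(0, C, (3, H, W))     # corresponding demonstration outputs
test_in = torch.randint(0, C, (1, H, W))      # the puzzle's test input

net = nn.Sequential(                          # per-puzzle model, trained from scratch
    nn.Conv2d(C, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, C, 3, padding=1)
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def one_hot(grid):                            # (N, H, W) ints -> (N, C, H, W) floats
    return nn.functional.one_hot(grid, C).permute(0, 3, 1, 2).float()

for step in range(200):                       # gradient descent on this puzzle only
    loss = nn.functional.cross_entropy(net(one_hot(demo_in)), demo_out)
    opt.zero_grad(); loss.backward(); opt.step()

prediction = net(one_hot(test_in)).argmax(dim=1)   # predicted test output grid
```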
RT @iScienceLuvr: Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners. Distilling Llama-1B and -3B models with only 8 b….
0
91
0
RT @tri_dao: I've been excited about this for a while: a simple architectural change to the residual connection that allows arbitrary overl….
0
66
0