Albert Gu Profile
Albert Gu
@_albertgu

Followers: 15K · Following: 2K · Media: 42 · Statuses: 374

assistant prof @mldcmu. chief scientist @cartesia_ai. leading the ssm revolution.

Joined December 2018
@_albertgu
Albert Gu
10 hours
RT @AI21Labs: Now live. A new update to our Jamba open model family 🎉. Same hybrid SSM-Transformer architecture, 256K context window, effic….
0
21
0
@_albertgu
Albert Gu
13 hours
I really like this result: an elegant framing and solution that substantially improves length generalization in recurrent models broadly (RNNs, SSMs, linear attention, etc.). This has significant implications for which problems architecture researchers should focus on, IMO.
@rbuit_
Ricardo Buitrago
13 hours
Despite theoretically handling long contexts, existing recurrent models still fall short: they may fail to generalize past the training length. We show a simple and general fix that enables length generalization on sequences of up to 256k tokens, with no architecture changes!
1
13
112
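A note on why recurrent models can fail past the training length: during training, the hidden state only ever visits the distribution reachable from its initial value within the training horizon, so at longer test lengths the state drifts out of that distribution. The sketch below is a minimal, hypothetical NumPy illustration of a generic diagonal linear recurrence rolled out far past a typical training length, with the state carried across chunks; it is not the code or the specific fix from the quoted paper.

```python
# Illustrative sketch (not the authors' code): a generic diagonal linear
# recurrence processed chunk by chunk while carrying the hidden state.
import numpy as np

def scan(x, a, b, h0):
    """Run the diagonal linear recurrence h_t = a * h_{t-1} + b * x_t over one chunk."""
    h = h0
    for x_t in x:
        h = a * h + b * x_t
    return h

rng = np.random.default_rng(0)
d = 8                                   # state dimension (made up for illustration)
a = rng.uniform(0.9, 0.999, size=d)     # per-channel decay rates near 1 -> long-range memory
b = rng.normal(size=d)

x = rng.normal(size=(4096, d))          # far longer than a typical training length

# Training on short sequences only exposes the model to states reachable from
# h0 = 0 within the training horizon; at test time the state keeps evolving
# beyond that distribution, which is where length generalization can break.
h = np.zeros(d)
for chunk in np.split(x, 8):            # process in chunks, carrying the state across them
    h = scan(chunk, a, b, h)

print("state norm after 4096 steps:", float(np.linalg.norm(h)))
```

The quoted work's point is precisely about making such long rollouts behave, without changing the architecture itself.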
@_albertgu
Albert Gu
11 days
RT @TencentHunyuan: 🚀 Introducing Hunyuan-A13B, our latest open-source LLM. As an MoE model, it leverages 80B total parameters with just 1….
0
267
0
@_albertgu
Albert Gu
17 days
RT @chrisdonahuey: Excited to announce 🎵Magenta RealTime, the first open weights music generation model capable of real-time audio generati….
0
80
0
@_albertgu
Albert Gu
21 days
RT @cartesia_ai: Now live: Sonic TTS in Hinglish for India 🇮🇳. Fluid transitions…
0
4
0
@_albertgu
Albert Gu
27 days
exciting to see that hybrid models maintain reasoning performance with only a few attention layers. the benefits of linear architectures are most prominent for long reasoning traces, where efficiency is bottlenecked by decoding - seems like a free win if reasoning ability is preserved as well!
@NVIDIAAIDev
NVIDIA AI Developer
1 month
👀 Nemotron-H tackles large-scale reasoning while maintaining speed -- with 4x the throughput of comparable transformer models ⚡ See how #NVIDIAResearch accomplished this using a hybrid Mamba-Transformer architecture and model fine-tuning ➡️
1
12
93
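To make the decoding bottleneck concrete: with a KV cache, every attention layer stores keys and values for the whole context, so decode-time memory (and per-token attention work) grows with the length of the reasoning trace, while an SSM layer keeps a fixed-size state. The back-of-the-envelope sketch below uses made-up layer counts and dimensions (not Nemotron-H's actual configuration) just to show the shape of the trade-off.

```python
# Back-of-the-envelope sketch with hypothetical sizes (not Nemotron-H's real config):
# decode-time memory for attention layers (KV cache grows with context length)
# vs. SSM layers (fixed-size recurrent state, independent of length).

def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V per attention layer, for every token in the context.
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_ssm_layers, d_model, state_dim, bytes_per_elem=2):
    # One fixed-size state per SSM layer, regardless of sequence length.
    return n_ssm_layers * d_model * state_dim * bytes_per_elem

seq_len = 32_768                     # a long reasoning trace
d_model, head_dim, n_kv_heads, state_dim = 4096, 128, 8, 64

pure_transformer = kv_cache_bytes(n_attn_layers=32, n_kv_heads=n_kv_heads,
                                  head_dim=head_dim, seq_len=seq_len)
hybrid = (kv_cache_bytes(n_attn_layers=4, n_kv_heads=n_kv_heads,
                         head_dim=head_dim, seq_len=seq_len)
          + ssm_state_bytes(n_ssm_layers=28, d_model=d_model, state_dim=state_dim))

print(f"pure transformer cache: {pure_transformer / 2**30:.2f} GiB")
print(f"hybrid (4 attn + 28 SSM): {hybrid / 2**30:.2f} GiB")
```

Replacing most attention layers with SSM layers shrinks the length-dependent part of the cache to a handful of layers, which is where the throughput gains at long decode lengths come from.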
@_albertgu
Albert Gu
1 month
RT @orvieto_antonio: We have a new SSM theory paper, just accepted to COLT, revisiting recall properties of linear RNNs. It's surprising….
0
40
0
@_albertgu
Albert Gu
2 months
RT @teortaxesTex: got lost in the noise but Falcon has released a series of models H1 and from the blog post it seems they've drastically r….
0
20
0
@_albertgu
Albert Gu
2 months
We dug into the mechanistic differences between Transformers and SSMs in depth:
1. SSMs are very strong at sequence modeling, but worse at certain algorithmic “skills” such as retrieval.
2. The gap appears only in a few heads.
3. This provides insight and improved designs for hybrid models.
@avivbick
Aviv Bick
2 months
The Transformer–SSM retrieval gap is driven by just a few heads! SSMs lag on tasks like MMLU (multiple-choice) and GSM8K (math) due to in-context retrieval challenges. But here’s the twist: just a handful of heads handle retrieval in both architectures. What we found 👇 1/
6
19
161
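One generic way to see whether an ability like retrieval is concentrated in a few heads is an ablation sweep: zero out one head at a time and measure how much a retrieval-style metric drops. The sketch below is a hypothetical probe with a toy stand-in for the model and benchmark; the quoted work's actual methodology is the one described in the thread and paper.

```python
# Illustrative ablation sweep (hypothetical interface, not the paper's methodology):
# a sharply peaked distribution of accuracy drops over (layer, head) pairs is the
# signature of "just a handful of heads handle retrieval".
from typing import Callable

def rank_heads_by_retrieval_drop(
    heads: list[tuple[int, int]],
    eval_retrieval: Callable[[set], float],
) -> dict[tuple[int, int], float]:
    """eval_retrieval(ablated) returns retrieval accuracy with the given heads zeroed out."""
    baseline = eval_retrieval(set())
    return {h: baseline - eval_retrieval({h}) for h in heads}

# Toy stand-in for a real model + benchmark: pretend heads (3, 1) and (7, 5)
# carry almost all of the retrieval ability.
def fake_eval(ablated: set) -> float:
    important = {(3, 1), (7, 5)}
    return 0.90 - 0.35 * len(important & ablated)

heads = [(layer, head) for layer in range(8) for head in range(8)]
drops = rank_heads_by_retrieval_drop(heads, fake_eval)
top = sorted(drops.items(), key=lambda kv: -kv[1])[:3]
print("largest accuracy drops:", top)
```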
@_albertgu
Albert Gu
2 months
RT @RaghuGanti: 🚀 Bamba v2 (9B) is here: faster, stronger, and smarter! A leaderboard model in just 3T tokens!! Bamba v1 + 1T tokens of tra…
0
21
0
@_albertgu
Albert Gu
2 months
RT @electronickale: ✨ Love 4o-style image generation but prefer to use Midjourney? Tired of manual prompt crafting from inspo images? PRIS…
0
31
0
@_albertgu
Albert Gu
3 months
We started off investigating applications of SSMs to PDEs, but this evolved into a broader question about the role of memory in modeling PDEs: when does combining a sequence model (e.g. S4) with a Markovian neural operator (e.g. FNO) have advantages? Led by CMU students Ricardo and…
@__tm__157
Tanya Marwah
3 months
What is the role of memory in modeling time-dependent PDEs? I will be at ICLR presenting our paper (Oral), where we study when memory is beneficial for modeling time-dependent PDEs! 🔗 [Oral]: Thu 24 Apr, 10:30 am @ Session 1E. [Poster]: Thu 24 Apr, 3 pm, #617.
1
5
55
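The distinction at play: a Markovian neural operator advances the solution from the current state alone, u_{t+1} = F(u_t), while a sequence model also carries a summary of the trajectory, u_{t+1} = F(u_t, h_t). Below is a toy NumPy contrast on a 1-D heat-equation rollout; the components are stand-ins, not the models from the paper, and the memory term is deliberately given zero weight so the Markovian rollout is recovered as a special case.

```python
import numpy as np

def heat_step(u, nu=0.1):
    """One explicit finite-difference step of the 1-D heat equation (periodic boundary),
    standing in for whatever learned operator advances the solution by one step."""
    return u + nu * (np.roll(u, 1) - 2 * u + np.roll(u, -1))

def markovian_rollout(u0, steps):
    # Markovian: the next state depends only on the current state, u_{t+1} = F(u_t).
    u = u0
    for _ in range(steps):
        u = heat_step(u)
    return u

def memory_rollout(u0, steps, decay=0.9, memory_weight=0.0):
    # Memory-augmented: also carry a hidden state h_t summarizing the trajectory,
    # u_{t+1} = F(u_t, h_t). With memory_weight = 0 this reduces to the Markovian case.
    u, h = u0, np.zeros_like(u0)
    for _ in range(steps):
        h = decay * h + (1 - decay) * u          # toy running summary of past states
        u = heat_step(u) + memory_weight * h     # a learned model would couple u and h
    return u, h

u0 = np.sin(np.linspace(0, 2 * np.pi, 64))
u_markov = markovian_rollout(u0, 100)
u_mem, h = memory_rollout(u0, 100)
print("identical when the memory term is unused:", bool(np.allclose(u_markov, u_mem)))
```

The question studied in the thread is when letting h_t actually influence the update helps, versus when the Markovian operator alone suffices.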
@_albertgu
Albert Gu
4 months
Lyra shows that biology rewards the right inductive biases! Careful architectural design can significantly improve performance and efficiency for modeling biological sequences.
@KrithikTweets
Krithik Ramesh
4 months
🧬 Meet Lyra, a new paradigm for accessible, powerful modeling of biological sequences. Lyra is a lightweight SSM achieving SOTA performance across DNA, RNA, and protein tasks—yet up to 120,000x smaller than foundation models (ESM, Evo). Bonus: you can train it on your Mac. read
1
1
33
@_albertgu
Albert Gu
4 months
Announcing Cartesia’s Series A towards our mission of building real-time intelligence. I’m cooking up some new models in the back - looking for researchers who want to develop the next generation of architectures 👀.
@cartesia_ai
Cartesia
4 months
We've raised a $64M Series A led by @kleinerperkins to build the platform for real-time voice AI. We'll use this funding to expand our team, and to build the next generation of models, infrastructure, and products for voice, starting with Sonic 2.0, available today. Link below
6
10
124
@_albertgu
Albert Gu
4 months
RT @TXhunyuan: 🚀 Introducing Hunyuan-TurboS – the first ultra-large Hybrid-Transformer-Mamba MoE model! Traditional pure Transformer models…
0
223
0
@_albertgu
Albert Gu
4 months
Llamba: MOHAWK: In the course of this project, we found intriguing mechanistic differences between Transformers and SSMs, to be released next 🤞
0
0
5
@_albertgu
Albert Gu
4 months
an important problem for alternative architectures is figuring out how to leverage the much more established Transformer ecosystem to bootstrap new models. we scaled up Aviv's MOHAWK framework for "architecture distillation" to produce strong Llama -> Mamba (Llamba) models.
@avivbick
Aviv Bick
4 months
🔥 Llama-level performance with <0.1% of the training data 🔥 Together with @cartesia_ai, we introduce Llamba—a family of recurrent language models distilled from Llama-3 into Mamba. ⚡ Sizes: 1B, 3B, 8B. 🚀 Optimized for speed & on-device efficiency. Details here 🧵👇
2
10
82
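For context on what "architecture distillation" means here: rather than pretraining a new architecture from scratch, the student (Mamba) is trained to imitate an existing Transformer teacher (Llama-3), which is how the <0.1% training-data figure becomes possible. The sketch below is a generic two-term distillation objective (hidden-state alignment plus output KL) on toy linear models; MOHAWK's actual staged procedure is described in the paper, and every name and size here is a placeholder.

```python
# Generic distillation sketch (hypothetical toy models, not the MOHAWK pipeline):
# align the student's hidden states to the teacher's, then match output distributions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, vocab, T = 32, 100, 16

teacher = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.Linear(d, vocab))
student = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.Linear(d, vocab))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(T, d)                       # stand-in for token embeddings
with torch.no_grad():
    t_hidden = teacher[0](x)
    t_logits = teacher[1](t_hidden)

for step in range(50):
    s_hidden = student[0](x)
    s_logits = student[1](s_hidden)
    hidden_loss = F.mse_loss(s_hidden, t_hidden)            # hidden-state alignment
    kl_loss = F.kl_div(F.log_softmax(s_logits, dim=-1),
                       F.softmax(t_logits, dim=-1),
                       reduction="batchmean")                # output distillation
    loss = hidden_loss + kl_loss
    opt.zero_grad(); loss.backward(); opt.step()

print("final distillation loss:", float(loss))
```

The intuition is that aligning intermediate representations, not just output distributions, helps a student with a different sequence-mixing mechanism inherit the teacher's behavior.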
@_albertgu
Albert Gu
4 months
Isaac has been interested in a general compression-based theory of intelligence. He explored this on ARC-AGI and got interesting results with a very different approach!
@LiaoIsaac91893
Isaac Liao
4 months
Introducing *ARC‑AGI Without Pretraining* – ❌ No pretraining. ❌ No datasets. Just pure inference-time gradient descent on the target ARC-AGI puzzle itself, solving 20% of the evaluation set. 🧵 1/4
2
13
169
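The unusual part of this approach is that all of the learning happens at inference time, on the single target puzzle: no pretraining corpus, just gradient descent against the puzzle's few demonstration pairs before predicting the test output. The sketch below captures that test-time-fitting loop on a made-up toy puzzle (a left-right mirror rule); it is not the authors' method, architecture, or objective, which as the thread notes is built around compression.

```python
# Generic test-time-fitting sketch (toy task, not the authors' ARC-AGI method):
# with no pretraining, fit a small model by gradient descent on the puzzle's few
# demonstration pairs only, then apply it to the held-out test input.
import torch

torch.manual_seed(0)

# Toy "puzzle": the hidden rule maps a 3x3 grid to its left-right mirror image.
demo_inputs = torch.randint(0, 2, (4, 3, 3)).float()
demo_outputs = torch.flip(demo_inputs, dims=[-1])
test_input = torch.randint(0, 2, (1, 3, 3)).float()

model = torch.nn.Linear(9, 9)                       # deliberately tiny model
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(500):                             # pure inference-time optimization
    pred = model(demo_inputs.flatten(1))
    loss = torch.nn.functional.mse_loss(pred, demo_outputs.flatten(1))
    opt.zero_grad(); loss.backward(); opt.step()

pred_test = model(test_input.flatten(1)).reshape(3, 3)
print("demo fit loss:", float(loss))
print("test prediction (rounded):\n", pred_test.round().int())
# With so few demonstrations the fitted map may not generalize; the quoted
# approach's compression objective is aimed at exactly that problem.
```

With only a handful of demonstrations, a plain regression fit is badly underdetermined, which is the gap a compression-style simplicity objective is meant to close.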
@_albertgu
Albert Gu
4 months
RT @iScienceLuvr: Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners. Distilling Llama-1B and -3B models with only 8 b….
0
91
0
@_albertgu
Albert Gu
5 months
RT @tri_dao: I've been excited about this for a while: a simple architectural change to the residual connection that allows arbitrary overl….
0
66
0