Albert Gu
@_albertgu
Followers: 18K · Following: 2K · Media: 45 · Statuses: 472
assistant prof @mldcmu. chief scientist @cartesia_ai. leading the ssm revolution.
Joined December 2018
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data.
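To make the "dynamic chunking" idea concrete, here is a minimal sketch of the core operation: a learned scorer marks chunk boundaries, and each chunk is pooled into one higher-level vector. The module name, the linear boundary scorer, and the mean-pooling are illustrative assumptions, not the actual H-Net implementation.

    import torch
    import torch.nn as nn

    class ToyDynamicChunker(nn.Module):
        """Illustrative-only chunking layer: predict a boundary probability at
        every byte position, then mean-pool each chunk into one vector that a
        higher-level sequence model can operate on."""

        def __init__(self, d_model: int, threshold: float = 0.5):
            super().__init__()
            # assumed boundary scorer; the real H-Net routing module differs
            self.boundary_scorer = nn.Linear(d_model, 1)
            self.threshold = threshold

        def forward(self, x: torch.Tensor):
            # x: (seq_len, d_model) byte-level embeddings for one sequence
            p = torch.sigmoid(self.boundary_scorer(x)).squeeze(-1)
            is_boundary = p > self.threshold
            is_boundary[0] = True                          # a chunk always starts at position 0
            chunk_id = torch.cumsum(is_boundary.long(), dim=0) - 1
            num_chunks = int(chunk_id.max()) + 1
            # mean-pool every position into its chunk's slot
            pooled = torch.zeros(num_chunks, x.size(-1), device=x.device)
            pooled.index_add_(0, chunk_id, x)
            counts = torch.bincount(chunk_id, minlength=num_chunks).clamp(min=1)
            return pooled / counts.unsqueeze(-1), chunk_id

    x = torch.randn(16, 64)                   # 16 "bytes" embedded in 64 dims
    chunks, assignment = ToyDynamicChunker(64)(x)
    print(chunks.shape, assignment.tolist())  # fewer, higher-level positions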
Many people are confused by Minimax’s recent return to full attention - especially since Minimax was the first to pivot toward hybrid linear attention at large scale - and by Kimi’s later adoption of hybrid linear variants (as well as earlier attempts like Qwen3-Next or Qwen3.5). I actually
so excited to see more SOTA linear models, including some novel technical refinements for improving recurrent models 🚀
Kimi Linear Tech Report has dropped! 🚀 https://t.co/LwNB2sQnzM Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi
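For readers unfamiliar with these linear-attention layers: the common thread is a fixed-size matrix state updated by a (gated) delta rule instead of a growing KV cache. Below is a rough per-token sketch of that recurrence under simplifying assumptions (a scalar forgetting gate, no chunking); it is the naive math only, not the open-sourced KDA kernels themselves.

    import torch

    def gated_delta_rule(q, k, v, beta, gate):
        """Naive reference recurrence for a gated delta-rule linear layer.
        q, k, v: (T, d); beta, gate: (T,) with values in (0, 1).
        The state S is a fixed d x d matrix, independent of sequence length."""
        T, d = q.shape
        S = torch.zeros(d, d)
        outputs = []
        for t in range(T):
            S = gate[t] * S                                        # forget a little of the old state
            S = S + beta[t] * torch.outer(v[t] - S @ k[t], k[t])   # delta-rule write toward v_t
            outputs.append(S @ q[t])                               # read with the query
        return torch.stack(outputs)

    T, d = 8, 4
    y = gated_delta_rule(torch.randn(T, d), torch.randn(T, d), torch.randn(T, d),
                         torch.rand(T), torch.rand(T))
    print(y.shape)  # torch.Size([8, 4])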
immensely proud of the team for our best model yet. grateful to be able to work with such a strong team of researchers who are always curious and willing to explore the untrodden path https://t.co/Grkgy4vaqf
at the tokenizer workshop panel at ICML, i made an offhand joke about eventually going to raw pixels being the way, i didn't press it too hard bc the pitchforks were already out over h-net and i wanted to make it home, but yes still in favor 🙋
@thawani_avijit Haha. I am afraid people interpreted my “delete tokenizer” as “use bytes directly without BPE”; the issue is that you *still* inherit the arbitrariness of the byte encoding even for that! Pixels is the only way. Just like humans. It is written. If GPT-10 uses utf8 at the input I will eat a shoe.
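A quick way to see the "arbitrariness of the byte encoding" point (an illustration, not from the thread): the same rendered text can map to different UTF-8 byte sequences, and how many bytes a character takes is a decision of the encoding, not a property of the language.

    import unicodedata

    s = "café"
    nfc = unicodedata.normalize("NFC", s)   # 'é' as one codepoint
    nfd = unicodedata.normalize("NFD", s)   # 'e' plus a combining accent

    print(nfc.encode("utf-8"))  # b'caf\xc3\xa9'   -> 5 bytes
    print(nfd.encode("utf-8"))  # b'cafe\xcc\x81'  -> 6 bytes, same rendered text

    # bytes per character is an encoding artifact, not a property of the text
    for ch in ["a", "é", "語", "🙋"]:
        print(ch, len(ch.encode("utf-8")), "bytes")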
I really like this research direction! For a long time, I've been talking about the "brain vs. database" analogy of SSMs vs Transformers. An extension of this that I've mentioned offhand a few times is that I think the tradeoffs change when we start thinking about building
SSMs promised efficient language modeling for long context, but so far seem to underperform compared to Transformers in many settings. Our new work suggests that this is not a problem with SSMs, but with how we are currently using them. Arxiv: https://t.co/bCzxawF452 🧵
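To put rough numbers on the "brain vs. database" analogy (illustrative sizes, not from the paper): a Transformer's KV cache stores every past token like a database and grows linearly with context, while an SSM compresses history into a fixed-size state.

    # Back-of-the-envelope memory per sequence, fp16 (2 bytes per element).
    # Layer counts and dimensions below are assumed, generic values.
    def transformer_kv_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, elem=2):
        # keys and values cached for every past token at every layer
        return seq_len * n_layers * n_kv_heads * head_dim * 2 * elem

    def ssm_state_bytes(n_layers=32, d_inner=4096, state_dim=128, elem=2):
        # fixed-size recurrent state, independent of how many tokens were seen
        return n_layers * d_inner * state_dim * elem

    for L in [1_000, 100_000, 1_000_000]:
        kv = transformer_kv_bytes(L) / 1e9
        ssm = ssm_state_bytes() / 1e9
        print(f"{L:>9} tokens: KV cache ~ {kv:7.2f} GB | SSM state ~ {ssm:5.3f} GB")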
the blog post has a lot of good stuff, and also some claims that went counter to our experience. I don't have time to respond to individual points but we've swapped notes with @main_horse and he can pass on whatever he feels like 😁
really glad to see so much effort being put into furthering understanding of H-Nets! the main caveat with these results is the scale: the blog's experiments go up to 1e18 FLOPs, which is around 1000x smaller than the paper's experiments. for some grounding, this roughly
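That tweet is cut off, but for a rough sense of scale (arithmetic using the common C ≈ 6·N·D estimate of training FLOPs, not a continuation of the original numbers): 1e18 FLOPs is roughly a ~100M-parameter model trained on a billion-or-so tokens, while 1000x more compute is the regime where billion-parameter models see hundreds of billions of tokens.

    # Rough grounding with the standard approximation: training FLOPs ~ 6 * params * tokens.
    def tokens_for(flops, params):
        return flops / (6 * params)

    for flops in (1e18, 1e21):
        for params in (1e8, 1e9):
            print(f"{flops:.0e} FLOPs @ {params:.0e} params -> ~{tokens_for(flops, params):.1e} tokens")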
hybrid is the future :)
Qwen3-Next is a hybrid: gated attention (to fix outliers) plus a GatedDeltaNet linear RNN to save on KV cache. all new models will be either sink + SWA hybrids like gpt-oss, or gated attention + linear RNN hybrids (Mamba, GatedDeltaNet, etc.) like Qwen3-Next. the age of pure attention for the time-mixing layer is over,
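The "hybrid" recipe the quoted tweet describes boils down to a layer schedule: mostly linear-RNN (or SWA) time-mixing layers with cheap, fixed-size state, plus occasional full or gated attention layers for global retrieval. The ratio and names below are illustrative assumptions, not any particular model's configuration.

    def hybrid_layer_schedule(n_layers=24, attn_every=4):
        """Toy schedule: a full (gated) attention layer every `attn_every` blocks,
        linear-RNN time mixing (Mamba / GatedDeltaNet style) everywhere else."""
        return [
            "gated_attention" if (i + 1) % attn_every == 0 else "linear_rnn"
            for i in range(n_layers)
        ]

    print(hybrid_layer_schedule(12))
    # ['linear_rnn', 'linear_rnn', 'linear_rnn', 'gated_attention', ...]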
We connect the autoregressive pipeline of LLMs with streaming video perception. Introducing AUSM: Autoregressive Universal Video Segmentation Model. A step toward unified, scalable video perception — inspired by how LLMs unified NLP. 📝
arxiv.org
Recent video foundation models such as SAM2 excel at prompted video segmentation by treating masks as a general-purpose primitive. However, many real-world settings require unprompted segmentation...
Today we're releasing NVIDIA Nemotron Nano v2 - a 9B hybrid SSM that is 6X faster than similarly sized models, while also being more accurate. Along with this model, we are also releasing most of the data we used to create it, including the pretraining corpus. Links to the
Introducing ❄️ @snowglobe_so, the simulation engine for AI chatbots. Magically simulate the behavior of your users to test and improve your chatbots. Find failures before your users do.
discussed in some recent blogs https://t.co/GlINs5ggs0
https://t.co/H3PvecTaWP
a common belief is that Transformers scale well because they have less inductive bias, when in fact they do have specific inductive biases. we developed H-Nets not to fix tokenization, but because I think that dynamic chunking represents a fundamental primitive that captures a bias
A common takeaway from "the bitter lesson" is we don't need to put effort into encoding inductive biases, we just need compute. Nothing could be further from the truth! Better inductive biases mean better scaling exponents, which means exponential improvements with computation.
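To see why "better scaling exponents" compound (toy numbers; the exponents are made up for illustration): if loss follows a power law L(C) = a·C^(-α), even a small improvement in α buys a compute-equivalent advantage that keeps growing with scale.

    # If L(C) = a * C**(-alpha), matching a model with exponent alpha_good at compute C
    # requires the alpha_base model to spend C**(alpha_good/alpha_base) compute.
    def extra_compute_factor(C, alpha_good=0.055, alpha_base=0.050):
        return C ** (alpha_good / alpha_base) / C

    for C in (1e18, 1e21, 1e24):
        print(f"C = {C:.0e}: baseline needs ~{extra_compute_factor(C):.0f}x more compute to match")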
I'm starting to believe that there is some subtle difference between compression (common to all intelligence) and abstraction (unique to human and artificial intelligence). They are definitely related, but different in a fundamental way. This should be our next major quest for AI.
Attention was never enough. The hybrid LLM era is here—and it’s moving fast. From Mamba to Jamba to Bamba, we mapped every major model that’s challenged the Transformer default in the past 18 months. 🧵 A timeline of what’s changed and why it matters ↓ 🔗
Souvla is one of the go-to places for San Francisco AI researchers to get a quick bite. Most of the food is very good there and is even offered on board Delta Airlines’ first/business class. But unfortunately, their frozen yogurt is not good. Many AI researchers instead go to