Albert Gu Profile
Albert Gu

@_albertgu

Followers
18K
Following
2K
Media
45
Statuses
472

assistant prof @mldcmu. chief scientist @cartesia_ai. leading the ssm revolution.

Joined December 2018
@_albertgu
Albert Gu
4 months
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
@sukjun_hwang
Sukjun (June) Hwang
4 months
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
61
197
1K
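A minimal sketch of the dynamic chunking idea described in the two tweets above. The similarity-based boundary score loosely follows the H-Net routing intuition (adjacent byte-level states that look dissimilar are likely chunk boundaries), but the mean-pooling downsampler, the 0.5 threshold, and every name below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of dynamic chunking: score each byte position by how
# dissimilar it is from its predecessor, start a new chunk when the score
# crosses a threshold, and pool each chunk into one higher-level vector.
import torch
import torch.nn.functional as F

def dynamic_chunk(hidden, w_q, w_k, threshold=0.5):
    """hidden: (seq_len, d_model) byte-level encoder states; w_q, w_k: (d_model, d_model)."""
    q = hidden @ w_q                                   # query projection
    k = hidden @ w_k                                   # key projection
    # Cosine similarity between each position and the one before it.
    sim = F.cosine_similarity(q[1:], k[:-1], dim=-1)   # (seq_len - 1,)
    # Dissimilar neighbours -> high boundary probability.
    p_boundary = 0.5 * (1.0 - sim)
    # Position 0 always opens a chunk.
    is_boundary = torch.cat([torch.ones(1, dtype=torch.bool),
                             p_boundary > threshold])
    # chunk_id[t] = index of the chunk that position t belongs to.
    chunk_id = torch.cumsum(is_boundary.long(), dim=0) - 1
    num_chunks = int(chunk_id[-1]) + 1
    # Mean-pool each chunk (the real model selects/compresses differently).
    sums = torch.zeros(num_chunks, hidden.shape[-1]).index_add_(0, chunk_id, hidden)
    counts = torch.zeros(num_chunks).index_add_(0, chunk_id, torch.ones(hidden.shape[0]))
    return sums / counts.unsqueeze(-1), is_boundary

# usage: chunks, mask = dynamic_chunk(byte_encoder_states, w_q, w_k)
```

The higher-level network then operates on `chunks`, and everything (including the boundary scores) can be trained end to end, which is the "tokenizer-free" part of the claim.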
@SonglinYang4
Songlin Yang
6 days
Many people are confused by Minimax’s recent return to full attention - especially since it was the first large-scale pivot toward hybrid linear attention - and by Kimi’s later adoption of hybrid linear variants (as well as earlier attempts by Qwen3-Next, or Qwen3.5). I actually
12
60
501
@_albertgu
Albert Gu
6 days
so excited to see more SOTA linear models, including some novel technical refinements for improving recurrent models 🚀
@Kimi_Moonshot
Kimi.ai
6 days
Kimi Linear Tech Report has dropped! 🚀 https://t.co/LwNB2sQnzM Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi
1
11
127
@_albertgu
Albert Gu
8 days
immensely proud of the team for our best model yet. grateful to be able to work with such a strong team of researchers who are always curious and willing to explore the untrodden path https://t.co/Grkgy4vaqf
15
19
326
@_albertgu
Albert Gu
12 days
at the tokenizer workshop panel at ICML, i made an offhand joke about eventually going to raw pixels being the way, i didn't press it too hard bc the pitchforks were already out over h-net and i wanted to make it home, but yes still in favor 🙋
@karpathy
Andrej Karpathy
15 days
@thawani_avijit Haha. I am afraid people interpreted my “delete tokenizer” as “use bytes directly without BPE”, the issue is you *still* need bytes encoding arbitrariness even for that! Pixels is the only way. Just like humans. It is written. If GPT-10 uses utf8 at the input I will eat a shoe.
8
13
289
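A quick illustration (mine, not from either tweet) of the "bytes encoding arbitrariness" point Karpathy is making: even if you skip BPE and feed raw bytes, the byte sequence for the same rendered text still depends on arbitrary conventions like Unicode normalization and the chosen encoding.

```python
# The same visible string "café" maps to different raw byte sequences depending
# on normalization form and encoding, so "just use bytes" still bakes in a choice.
import unicodedata

text = "café"
nfc = unicodedata.normalize("NFC", text)   # 'é' as one precomposed code point
nfd = unicodedata.normalize("NFD", text)   # 'e' followed by a combining accent

print(nfc.encode("utf-8"))      # b'caf\xc3\xa9'   (5 bytes)
print(nfd.encode("utf-8"))      # b'cafe\xcc\x81'  (6 bytes)
print(nfc.encode("utf-16-le"))  # a different byte sequence again for the same text
```

Pixels sidestep this because the rendered glyphs, not the encoding convention, are the input.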
@_albertgu
Albert Gu
16 days
the intuition and analogies:
1
0
17
@_albertgu
Albert Gu
16 days
I really like this research direction! For a long time, I've been talking about the "brain vs. database" analogy of SSMs vs Transformers. An extension of this that I've mentioned offhand a few times is that I think the tradeoffs change when we start thinking about building
@EranMalach
Eran Malach
19 days
SSMs promised efficient language modeling for long context, but so far seem to underperform compared to Transformers in many settings. Our new work suggests that this is not a problem with SSMs, but with how we are currently using them. Arxiv: https://t.co/bCzxawF452 🧵
11
63
550
@_albertgu
Albert Gu
2 months
the blog post has a lot of good stuff, and also some claims that went counter to our experience. I don't have time to respond to individual points but we've swapped notes with @main_horse and he can pass on whatever he feels like 😁
1
0
39
@_albertgu
Albert Gu
2 months
really glad to see so much effort being put into furthering understanding of H-Nets! the main caveat with these results is the scale: the blog's experiments go up to 1e18 FLOPs, which is around 1000x smaller than the paper's experiments. for some grounding, this roughly
@main_horse
main
2 months
I hope to receive pushback on today's claim.
8
11
251
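A rough back-of-envelope for the scale gap mentioned above, using the common C ≈ 6·N·D approximation (C = training FLOPs, N = parameters, D = tokens). The 1e21 figure and the 100M-parameter model size are my illustrative assumptions for a ~1000x larger budget, not numbers from the blog post or the paper.

```python
# Translate a FLOP budget into a token budget for a fixed model size,
# assuming training cost C ~= 6 * N * D.
def tokens_for_budget(flops, n_params):
    return flops / (6 * n_params)

blog_budget  = 1e18   # upper end of the blog's experiments
paper_budget = 1e21   # ~1000x larger, the rough scale implied above

# For a hypothetical 100M-parameter model:
print(f"{tokens_for_budget(blog_budget, 100e6):.2e} tokens at 1e18 FLOPs")   # ~1.7e9
print(f"{tokens_for_budget(paper_budget, 100e6):.2e} tokens at 1e21 FLOPs")  # ~1.7e12
```

In other words, the two budgets sit in qualitatively different regimes (roughly billions vs trillions of tokens at that size), which is why conclusions drawn at the smaller scale may not transfer.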
@SonglinYang4
Songlin Yang
2 months
hybrid is the future:)
@shxf0072
Joey (e/λ)
2 months
Qwen3-Next is a hybrid: GatedAttention (to fix outliers) + GatedDeltaNet RNN for KV-cache savings. all new models will be either sink+SWA hybrids like gpt-oss, or gated attn + linear RNN hybrids (Mamba, gated DeltaNet, etc.) like Qwen3-Next. the age of pure attention for the time-mixing layer is over,
4
51
524
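A minimal sketch of the hybrid pattern described above: mostly linear-recurrent (SSM / gated-DeltaNet-style) time-mixing layers, with full attention interleaved every few blocks. The 1-in-4 ratio, the placeholder mixers, and all names are my illustrative assumptions, not the recipe of any model mentioned in the thread.

```python
import torch
import torch.nn as nn

class SelfAttentionMixer(nn.Module):
    """Standard softmax self-attention (causal masking omitted for brevity)."""
    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

class LinearRecurrentMixer(nn.Module):
    """Stand-in for an SSM / gated-DeltaNet layer: a gated causal depthwise conv
    with constant memory per step (a placeholder, not a real SSM)."""
    def __init__(self, d_model, kernel_size=4):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size - 1, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)
    def forward(self, x):                       # x: (batch, seq, d_model)
        h = self.conv(x.transpose(1, 2))[..., :x.shape[1]].transpose(1, 2)
        return h * torch.sigmoid(self.gate(x))

class HybridStack(nn.Module):
    """Interleave one full-attention block among every `full_attn_every` blocks."""
    def __init__(self, d_model, n_layers=8, full_attn_every=4):
        super().__init__()
        self.blocks = nn.ModuleList()
        for i in range(n_layers):
            mixer = (SelfAttentionMixer(d_model) if (i + 1) % full_attn_every == 0
                     else LinearRecurrentMixer(d_model))
            self.blocks.append(nn.Sequential(nn.LayerNorm(d_model), mixer))
    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)                    # pre-norm residual
        return x

# usage: y = HybridStack(d_model=256)(torch.randn(2, 128, 256))
```

The design point is that the few full-attention layers keep exact global recall while the linear layers keep per-token cost and KV-cache size roughly constant in sequence length.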
@miran_heo
Miran Heo
2 months
We connect the autoregressive pipeline of LLMs with streaming video perception. Introducing AUSM: Autoregressive Universal Video Segmentation Model. A step toward unified, scalable video perception — inspired by how LLMs unified NLP. 📝
arxiv.org
Recent video foundation models such as SAM2 excel at prompted video segmentation by treating masks as a general-purpose primitive. However, many real-world settings require unprompted segmentation...
2
28
141
@ctnzr
Bryan Catanzaro
3 months
Today we're releasing NVIDIA Nemotron Nano v2 - a 9B hybrid SSM that is 6X faster than similarly sized models, while also being more accurate. Along with this model, we are also releasing most of the data we used to create it, including the pretraining corpus. Links to the
38
240
1K
@ShreyaR
shreya rajpal
3 months
Introducing ❄️ @snowglobe_so, the simulation engine for AI chatbots. Magically simulate the behavior of your users to test and improve your chatbots. Find failures before your users do.
117
93
1K
@_albertgu
Albert Gu
3 months
1
3
58
@_albertgu
Albert Gu
3 months
a common belief is that Transformers scale well because of less inductive bias, when they actually do have specific inductive biases. we developed H-Nets not to fix tokenization, but because I think that dynamic chunking represents a fundamental primitive that captures a bias
@andrewgwils
Andrew Gordon Wilson
3 months
A common takeaway from "the bitter lesson" is we don't need to put effort into encoding inductive biases, we just need compute. Nothing could be further from the truth! Better inductive biases mean better scaling exponents, which means exponential improvements with computation.
9
44
641
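A small numeric illustration of the scaling-exponent point above: with a power-law fit L(C) = a·C^(−α), a modestly better exponent widens the gap multiplicatively as compute grows. The constants below are made up for illustration, not fits from any paper in this thread.

```python
# Compare two hypothetical scaling exponents across growing compute budgets.
def loss(compute, a=10.0, alpha=0.05):
    return a * compute ** (-alpha)

for c in (1e18, 1e21, 1e24):
    base   = loss(c, alpha=0.050)   # weaker inductive bias
    better = loss(c, alpha=0.055)   # 10% better exponent
    print(f"C={c:.0e}: loss {base:.3f} vs {better:.3f} "
          f"(advantage x{base / better:.2f})")
```

The advantage of the better exponent keeps growing with compute (roughly like C^0.005 here), which is the sense in which a better inductive bias compounds rather than washes out.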
@YiMaTweets
Yi Ma
3 months
I'm starting to believe that there is some subtle difference between compression (common to all intelligence) and abstraction (unique to human intelligence). They are definitely related, but different in a fundamental way. This shall be our next major quest for AI.
38
54
451
@main_horse
main
3 months
@peterwildeford this wouldn't happen if gpt-5 were a h-net
2
1
18
@AI21Labs
AI21 Labs
3 months
Attention was never enough. The hybrid LLM era is here—and it’s moving fast. From Mamba to Jamba to Bamba, we mapped every major model that’s challenged the Transformer default in the past 18 months. 🧵 A timeline of what’s changed and why it matters ↓ 🔗
12
99
474
@SemiAnalysis_
SemiAnalysis
3 months
Souvla is one of the go-to places for San Francisco AI researchers to get a quick bite. Most of the food is very good there and is even offered on board Delta Airlines’ first/business class. But unfortunately, their frozen yogurt is not good. Many AI researchers instead go to
36
6
400