Rosinality
@rosinality
ML Engineer
Seoul, Korea
Joined October 2008
Our paper "Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction" is out! https://t.co/DUTSj4T6X2 A thread summarizing the key take-aways ⬇️
arxiv.org
Autoregressive models (ARMs) currently constitute the dominant paradigm for large language models (LLMs). Energy-based models (EBMs) represent another class of models, which have historically been...
This work made me read Cosma Shalizi's note on next token prediction again.
[LG] Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction M Blondel, M E. Sander, G Vivier-Ardisson, T Liu... [Google DeepMind] (2025) https://t.co/14JfUMWBRJ
Unifying discrete, Gaussian, and simplex diffusion. The main idea is to copy each letter N times in discrete diffusion, which connects it to Gaussian diffusion via the CLT as N → ∞. Further introducing the concept of reproduction connects it to simplex diffusion.
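A toy sketch of the CLT step, not the paper's construction: assume each of the N copies of a letter is independently kept with probability `keep_prob` and otherwise resampled uniformly from the vocabulary (this corruption rule, and all function names and parameter values, are assumptions made for illustration). The standardized count of copies still showing the original letter has mean 0 and variance 1 by construction, and its distribution approaches a standard normal as N grows.

```python
import math
import random


def noisy_copy_count(token, vocab_size, n_copies, keep_prob, rng):
    """Count how many of n_copies independently corrupted copies show `token`."""
    count = 0
    for _ in range(n_copies):
        if rng.random() < keep_prob:
            observed = token  # copy survives unchanged
        else:
            observed = rng.randrange(vocab_size)  # copy resampled uniformly
        if observed == token:
            count += 1
    return count


def standardized_count(token, vocab_size, n_copies, keep_prob, rng):
    """Center and scale the count using the binomial mean and variance."""
    # Probability that a single copy shows the original token.
    p = keep_prob + (1.0 - keep_prob) / vocab_size
    count = noisy_copy_count(token, vocab_size, n_copies, keep_prob, rng)
    return (count - n_copies * p) / math.sqrt(n_copies * p * (1.0 - p))


def sample_moments(n_copies, n_samples=2000, seed=0):
    """Empirical mean and variance of the standardized count."""
    rng = random.Random(seed)
    zs = [standardized_count(0, 4, n_copies, 0.5, rng) for _ in range(n_samples)]
    mean = sum(zs) / len(zs)
    var = sum((z - mean) ** 2 for z in zs) / len(zs)
    return mean, var


if __name__ == "__main__":
    for n in (5, 50, 500):
        print(n, sample_moments(n))
```

Plotting a histogram of `standardized_count` for increasing `n_copies` would show the discrete count smoothing toward a bell curve, which is the sense in which the copied-letter process links to Gaussian noise.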
Meta RL which connects multiple episodes and does in-context adaptation of the policy.
Sometimes a simple baseline with clip higher just works. But when?
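For context, "clip higher" is usually read as a PPO-style clipped objective whose upper clipping bound is looser than its lower one. A minimal per-token sketch under that reading follows; the function name and the default values of `eps_low` and `eps_high` are illustrative assumptions, not settings taken from the work being discussed.

```python
def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with an asymmetric clipping range.

    ratio: new-policy probability divided by old-policy probability.
    advantage: advantage estimate for the sampled token.
    eps_high > eps_low leaves more room to raise the probability of
    tokens with positive advantage before clipping kicks in.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic (minimum) choice between the raw and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Same ratio and advantage under symmetric vs. asymmetric clipping.
    print(clipped_objective(1.25, 1.0, eps_low=0.2, eps_high=0.2))
    print(clipped_objective(1.25, 1.0, eps_low=0.2, eps_high=0.28))
```

With a ratio of 1.25 and a positive advantage, the symmetric range already clips the term while the wider upper bound does not, so the gradient for that token is preserved.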
Top-1 routing out of 96 experts. Typical load balancing does not work at this degree of sparsity, especially in the lower layers, so they started with a larger top-K and later switched to top-1. But is this better than common settings?
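To make the routing terms concrete, here is a small sketch of top-K expert selection and the per-expert load it produces. It does not reproduce the paper's router, its balancing loss, or its schedule from larger K to top-1; the function names and the random logits are assumptions for illustration only.

```python
import random


def top_k_route(logits, k):
    """Return the indices of the k experts with the largest router logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def expert_load(batch_logits, k):
    """Fraction of all token-to-expert assignments received by each expert."""
    n_experts = len(batch_logits[0])
    counts = [0] * n_experts
    for logits in batch_logits:
        for expert in top_k_route(logits, k):
            counts[expert] += 1
    total = len(batch_logits) * k
    return [c / total for c in counts]


if __name__ == "__main__":
    rng = random.Random(0)
    n_tokens, n_experts = 1024, 96
    batch = [[rng.gauss(0.0, 1.0) for _ in range(n_experts)] for _ in range(n_tokens)]
    for k in (8, 1):
        load = expert_load(batch, k)
        # A perfectly balanced router would give every expert 1 / n_experts.
        print(k, max(load), min(load), 1.0 / n_experts)
```

Comparing the spread of `load` for a larger `k` against `k=1` gives a rough feel for why balancing gets harder as routing becomes sparser: each token contributes fewer assignments, so per-expert counts are noisier.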
Simple vision pretraining by predicting the next-step embedding. The embedding is trained jointly, with stop-gradient applied when it serves as the target.
Scaling up the MAE. Larger block masking, deeper decoder, and more CLS tokens.
I suspect that once a lab reaches a certain point it can reach the frontier directly.
Assertion that compositional generalization may only be possible with generative models. For generative models, z -> x, we can impose structure on the OOD area of z as it is possible to give it simple structures. But it is impossible with non-generative models, x -> z, because it