Reza Bayat
@reza_byt
Followers
1K
Following
2K
Media
64
Statuses
578
Student at @Mila_Quebec with @AaronCourville and Pascal Vincent
Joined November 2021
📄 New Paper Alert! ✨ 🚀 Mixture of Recursions (MoR): Smaller models • Higher accuracy • Greater throughput. Across 135M–1.7B params, MoR carves out a new Pareto frontier: equal training FLOPs yet lower perplexity, higher few-shot accuracy, and more than 2x throughput.
4
63
269
You can't imagine how unfair life is for people in particular regions of the world. F*CK
0
0
4
Excited to share our new #NeurIPS2025 paper on Nested Learning (NL) 🧠 Inspired by the brain's multi-time-scale processing, NL allows different components of a neural network to be updated at different frequencies. Read the details in our post 👇 https://t.co/nZVh0PJ0UC 1/4
research.google
Introducing Nested Learning: A new ML paradigm for continual learning that views models as nested optimization problems to enhance long context processing. Our proof-of-concept model, Hope, shows improved performance in language modeling. Learn more: https://t.co/fpdDlYaleL
6
19
259
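A minimal sketch of the multi-frequency update idea from the Nested Learning tweet above; this is not the paper's method, just an illustration in which two assumed parameter groups (a "slow" backbone and a "fast" head) are updated at different intervals:

```python
# Hedged sketch: two components of one network updated at different frequencies.
# The backbone/head split and the interval SLOW_EVERY are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
backbone = nn.Linear(16, 16)   # hypothetical "slow" component
head = nn.Linear(16, 4)        # hypothetical "fast" component

opt_slow = torch.optim.SGD(backbone.parameters(), lr=1e-2)
opt_fast = torch.optim.SGD(head.parameters(), lr=1e-2)
SLOW_EVERY = 4                 # slow component steps once every 4 batches

for step in range(1, 17):
    x = torch.randn(8, 16)
    y = torch.randint(0, 4, (8,))
    loss = nn.functional.cross_entropy(head(backbone(x)), y)
    loss.backward()            # gradients accumulate on both groups

    opt_fast.step()            # fast component: updated every step
    opt_fast.zero_grad()
    if step % SLOW_EVERY == 0: # slow component: updated at a lower frequency
        opt_slow.step()
        opt_slow.zero_grad()
```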
Excited to announce our work on Nested Learning, which was also recently accepted to NeurIPS 2025! Stay tuned for the full version on arXiv (in the next few days), and then I'll discuss more details about the intuition behind its design and why we believe it can help with continual learning.
Introducing Nested Learning: A new ML paradigm for continual learning that views models as nested optimization problems to enhance long context processing. Our proof-of-concept model, Hope, shows improved performance in language modeling. Learn more: https://t.co/fpdDlYaleL
23
56
704
1/n "Less is More" (s1, etc.) vs "More is More", which mantra is correct for the training/fine-tuning large LLMs? In our recent preprint, we reconcile both of these. They correspond to different parts of a complex phase diagram
1
4
14
We show a phase transition for optimal data curation: for strong models, concentrating on difficult samples drives further improvement (LIMO); in contrast, weaker models benefit from the conventional "More is More," where broad data exposure is essential for learning core capabilities.
1/n "Less is More" (s1, etc.) vs "More is More", which mantra is correct for the training/fine-tuning large LLMs? In our recent preprint, we reconcile both of these. They correspond to different parts of a complex phase diagram
0
8
14
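A minimal sketch of the two curation regimes from the thread above; the difficulty proxy (a per-example loss) and the quantile cutoff are assumptions for illustration, not the preprint's actual recipe:

```python
# Hedged sketch: a strong model keeps only the hardest slice of data,
# a weak model keeps broad coverage.
import random

random.seed(0)

def curate(dataset, loss_fn, model_is_strong, hard_quantile=0.8):
    scored = sorted(dataset, key=loss_fn)      # higher loss = harder example (assumed proxy)
    if model_is_strong:
        # "Less is More": concentrate on the hardest examples (LIMO-style)
        return scored[int(len(scored) * hard_quantile):]
    # "More is More": broad exposure so a weak model learns core capabilities
    return scored

# Toy usage with a fake per-example loss
data = list(range(100))
fake_loss = lambda ex: ex + random.random()
print(len(curate(data, fake_loss, model_is_strong=True)))    # 20 hardest examples
print(len(curate(data, fake_loss, model_is_strong=False)))   # all 100 examples
```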
Come do a PhD with me 😀! Promise of fun science and great coffee ☕
31
70
734
Learning math with ChatGPT is a whole different experience; it moves from learning abstraction to actually learning the application of equations, etc. I’m not sure if this just shows I’m dumber than an AI or if I’m really learning something. If the AI can do things better than…
0
1
7
Introducing personalized arXiv feeds 🚀 Rate 5 papers, get a feed tailored to your research that learns what you care about Find the papers you need in seconds, not hours
5
14
86
Thrilled to release our new paper: “Scaling Latent Reasoning via Looped Language Models.” TL;DR: We scale looped language models to 2.6 billion parameters, pretrained on >7 trillion tokens. The resulting model is on par with SOTA language models 2–3x its size.
20
137
627
A looped LLM and a standard CoT LLM differ not only in whether the embeddings are included in the context/KV cache; the training scheme also differs: at each loop, the model is trained with the LM loss, which forces it to re-target the answer iteratively (aka "thinking"). See our paper for a deeper dive!
Thrilled to release our new paper: “Scaling Latent Reasoning via Looped Language Models.” TL;DR: We scale looped language models to 2.6 billion parameters, pretrained on >7 trillion tokens. The resulting model is on par with SOTA language models 2–3x its size.
4
6
96
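A minimal sketch of the looped training scheme described in the reply above, under the assumption that one shared block is re-applied to the hidden states and the LM loss is computed after every loop; sizes, loop count, and loss averaging are arbitrary choices, not the released model's configuration:

```python
# Hedged sketch: a shared-weight block applied for several loops, with a
# next-token loss after each loop so every pass re-targets the answer.
import torch
import torch.nn as nn

vocab, d_model, n_loops = 100, 32, 3
embed = nn.Embedding(vocab, d_model)
block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)  # reused every loop
lm_head = nn.Linear(d_model, vocab)

tokens = torch.randint(0, vocab, (2, 16))            # toy batch of token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]
causal = nn.Transformer.generate_square_subsequent_mask(inputs.size(1))

h = embed(inputs)
total_loss = 0.0
for _ in range(n_loops):
    h = block(h, src_mask=causal)                    # same parameters at every loop
    logits = lm_head(h)
    total_loss = total_loss + nn.functional.cross_entropy(
        logits.reshape(-1, vocab), targets.reshape(-1)
    )
(total_loss / n_loops).backward()
```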
My prediction is that next-token prediction loss will not stand the test of time, and the next frontier models will need richer loss functions. In this paper, we take a step towards that, shifting from predicting a single token to predicting a summary of the future.
[1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. 📌Predict a learned…
0
14
31
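A minimal sketch of the future-summary idea from the tweets above; the summary target (a mean embedding of the remaining tokens), the single prediction position, and the loss weighting are assumptions, not the paper's exact FSP objective:

```python
# Hedged sketch: next-token loss plus an auxiliary loss that regresses a
# pooled "summary of the future" embedding.
import torch
import torch.nn as nn

vocab, d_model = 100, 32
embed = nn.Embedding(vocab, d_model)
backbone = nn.GRU(d_model, d_model, batch_first=True)   # stand-in for a transformer
lm_head = nn.Linear(d_model, vocab)
summary_head = nn.Linear(d_model, d_model)               # predicts the future summary

tokens = torch.randint(0, vocab, (2, 16))
inputs, targets = tokens[:, :-1], tokens[:, 1:]
h, _ = backbone(embed(inputs))                           # (batch, seq-1, d_model)

# Standard next-token prediction loss
ntp_loss = nn.functional.cross_entropy(
    lm_head(h).reshape(-1, vocab), targets.reshape(-1)
)

# Future-summary loss at one position t: predict a pooled embedding of tokens > t
t = 8
future_summary = embed(tokens[:, t + 1:]).mean(dim=1).detach()  # assumed target
fsp_loss = nn.functional.mse_loss(summary_head(h[:, t]), future_summary)

loss = ntp_loss + 0.1 * fsp_loss                         # assumed weighting
loss.backward()
```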
This is such a misleading table. The RL cost is for exploration/discovery, but the OnPD cost is basically for transfer. Without spending the RL cost, you wouldn’t have the knowledge to transfer.
0
0
3
[1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. 📌Predict a learned…
10
46
217
ChatGPT killed the em dash in professional writing—and Cursor is killing try/catch in code.
1
0
5
Congratulations to @Yoshua_Bengio, founder and scientific advisor of Mila, who has become the first researcher in the world to surpass one million citations on Google Scholar, the leading platform for academic and scientific research. A remarkable milestone that highlights the…
mila.quebec
Yoshua Bengio, the most-cited researcher in the world has become the first living scientist to surpass one million citations on Google Scholar.
Our Founder and Scientific Director @Yoshua_Bengio has become the first living researcher to surpass 1 million citations on Google Scholar, a testament to the foundational and global impact of his work. Congratulations Yoshua!
10
56
410
Our Founder and Scientific Director @Yoshua_Bengio has become the first living researcher to surpass 1 million citations on Google Scholar, a testament to the foundational and global impact of his work. Congratulations Yoshua!
6
24
154
1M citations! 🤯 Congratulations, @Yoshua_Bengio! Such an honor to have learned so much from you.
16
88
1K
There are so many cheap optimizations 🪄 in the synthetic data space that can drive real downstream gains 📈 in multilingual models 🗺️ The talented @DavidM4302 🌟 explores this very topic in his Research Scholar Program @Cohere_Labs 💙 🔥Check out his work! 👇
Can we synthetically generate data that truly captures a language’s richness instead of just translating English datasets? That’s the focus of our most recent work on prompt space optimization for multilingual synthetic data generation: The Art of Asking 🗣️
1
4
17
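A minimal sketch of the contrast described above, using a fully hypothetical prompt space: translating a fixed English prompt set versus sampling prompts from templates and topics defined natively for the target language:

```python
# Hedged sketch: translation-only baseline vs. native prompt-space sampling.
# The templates, topics, and translate() placeholder are made up for illustration.
import random

random.seed(0)

english_prompts = ["Explain the rules of baseball.", "Write a story about Thanksgiving."]

prompt_space = {  # hypothetical per-language templates + culturally grounded topics
    "sw": {
        "templates": ["Eleza {topic} kwa undani.", "Andika hadithi kuhusu {topic}."],
        "topics": ["mchezo wa bao", "sikukuu ya Eid"],
    },
}

def translate(prompt, lang):
    return f"[{lang}] {prompt}"  # placeholder for a machine-translation system

def sample_native(lang, n=2):
    space = prompt_space[lang]
    return [
        random.choice(space["templates"]).format(topic=random.choice(space["topics"]))
        for _ in range(n)
    ]

print([translate(p, "sw") for p in english_prompts])  # translation-only baseline
print(sample_native("sw"))                            # prompts sampled in the language's own space
```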