Reza Bayat

@reza_byt

Followers: 1K · Following: 2K · Media: 64 · Statuses: 578

Student at @Mila_Quebec with @AaronCourville and Pascal Vincent

Joined November 2021
@reza_byt
Reza Bayat
4 months
📄 New Paper Alert! ✨ 🚀 Mixture of Recursions (MoR): Smaller models • Higher accuracy • Greater throughput. Across 135M–1.7B params, MoR carves a new Pareto frontier: equal training FLOPs yet lower perplexity, higher few-shot accuracy, and more than 2× throughput.
4
63
269
@reza_byt
Reza Bayat
6 hours
You can't imagine how unfair life is for people in particular regions of the world. F*CK
0
0
4
@mirrokni
Vahab Mirrokni
4 days
Excited to share our new #NeurIPS2025 paper on Nested Learning (NL) 🧠 Inspired by the brain's multi-time-scale processing, NL allows different components of a neural network to be updated at different frequencies. Read the details in our post 👇 https://t.co/nZVh0PJ0UC 1/4
@GoogleResearch
Google Research
5 days
Introducing Nested Learning: A new ML paradigm for continual learning that views models as nested optimization problems to enhance long context processing. Our proof-of-concept model, Hope, shows improved performance in language modeling. Learn more: https://t.co/fpdDlYaleL
6
19
259
@behrouz_ali
Ali Behrouz
5 days
Excited to announce our work on Nested Learning, which was also recently accepted to NeurIPS 2025! Stay tuned for the full version on arXiv (in the next few days), and then I'll discuss more details about the intuition behind its design and why we believe it can help with continual learning.
23
56
704
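The multi-time-scale idea above can be made concrete with a minimal sketch (assumptions only, not the paper's Hope model or its actual update rule): two placeholder components share one loss, but their optimizers step at different frequencies, so the "slow" part moves on a coarser time scale.

```python
# Illustrative sketch of per-component update frequencies; modules, sizes,
# learning rates, and the schedule K are all made-up placeholders.
import torch
import torch.nn as nn

fast = nn.Linear(32, 32)   # hypothetical "fast" component, stepped every batch
slow = nn.Linear(32, 32)   # hypothetical "slow" component, stepped every K batches
opt_fast = torch.optim.SGD(fast.parameters(), lr=1e-2)
opt_slow = torch.optim.SGD(slow.parameters(), lr=1e-3)
K = 4

for step in range(100):
    x, y = torch.randn(8, 32), torch.randn(8, 32)   # dummy batch
    loss = ((slow(fast(x)) - y) ** 2).mean()
    loss.backward()                                  # grads land on both components
    opt_fast.step(); opt_fast.zero_grad()            # fast weights move every step
    if (step + 1) % K == 0:                          # slow weights accumulate grads
        opt_slow.step(); opt_slow.zero_grad()        # and move on a coarser schedule
```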
@dohmatobelvis
Elvis Dohmatob
6 days
1/n "Less is More" (s1, etc.) vs "More is More", which mantra is correct for the training/fine-tuning large LLMs? In our recent preprint, we reconcile both of these. They correspond to different parts of a complex phase diagram
1
4
14
@mpezeshki91
Mohammad Pezeshki
5 days
We show a phase transition for optimal data curation: for strong models, concentrating on difficult samples drives further improvement (LIMO). In contrast, weaker models benefit from the conventional "More is More" regime, where broad data exposure is essential for learning core capabilities.
0
8
14
@reza_byt
Reza Bayat
6 days
Thank you, @thinkymachines! 🙏
8
1
178
@bose_joey
Joey Bose
7 days
Come do a PhD with me 😀! Promise of fun science and great coffee ☕
@giladturok
Gilad
8 days
I like the way @joeybos lays out his vision for PhD supervision! Seems intense and rewarding.
31
70
734
@reza_byt
Reza Bayat
8 days
Learning math with ChatGPT is a whole different experience; it shifts from learning abstractions to actually applying equations. I'm not sure if this just shows I'm dumber than an AI or if I'm really learning something. If the AI can do things better than…
0
1
7
@reza_byt
Reza Bayat
11 days
Ablations are the failures you’ve had along the way...
@MillionInt
Jerry Tworek
11 days
Ablations are for the weak
0
0
0
@askalphaxiv
alphaXiv
13 days
Introducing personalized arXiv feeds 🚀 Rate 5 papers, get a feed tailored to your research that learns what you care about Find the papers you need in seconds, not hours
5
14
86
@RidgerZhu
Rui-Jie (Ridger) Zhu
13 days
Thrilled to release our new paper: "Scaling Latent Reasoning via Looped Language Models." TL;DR: we scale looped language models to 2.6 billion parameters, pretrained on >7 trillion tokens. The resulting model is on par with SOTA language models 2–3× its size.
20
137
627
@tianyu_zh
Tianyu Zhang
13 days
A looped LLM and a standard CoT LLM differ not only in whether the embeddings are included in the context/KV cache; the training scheme also differs: at each loop, the model is trained with the LM loss, which forces it to re-target the answer iteratively (aka "thinking"). See our paper for a deeper dive!
4
6
96
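A hedged sketch of the per-loop LM loss described in the tweet above (not the paper's implementation): one shared block is applied repeatedly, and the same next-token loss is taken at every loop, so intermediate iterations are already pushed toward the answer. The block, sizes, and loop count are assumed placeholders.

```python
# Minimal looped-LM sketch: a single shared block reused n_loops times, with the
# LM loss applied at every loop. All hyperparameters here are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model, n_loops, seq_len = 1000, 64, 4, 16
embed = nn.Embedding(vocab, d_model)
block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)  # shared weights
head = nn.Linear(d_model, vocab)
causal = torch.full((seq_len, seq_len), float("-inf")).triu(1)  # additive causal mask

tokens = torch.randint(0, vocab, (2, seq_len))   # dummy token ids
h = embed(tokens)
total_loss = 0.0
for _ in range(n_loops):
    h = block(h, src_mask=causal)                # same block applied again and again
    logits = head(h[:, :-1])                     # predict token t+1 from the prefix
    total_loss = total_loss + F.cross_entropy(
        logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1)
    )
total_loss = total_loss / n_loops                # every loop contributes an LM loss
```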
@mpezeshki91
Mohammad Pezeshki
13 days
My prediction is that the next-token prediction loss will not stand the test of time, and the next frontier models will need richer loss functions. In this paper, we take a step towards that, shifting from predicting a single token to predicting a summary of the future.
0
14
31
@reza_byt
Reza Bayat
14 days
This is such a misleading table. The RL cost is for exploration/discovery, but the OnPD cost is basically for transfer. Without spending the RL cost, you wouldn’t have the knowledge to transfer.
0
0
3
@divyat09
Divyat Mahajan
14 days
[1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. 📌 Predict a learned…
10
46
217
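One way to picture the "predict a summary of the future" idea from the FSP thread above, as a hedged sketch rather than the paper's method: alongside the usual next-token loss, an auxiliary head regresses a crude summary of upcoming tokens (here simply the mean of their embeddings over a fixed horizon). The backbone, heads, horizon, and loss weight are all assumptions.

```python
# Sketch of a future-summary auxiliary loss next to the standard LM loss.
# The "summary" here (mean of the next few token embeddings) is a stand-in.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model, horizon = 1000, 64, 8
embed = nn.Embedding(vocab, d_model)
backbone = nn.GRU(d_model, d_model, batch_first=True)    # stand-in for the LM backbone
lm_head = nn.Linear(d_model, vocab)
summary_head = nn.Linear(d_model, d_model)               # predicts a future-summary vector

tokens = torch.randint(0, vocab, (2, 64))                # dummy batch
h, _ = backbone(embed(tokens))

# Standard next-token prediction loss.
lm_loss = F.cross_entropy(
    lm_head(h[:, :-1]).reshape(-1, vocab), tokens[:, 1:].reshape(-1)
)

# Auxiliary target: at position t, the mean embedding of tokens t+1 .. t+horizon.
with torch.no_grad():
    future = embed(tokens).unfold(1, horizon, 1)[:, 1:].mean(-1)  # (B, T-horizon, d)
pred = summary_head(h[:, : future.size(1)])
fsp_loss = F.mse_loss(pred, future)

loss = lm_loss + 0.1 * fsp_loss   # auxiliary weight is arbitrary in this sketch
```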
@reza_byt
Reza Bayat
14 days
ChatGPT killed the em dash in professional writing—and Cursor is killing try/catch in code.
1
0
5
@Mila_Quebec
Mila - Institut québécois d'IA
16 days
Congratulations to @Yoshua_Bengio, founder and scientific advisor of Mila, who has become the first researcher in the world to surpass one million citations on Google Scholar, the leading platform for academic and scientific research. A remarkable milestone that highlights the…
mila.quebec
Yoshua Bengio, the most-cited researcher in the world has become the first living scientist to surpass one million citations on Google Scholar.
10
56
410
@LawZero_
LawZero - LoiZéro
16 days
Our Founder and Scientific Director @Yoshua_Bengio has become the first living researcher to surpass 1 million citations on Google Scholar, a testament to the foundational and global impact of his work. Congratulations Yoshua!
6
24
154
@reza_byt
Reza Bayat
18 days
1M citations! 🤯 Congratulations, @Yoshua_Bengio! Such an honor to have learned so much from you.
16
88
1K
@mrdanieldsouza
Daniel D'souza 
19 days
There are so many cheap optimizations 🪄 in the synthetic data space that can drive real downstream gains 📈 in multilingual models 🗺️ The talented @DavidM4302 🌟 explores this very topic in his Research Scholar Program at @Cohere_Labs 💙 🔥 Check out his work! 👇
@DavidM4302
David
20 days
Can we synthetically generate data that truly captures a language’s richness instead of just translating English datasets? That’s the focus of our most recent work on prompt space optimization for multilingual synthetic data generation: The Art of Asking 🗣️
1
4
17