Reza Bayat
@reza_byt
Followers
1K
Following
2K
Media
64
Statuses
578
Student at @Mila_Quebec with @AaronCourville and Pascal Vincent
Joined November 2021
📄 New Paper Alert! ✨ 🚀 Mixture of Recursions (MoR): Smaller models • Higher accuracy • Greater throughput. Across 135M–1.7B params, MoR carves out a new Pareto frontier: equal training FLOPs yet lower perplexity, higher few-shot accuracy, and more than 2x throughput.
4
63
269
You can't imagine how unfair life is for people in particular regions of the world. F*CK
0
0
4
Excited to share our new #NeurIPS2025 paper on Nested Learning (NL) 🧠 Inspired by the brain's multi-time-scale processing, NL allows different components of a neural network to be updated at different frequencies. Read the details in our post 👇 https://t.co/nZVh0PJ0UC 1/4
research.google
Introducing Nested Learning: A new ML paradigm for continual learning that views models as nested optimization problems to enhance long context processing. Our proof-of-concept model, Hope, shows improved performance in language modeling. Learn more: https://t.co/fpdDlYaleL
6
19
259
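A minimal sketch of the multi-frequency update idea from the Nested Learning tweet above; this is not the paper's method, just an illustration in which two assumed parameter groups (a "slow" backbone and a "fast" head) are updated at different intervals:

```python
# Hedged sketch: two components of one network updated at different frequencies.
# The backbone/head split and the interval SLOW_EVERY are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
backbone = nn.Linear(16, 16)   # hypothetical "slow" component
head = nn.Linear(16, 4)        # hypothetical "fast" component

opt_slow = torch.optim.SGD(backbone.parameters(), lr=1e-2)
opt_fast = torch.optim.SGD(head.parameters(), lr=1e-2)
SLOW_EVERY = 4                 # slow component steps once every 4 batches

for step in range(1, 17):
    x = torch.randn(8, 16)
    y = torch.randint(0, 4, (8,))
    loss = nn.functional.cross_entropy(head(backbone(x)), y)
    loss.backward()            # gradients accumulate on both groups

    opt_fast.step()            # fast component: updated every step
    opt_fast.zero_grad()
    if step % SLOW_EVERY == 0: # slow component: updated at a lower frequency
        opt_slow.step()
        opt_slow.zero_grad()
```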
Excited to announce our work on Nested Learning, which was also recently accepted to NeurIPS 2025! Stay tuned for the full version on arXiv (in the next few days), and then I'll discuss more details about the intuition behind its design and why we believe it can help with continual learning.
Introducing Nested Learning: A new ML paradigm for continual learning that views models as nested optimization problems to enhance long context processing. Our proof-of-concept model, Hope, shows improved performance in language modeling. Learn more: https://t.co/fpdDlYaleL
23
56
704
1/n "Less is More" (s1, etc.) vs "More is More", which mantra is correct for the training/fine-tuning large LLMs? In our recent preprint, we reconcile both of these. They correspond to different parts of a complex phase diagram
1
4
14
We show a phase transition for optimal data curation: for strong models, concentrating on difficult samples drives further improvement (LIMO); in contrast, weaker models benefit from the conventional "More is More," where broad data exposure is essential for learning core capabilities.
1/n "Less is More" (s1, etc.) vs "More is More", which mantra is correct for the training/fine-tuning large LLMs? In our recent preprint, we reconcile both of these. They correspond to different parts of a complex phase diagram
0
8
14
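A minimal sketch of the two curation regimes from the thread above; the difficulty proxy (a per-example loss) and the quantile cutoff are assumptions for illustration, not the preprint's actual recipe:

```python
# Hedged sketch: a strong model keeps only the hardest slice of data,
# a weak model keeps broad coverage.
import random

random.seed(0)

def curate(dataset, loss_fn, model_is_strong, hard_quantile=0.8):
    scored = sorted(dataset, key=loss_fn)      # higher loss = harder example (assumed proxy)
    if model_is_strong:
        # "Less is More": concentrate on the hardest examples (LIMO-style)
        return scored[int(len(scored) * hard_quantile):]
    # "More is More": broad exposure so a weak model learns core capabilities
    return scored

# Toy usage with a fake per-example loss
data = list(range(100))
fake_loss = lambda ex: ex + random.random()
print(len(curate(data, fake_loss, model_is_strong=True)))    # 20 hardest examples
print(len(curate(data, fake_loss, model_is_strong=False)))   # all 100 examples
```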
Come do a PhD with me 😀! Promise of fun science and great coffee ☕
31
70
734
Learning math with ChatGPT is a whole different experience; it moves from learning abstraction to actually learning the application of equations, etc. I’m not sure if this just shows I’m dumber than an AI or if I’m really learning something. If the AI can do things better than…
0
1
7
Introducing personalized arXiv feeds 🚀 Rate 5 papers, get a feed tailored to your research that learns what you care about Find the papers you need in seconds, not hours
5
14
86
Thrilled to release our new paper: “Scaling Latent Reasoning via Looped Language Models.” TL;DR: We scale looped language models to 2.6 billion parameters, pretrained on >7 trillion tokens. The resulting model is on par with SOTA language models 2–3x its size.
20
137
627
A looped LLM and a standard CoT LLM differ not only in whether the embeddings are included in the context/KV cache; the training scheme also differs: at each loop, the model is trained with the LM loss, which forces it to re-target the answer iteratively (aka "thinking"). See our paper for a deeper dive!
Thrilled to release our new paper: “Scaling Latent Reasoning via Looped Language Models.” TL;DR: We scale looped language models to 2.6 billion parameters, pretrained on >7 trillion tokens. The resulting model is on par with SOTA language models 2–3x its size.
4
6
96
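A minimal sketch of the looped training scheme described in the reply above, under the assumption that one shared block is re-applied to the hidden states and the LM loss is computed after every loop; sizes, loop count, and loss averaging are arbitrary choices, not the released model's configuration:

```python
# Hedged sketch: a shared-weight block applied for several loops, with a
# next-token loss after each loop so every pass re-targets the answer.
import torch
import torch.nn as nn

vocab, d_model, n_loops = 100, 32, 3
embed = nn.Embedding(vocab, d_model)
block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)  # reused every loop
lm_head = nn.Linear(d_model, vocab)

tokens = torch.randint(0, vocab, (2, 16))            # toy batch of token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]
causal = nn.Transformer.generate_square_subsequent_mask(inputs.size(1))

h = embed(inputs)
total_loss = 0.0
for _ in range(n_loops):
    h = block(h, src_mask=causal)                    # same parameters at every loop
    logits = lm_head(h)
    total_loss = total_loss + nn.functional.cross_entropy(
        logits.reshape(-1, vocab), targets.reshape(-1)
    )
(total_loss / n_loops).backward()
```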
My prediction is that next-token prediction loss will not stand the test of time, and the next frontier models will need richer loss functions. In this paper, we take a step towards that, shifting from predicting a single token to predicting a summary of the future.
[1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. 📌Predict a learned…
0
14
31
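A minimal sketch of the future-summary idea from the tweets above; the summary target (a mean embedding of the remaining tokens), the single prediction position, and the loss weighting are assumptions, not the paper's exact FSP objective:

```python
# Hedged sketch: next-token loss plus an auxiliary loss that regresses a
# pooled "summary of the future" embedding.
import torch
import torch.nn as nn

vocab, d_model = 100, 32
embed = nn.Embedding(vocab, d_model)
backbone = nn.GRU(d_model, d_model, batch_first=True)   # stand-in for a transformer
lm_head = nn.Linear(d_model, vocab)
summary_head = nn.Linear(d_model, d_model)               # predicts the future summary

tokens = torch.randint(0, vocab, (2, 16))
inputs, targets = tokens[:, :-1], tokens[:, 1:]
h, _ = backbone(embed(inputs))                           # (batch, seq-1, d_model)

# Standard next-token prediction loss
ntp_loss = nn.functional.cross_entropy(
    lm_head(h).reshape(-1, vocab), targets.reshape(-1)
)

# Future-summary loss at one position t: predict a pooled embedding of tokens > t
t = 8
future_summary = embed(tokens[:, t + 1:]).mean(dim=1).detach()  # assumed target
fsp_loss = nn.functional.mse_loss(summary_head(h[:, t]), future_summary)

loss = ntp_loss + 0.1 * fsp_loss                         # assumed weighting
loss.backward()
```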
This is such a misleading table. The RL cost is for exploration/discovery, but the OnPD cost is basically for transfer. Without spending the RL cost, you wouldn’t have the knowledge to transfer.
0
0
3
[1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. 📌Predict a learned…
10
46
217
ChatGPT killed the em dash in professional writing—and Cursor is killing try/catch in code.
1
0
5
Congratulations to @Yoshua_Bengio, founder and scientific advisor of Mila, who has become the first researcher in the world to surpass one million citations on Google Scholar, the leading platform for academic and scientific research. A remarkable milestone that highlights the…
mila.quebec
Yoshua Bengio, the most-cited researcher in the world has become the first living scientist to surpass one million citations on Google Scholar.
Our Founder and Scientific Director @Yoshua_Bengio has become the first living researcher to surpass 1 million citations on Google Scholar, a testament to the foundational and global impact of his work. Congratulations Yoshua!
10
56
410
Our Founder and Scientific Director @Yoshua_Bengio has become the first living researcher to surpass 1 million citations on Google Scholar, a testament to the foundational and global impact of his work. Congratulations Yoshua!
6
24
154
1M citations! 🤯 Congratulations, @Yoshua_Bengio! Such an honor to have learned so much from you.
16
88
1K
There are so many cheap optimizations 🪄 in the synthetic data space that can drive real downstream gains 📈 in multilingual models 🗺️ The talented @DavidM4302 🌟 explores this very topic in his Research Scholar Program @Cohere_Labs 💙 🔥Check out his work! 👇
Can we synthetically generate data that truly captures a language’s richness instead of just translating English datasets? That’s the focus of our most recent work on prompt space optimization for multilingual synthetic data generation: The Art of Asking 🗣️
1
4
17
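A minimal sketch of the contrast described above, using a fully hypothetical prompt space: translating a fixed English prompt set versus sampling prompts from templates and topics defined natively for the target language:

```python
# Hedged sketch: translation-only baseline vs. native prompt-space sampling.
# The templates, topics, and translate() placeholder are made up for illustration.
import random

random.seed(0)

english_prompts = ["Explain the rules of baseball.", "Write a story about Thanksgiving."]

prompt_space = {  # hypothetical per-language templates + culturally grounded topics
    "sw": {
        "templates": ["Eleza {topic} kwa undani.", "Andika hadithi kuhusu {topic}."],
        "topics": ["mchezo wa bao", "sikukuu ya Eid"],
    },
}

def translate(prompt, lang):
    return f"[{lang}] {prompt}"  # placeholder for a machine-translation system

def sample_native(lang, n=2):
    space = prompt_space[lang]
    return [
        random.choice(space["templates"]).format(topic=random.choice(space["topics"]))
        for _ in range(n)
    ]

print([translate(p, "sw") for p in english_prompts])  # translation-only baseline
print(sample_native("sw"))                            # prompts sampled in the language's own space
```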