Benjamin Thérien

@benjamintherien

Followers: 356
Following: 591
Media: 38
Statuses: 225

Ph.D. student at UdeM & Mila | Incoming Intern at Meta NYC | Distributed training & creating learned optimizers that generalize

Montréal, Québec
Joined November 2018
@benjamintherien
Benjamin Thérien
6 months
Is AdamW the best inner optimizer for DiLoCo? Does the inner optimizer affect the compressibility of the DiLoCo delta? Excited to introduce MuLoCo: Muon is a practical inner optimizer for DiLoCo! 🧵 https://t.co/62OVigYWpt 1/N
2
27
86
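For readers unfamiliar with DiLoCo's structure, here is a minimal single-worker sketch of one round with a swappable inner optimizer (AdamW in standard DiLoCo; the thread studies Muon instead). This is an illustration of the general setup, not the paper's code; the function name and arguments are mine.

```python
# Minimal sketch of one DiLoCo round with a swappable inner optimizer.
# Not the MuLoCo implementation; names and arguments are illustrative.
import torch
import torch.nn.functional as F

def diloco_round(model, inner_opt, outer_opt, data_iter, local_steps=100):
    # Snapshot the synchronized parameters at the start of the round.
    start = [p.detach().clone() for p in model.parameters()]

    # Inner phase: many local steps with the inner optimizer (AdamW or Muon).
    for _ in range(local_steps):
        x, y = next(data_iter)
        loss = F.cross_entropy(model(x), y)
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()

    # Outer phase: the parameter change over the round is the "delta"
    # (pseudogradient). In multi-worker DiLoCo it is averaged across workers
    # before the outer optimizer (typically Nesterov SGD) applies it.
    for p, p0 in zip(model.parameters(), start):
        p.grad = p0 - p.detach()   # pseudogradient fed to the outer optimizer
        p.data.copy_(p0)           # reset to the synchronized point
    outer_opt.step()
```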
@peholderrieth
Peter Holderrieth
1 month
New work: “GLASS Flows: Transition Sampling for Alignment of Flow and Diffusion Models”. GLASS generates images by sampling stochastic Markov transitions with ODEs - allowing us to boost text-image alignment for large-scale models at inference time! https://t.co/unsuG3mYer [1/7]
4
61
249
@siddarthv66
Siddarth Venkatraman
2 months
NO verifiers. NO Tools. Qwen3-4B-Instruct can match DeepSeek-R1 and o3-mini (high) with ONLY test-time scaling. Presenting Recursive Self-Aggregation (RSA) — the strongest test-time scaling method I know of! Then we use aggregation-aware RL to push further!! 📈📈 🧵below!
23
103
786
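A minimal sketch of a recursive self-aggregation style loop is below. Here `generate` is a hypothetical prompt-to-completion callable standing in for the LLM; the prompts, pool sizes, and stopping rule are illustrative, not the authors' implementation.

```python
# Sketch of recursive self-aggregation: sample candidates, repeatedly group
# them, and ask the model to merge each group into one refined answer.
import random

def recursive_self_aggregation(question, generate, n_candidates=8,
                               group_size=4, rounds=3):
    # Start from independent samples of the same question.
    pool = [generate(question) for _ in range(n_candidates)]
    for _ in range(rounds):
        if len(pool) <= 1:
            break
        random.shuffle(pool)
        groups = [pool[i:i + group_size] for i in range(0, len(pool), group_size)]
        # Each group is aggregated into one refined candidate, shrinking the pool.
        pool = [
            generate(
                question
                + "\n\nCandidate solutions:\n" + "\n---\n".join(group)
                + "\n\nCombine the best ideas above into one improved solution."
            )
            for group in groups
        ]
    return pool[0]
```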
@irinarish
Irina Rish
3 months
Excited to join @certeqlab as CSO, a young startup (https://t.co/tAbgqXOUXw) focused on scaling universal (multimodal) foundation models towards accurate forecasting and optimal decision-making in complex dynamical systems, e.g. financial markets. We are hiring!
certeq.com
CertEq is a fintech platform for fundamental AI research in financial markets
@certeqlab
CertEq
3 months
🚀 Join the @certeqlab AI Research Lab led by Prof. @irinarish. Building Universal Foundation Models unifying time-series, text, graphs & more to advance prediction & decision-making in financial markets. Freedom to publish, big compute, real impact. Apply
3
6
42
@amir_sarfi
Amir Sarfi
3 months
Introducing SparseLoCo: a communication-efficient method for LLM pre-training. TL;DR: We leverage Top-k sparsification + error feedback with DiLoCo’s infrequent outer steps—communicating only 1–3% of gradients with 2-bit quantization—outperforming DiLoCo and DeMo. 1/N, arXiv:
10
35
155
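The compression primitive named in the tweet, Top-k sparsification with error feedback, can be sketched in a few lines of PyTorch. This is an illustration of the primitive applied to a pseudogradient tensor, not the SparseLoCo code; the 2-bit quantization step and the DiLoCo outer loop are omitted.

```python
# Top-k sparsification with error feedback: send only the largest-magnitude
# entries, keep the rest locally and fold them into the next round.
import torch

def topk_with_error_feedback(pseudograd, error_buffer, k_fraction=0.02):
    compensated = pseudograd + error_buffer            # add residual from last round
    flat = compensated.flatten()
    k = max(1, int(k_fraction * flat.numel()))
    _, idx = torch.topk(flat.abs(), k)                 # largest-magnitude entries
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]                            # only these values are sent
    new_error = (flat - sparse).view_as(pseudograd)    # residual stays local
    return sparse.view_as(pseudograd), new_error

# Usage: carry the error buffer across communication rounds.
g = torch.randn(1024, 1024)
err = torch.zeros_like(g)
sparse_g, err = topk_with_error_feedback(g, err, k_fraction=0.02)
```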
@benjamintherien
Benjamin Thérien
3 months
What if I told you how to outperform DiLoCo while only communicating 1-3% of the pseudogradient? https://t.co/ltNtZjxpfn
@amir_sarfi
Amir Sarfi
3 months
Introducing SparseLoCo: a communication-efficient method for LLM pre-training. TL;DR: We leverage Top-k sparsification + error feedback with DiLoCo’s infrequent outer steps—communicating only 1–3% of gradients with 2-bit quantization—outperforming DiLoCo and DeMo. 1/N, arXiv:
0
7
29
@Mila_Quebec
Mila - Institut québécois d'IA
3 months
Congratulations to Irina Rish (@irinarish), Core Academic Member at Mila, on being named a Principal Investigator for the @SimonsFdn's new Simons Collaboration on the Physics of Learning and Neural Computation. This initiative brings together researchers from diverse disciplines
simonsfoundation.org
Simons Foundation Launches Collaboration on the Physics of Learning and Neural Computation on Simons Foundation
@SimonsFdn
Simons Foundation
3 months
Our new Simons Collaboration on the Physics of Learning and Neural Computation will employ and develop powerful tools from #physics, #math, computer science and theoretical #neuroscience to understand how large neural networks learn, compute, scale, reason and imagine:
1
2
27
@ebelilov
Eugene Belilovsky
3 months
I'm recruiting a Postdoc to join my group at Mila and Concordia University. Possible topics: distributed learning, learned optimizers, federated learning, and other topics in efficient training (scaling laws, architecture search, etc.). More info here:
0
4
8
@mirandrom
Andrei Mircea
4 months
Interested in LLM training dynamics and scaling laws? Come to our #ACL2025 oral tomorrow! ⏰ Tuesday 2:55pm 📍 Hall C (Language Modeling 1) 🌐 https://t.co/LgnSGOKjqU If you're in Vienna and want to chat, let me know! @Mila_Quebec
@mirandrom
Andrei Mircea
4 months
Step 1: Understand how scaling improves LLMs. Step 2: Directly target underlying mechanism. Step 3: Improve LLMs independent of scale. Profit. In our ACL 2025 paper we look at Step 1 in terms of training dynamics. Project: https://t.co/4mkBALoilL Paper: https://t.co/CxBxbuZqgC
0
7
16
@janson002
Paul Janson
4 months
🧵 Super excited to present at two @icmlconf 2025 workshops in Vancouver 🇨🇦🍁!
1
1
5
@mirandrom
Andrei Mircea
4 months
Step 1: Understand how scaling improves LLMs. Step 2: Directly target underlying mechanism. Step 3: Improve LLMs independent of scale. Profit. In our ACL 2025 paper we look at Step 1 in terms of training dynamics. Project: https://t.co/4mkBALoilL Paper: https://t.co/CxBxbuZqgC
6
34
199
@MassCaccia
Massimo Caccia
4 months
🎉 Our paper “𝐻𝑜𝑤 𝑡𝑜 𝑇𝑟𝑎𝑖𝑛 𝑌𝑜𝑢𝑟 𝐿𝐿𝑀 𝑊𝑒𝑏 𝐴𝑔𝑒𝑛𝑡: 𝐴 𝑆𝑡𝑎𝑡𝑖𝑠𝑡𝑖𝑐𝑎𝑙 𝐷𝑖𝑎𝑔𝑛𝑜𝑠𝑖𝑠” got an 𝐨𝐫𝐚𝐥 at next week’s 𝗜𝗖𝗠𝗟 𝗪𝗼𝗿𝗸𝘀𝗵𝗼𝗽 𝗼𝗻 𝗖𝗼𝗺𝗽𝘂𝘁𝗲𝗿 𝗨𝘀𝗲 𝗔𝗴𝗲𝗻𝘁𝘀! 🖥️🧠 We present the 𝐟𝐢𝐫𝐬𝐭 𝐥𝐚𝐫𝐠𝐞-𝐬𝐜𝐚𝐥𝐞
6
53
218
@PandaAshwinee
Ashwinee Panda
4 months
our paper on CPT of MoEs was rejected from #COLM2025 w/ scores of 8/7/7/5. the only reject said "I decide between 5 and 6". we emailed PCs, but just got "We are sorry, but the venue simply does not have the capacity to provide feedback at a more granular level." from @yoavartzi. 🙁
2
7
53
@benjamintherien
Benjamin Thérien
5 months
Tired of tuning hyperparameters? Introducing PyLO! We’re bringing hyperparameter-free learned optimizers to PyTorch with drop-in torch.optim support and faster step times thanks to our custom CUDA kernels. Check out our code here:
github.com
An efficient implementation of learned optimizers in PyTorch - Belilovsky-Lab/pylo
@janson002
Paul Janson
5 months
Have you ever trained a neural network using a learned optimizer instead of AdamW? Doubt it: you're probably coding in PyTorch! Excited to introduce PyLO: Towards Accessible Learned Optimizers in PyTorch! Accepted at the @icmlconf ICML 2025 CODEML workshop 🧵1/N
2
7
31
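To show what "drop-in torch.optim support" means in practice, here is a short, illustrative training-step snippet. The pylo import path and class name below are assumptions and are left commented out; see the linked Belilovsky-Lab/pylo repo for the actual API.

```python
# Illustrative only: swapping a hand-tuned optimizer for a learned one behind
# the same torch.optim interface. The pylo names are assumptions, not the
# library's documented API.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(128, 10)

# Baseline: a hand-tuned optimizer.
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

# Hypothetical swap to a learned optimizer with no hyperparameters to tune:
# from pylo.optim import LearnedOptimizer   # class name is an assumption
# opt = LearnedOptimizer(model.parameters())

x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
loss = F.cross_entropy(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```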
@Luke22R
Luke Rowe
5 months
🚀 Our method, Poutine, was the best-performing entry in the 2025 Waymo Vision-based End-to-End Driving Challenge at #CVPR2025! Our 3B-parameter VLM Poutine scored 7.99 RFS on the official test set—comfortably ahead of every other entry (see figure).
3
11
21
@emilianopp_
Emiliano Penaloza
5 months
Excited that our paper "Addressing Concept Mislabeling in Concept Bottleneck Models Through Preference Optimization" was accepted to ICML 2025! We show how Preference Optimization can reduce the impact of noisy concept labels in CBMs. 🧵/9
1
24
36
@majdi_has
Majdi Hassan
5 months
(1/n)🚨You can train a model that solves DFT for any geometry almost without training data!🚨 Introducing Self-Refining Training for Amortized Density Functional Theory — a variational framework for learning a DFT solver that predicts the ground-state solutions for different
3
41
156
@QuentinAnthon15
Quentin Anthony
6 months
Inspired by “minimal implementation” projects in AI such as @karpathy’s nanoGPT, I worked to bring this concept to the HPC world! I’ve built a minimal implementation of an MPI library called nanoMPI, which focuses on clarity, simplicity, and easy installation.
12
36
306
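For readers who haven't used MPI, the point-to-point pattern that any MPI implementation provides (and that a minimal library like nanoMPI targets) looks like the snippet below. It uses mpi4py, an established Python binding, purely as an illustration; it is not nanoMPI and says nothing about nanoMPI's own API.

```python
# Minimal MPI point-to-point example, shown with mpi4py (not nanoMPI).
# Run with: mpiexec -n 2 python hello_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send("hello from rank 0", dest=1, tag=11)   # blocking send
elif rank == 1:
    msg = comm.recv(source=0, tag=11)                # blocking receive
    print(f"rank 1 received: {msg!r}")
```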
@benjamintherien
Benjamin Thérien
6 months
In a much smaller-scale setup (3 layers, width 128, 2 heads) not reported in the paper, we swept the Outer LR, Local Max LR, and EF beta for DiLoCo and MuLoCo. 12/N
0
0
1
@benjamintherien
Benjamin Thérien
6 months
@KyleLiang5
Kaizhao Liang
6 months
Tried Muon with DiLoCo on 16 GPUs for a 100M Llama. Takeaway🧐: You get 2x peak throughput, but worse loss at the beginning. Disclaimer: this is the one with SGD as the outer optimizer, the only one that doesn't require extra local memory. DiLoCo: https://t.co/l9oGYmLBTV
1
0
3
@benjamintherien
Benjamin Thérien
6 months
Here are some cool existing threads on X investigating DiLoCo with Muon as the inner optimizer: Thread 1. 10/N https://t.co/RdHT6fbASL
@hi_tysam
Fern
6 months
DiLoCo on modded-nanoGPT: A tiny change seems to reduce run-to-run std by a surprising 3-5x (!!!!!!) and improve performance over the baseline from ~31.1% -> ~32.6%. More predictable runs == ↑↑↑ cheaper experimentation! Also yields a new DiLoCo interpretation! Brief details below.
1
0
1