Benjamin Thérien

@benjamintherien

Followers: 356
Following: 591
Media: 38
Statuses: 225

Ph.D. student at UdeM & Mila | Incoming Intern at Meta NYC | Distributed training & creating learned optimizers that generalize

Montréal, Québec
Joined November 2018
@benjamintherien
Benjamin Thérien
6 months
Is AdamW the best inner optimizer for DiLoCo? Does the inner optimizer affect the compressibility of the DiLoCo delta? Excited to introduce MuLoCo: Muon is a practical inner optimizer for DiLoCo! 🧵 https://t.co/62OVigYWpt 1/N
2
27
86
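For readers unfamiliar with DiLoCo's structure, here is a minimal single-worker sketch of one round with a swappable inner optimizer (AdamW in standard DiLoCo; the thread studies Muon instead). This is an illustration of the general setup, not the paper's code; the function name and arguments are mine.

```python
# Minimal sketch of one DiLoCo round with a swappable inner optimizer.
# Not the MuLoCo implementation; names and arguments are illustrative.
import torch
import torch.nn.functional as F

def diloco_round(model, inner_opt, outer_opt, data_iter, local_steps=100):
    # Snapshot the synchronized parameters at the start of the round.
    start = [p.detach().clone() for p in model.parameters()]

    # Inner phase: many local steps with the inner optimizer (AdamW or Muon).
    for _ in range(local_steps):
        x, y = next(data_iter)
        loss = F.cross_entropy(model(x), y)
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()

    # Outer phase: the parameter change over the round is the "delta"
    # (pseudogradient). In multi-worker DiLoCo it is averaged across workers
    # before the outer optimizer (typically Nesterov SGD) applies it.
    for p, p0 in zip(model.parameters(), start):
        p.grad = p0 - p.detach()   # pseudogradient fed to the outer optimizer
        p.data.copy_(p0)           # reset to the synchronized point
    outer_opt.step()
```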
@peholderrieth
Peter Holderrieth
1 month
New work: “GLASS Flows: Transition Sampling for Alignment of Flow and Diffusion Models”. GLASS generates images by sampling stochastic Markov transitions with ODEs - allowing us to boost text-image alignment for large-scale models at inference time! https://t.co/unsuG3mYer [1/7]
4
61
249
@siddarthv66
Siddarth Venkatraman
2 months
NO verifiers. NO Tools. Qwen3-4B-Instruct can match DeepSeek-R1 and o3-mini (high) with ONLY test-time scaling. Presenting Recursive Self-Aggregation (RSA) — the strongest test-time scaling method I know of! Then we use aggregation-aware RL to push further!! 📈📈 🧵below!
23
103
786
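A minimal sketch of a recursive self-aggregation style loop is below. Here `generate` is a hypothetical prompt-to-completion callable standing in for the LLM; the prompts, pool sizes, and stopping rule are illustrative, not the authors' implementation.

```python
# Sketch of recursive self-aggregation: sample candidates, repeatedly group
# them, and ask the model to merge each group into one refined answer.
import random

def recursive_self_aggregation(question, generate, n_candidates=8,
                               group_size=4, rounds=3):
    # Start from independent samples of the same question.
    pool = [generate(question) for _ in range(n_candidates)]
    for _ in range(rounds):
        if len(pool) <= 1:
            break
        random.shuffle(pool)
        groups = [pool[i:i + group_size] for i in range(0, len(pool), group_size)]
        # Each group is aggregated into one refined candidate, shrinking the pool.
        pool = [
            generate(
                question
                + "\n\nCandidate solutions:\n" + "\n---\n".join(group)
                + "\n\nCombine the best ideas above into one improved solution."
            )
            for group in groups
        ]
    return pool[0]
```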
@irinarish
Irina Rish
3 months
Excited to join @certeqlab as CSO, a young startup (https://t.co/tAbgqXOUXw) focused on scaling universal (multimodal) foundation models towards accurate forecasting and optimal decision-making in complex dynamical systems, e.g. financial markets. We are hiring!
certeq.com
CertEq is a fintech platform for fundamental AI research in financial markets
@certeqlab
CertEq
3 months
🚀 Join the @certeqlab AI Research Lab led by Prof. @irinarish. Building Universal Foundation Models unifying time-series, text, graphs & more to advance prediction & decision-making in financial markets. Freedom to publish, big compute, real impact. Apply
3
6
42
@amir_sarfi
Amir Sarfi
3 months
Introducing SparseLoCo: a communication-efficient method for LLM pre-training. TL;DR: We leverage Top-k sparsification + error feedback with DiLoCo’s infrequent outer steps—communicating only 1–3% of gradients with 2-bit quantization—outperforming DiLoCo and DeMo. 1/N, arXiv:
10
35
155
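The compression primitive named in the tweet, Top-k sparsification with error feedback, can be sketched in a few lines of PyTorch. This is an illustration of the primitive applied to a pseudogradient tensor, not the SparseLoCo code; the 2-bit quantization step and the DiLoCo outer loop are omitted.

```python
# Top-k sparsification with error feedback: send only the largest-magnitude
# entries, keep the rest locally and fold them into the next round.
import torch

def topk_with_error_feedback(pseudograd, error_buffer, k_fraction=0.02):
    compensated = pseudograd + error_buffer            # add residual from last round
    flat = compensated.flatten()
    k = max(1, int(k_fraction * flat.numel()))
    _, idx = torch.topk(flat.abs(), k)                 # largest-magnitude entries
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]                            # only these values are sent
    new_error = (flat - sparse).view_as(pseudograd)    # residual stays local
    return sparse.view_as(pseudograd), new_error

# Usage: carry the error buffer across communication rounds.
g = torch.randn(1024, 1024)
err = torch.zeros_like(g)
sparse_g, err = topk_with_error_feedback(g, err, k_fraction=0.02)
```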
@benjamintherien
Benjamin Thérien
3 months
What if I told you how to outperform DiLoCo while only communicating 1-3% of the pseudogradient? https://t.co/ltNtZjxpfn
@amir_sarfi
Amir Sarfi
3 months
Introducing SparseLoCo: a communication-efficient method for LLM pre-training. TL;DR: We leverage Top-k sparsification + error feedback with DiLoCo’s infrequent outer steps—communicating only 1–3% of gradients with 2-bit quantization—outperforming DiLoCo and DeMo. 1/N, arXiv:
0
7
29
@Mila_Quebec
Mila - Institut québécois d'IA
3 months
Congratulations to Irina Rish (@irinarish), Core Academic Member at Mila, on being named a Principal Investigator for the @SimonsFdn's new Simons Collaboration on the Physics of Learning and Neural Computation. This initiative brings together researchers from diverse disciplines
simonsfoundation.org
Simons Foundation Launches Collaboration on the Physics of Learning and Neural Computation on Simons Foundation
@SimonsFdn
Simons Foundation
3 months
Our new Simons Collaboration on the Physics of Learning and Neural Computation will employ and develop powerful tools from #physics, #math, computer science and theoretical #neuroscience to understand how large neural networks learn, compute, scale, reason and imagine:
1
2
27
@ebelilov
Eugene Belilovsky
3 months
I'm recruiting a Postdoc to join my group at Mila and Concordia University. Possible topics: distributed learning, learned optimizers, federated learning, and other topics in efficient training (scaling laws, architecture search, etc.). More info here:
0
4
8
@mirandrom
Andrei Mircea
4 months
Interested in LLM training dynamics and scaling laws? Come to our #ACL2025 oral tomorrow! ⏰ Tuesday 2:55pm 📍 Hall C (Language Modeling 1) 🌐 https://t.co/LgnSGOKjqU If you're in Vienna and want to chat, let me know! @Mila_Quebec
@mirandrom
Andrei Mircea
4 months
Step 1: Understand how scaling improves LLMs. Step 2: Directly target underlying mechanism. Step 3: Improve LLMs independent of scale. Profit. In our ACL 2025 paper we look at Step 1 in terms of training dynamics. Project: https://t.co/4mkBALoilL Paper: https://t.co/CxBxbuZqgC
0
7
16
@janson002
Paul Janson
4 months
🧵 Super excited to present at two @icmlconf 2025 workshops in Vancouver 🇨🇦🍁!
1
1
5
@mirandrom
Andrei Mircea
4 months
Step 1: Understand how scaling improves LLMs. Step 2: Directly target underlying mechanism. Step 3: Improve LLMs independent of scale. Profit. In our ACL 2025 paper we look at Step 1 in terms of training dynamics. Project: https://t.co/4mkBALoilL Paper: https://t.co/CxBxbuZqgC
6
34
199
@MassCaccia
Massimo Caccia
4 months
🎉 Our paper “𝐻𝑜𝑤 𝑡𝑜 𝑇𝑟𝑎𝑖𝑛 𝑌𝑜𝑢𝑟 𝐿𝐿𝑀 𝑊𝑒𝑏 𝐴𝑔𝑒𝑛𝑡: 𝐴 𝑆𝑡𝑎𝑡𝑖𝑠𝑡𝑖𝑐𝑎𝑙 𝐷𝑖𝑎𝑔𝑛𝑜𝑠𝑖𝑠” got an 𝐨𝐫𝐚𝐥 at next week’s 𝗜𝗖𝗠𝗟 𝗪𝗼𝗿𝗸𝘀𝗵𝗼𝗽 𝗼𝗻 𝗖𝗼𝗺𝗽𝘂𝘁𝗲𝗿 𝗨𝘀𝗲 𝗔𝗴𝗲𝗻𝘁𝘀! 🖥️🧠 We present the 𝐟𝐢𝐫𝐬𝐭 𝐥𝐚𝐫𝐠𝐞-𝐬𝐜𝐚𝐥𝐞
6
53
218
@PandaAshwinee
Ashwinee Panda
4 months
our paper on CPT of MoEs was rejected from #COLM2025 w/ scores of 8/7/7/5. the only reject said "I decide between 5 and 6". we emailed PCs, but just got "We are sorry, but the venue simply does not have the capacity to provide feedback at a more granular level." from @yoavartzi. 🙁
2
7
53
@benjamintherien
Benjamin Thérien
5 months
Tired of tuning hyperparameters? Introducing PyLO! We’re bringing hyperparameter-free learned optimizers to PyTorch with drop-in torch.optim support and faster step times thanks to our custom CUDA kernels. Check out our code here:
github.com
An efficient implementation of learned optimizers in PyTorch - Belilovsky-Lab/pylo
@janson002
Paul Janson
5 months
Have you ever trained a neural network using a learned optimizer instead of AdamW? Doubt it: you're probably coding in PyTorch! Excited to introduce PyLO: Towards Accessible Learned Optimizers in PyTorch! Accepted at the @icmlconf ICML 2025 CODEML workshop 🧵1/N
2
7
31
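To show what "drop-in torch.optim support" means in practice, here is a short, illustrative training-step snippet. The pylo import path and class name below are assumptions and are left commented out; see the linked Belilovsky-Lab/pylo repo for the actual API.

```python
# Illustrative only: swapping a hand-tuned optimizer for a learned one behind
# the same torch.optim interface. The pylo names are assumptions, not the
# library's documented API.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(128, 10)

# Baseline: a hand-tuned optimizer.
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

# Hypothetical swap to a learned optimizer with no hyperparameters to tune:
# from pylo.optim import LearnedOptimizer   # class name is an assumption
# opt = LearnedOptimizer(model.parameters())

x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
loss = F.cross_entropy(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```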
@Luke22R
Luke Rowe
5 months
🚀 Our method, Poutine, was the best-performing entry in the 2025 Waymo Vision-based End-to-End Driving Challenge at #CVPR2025! Our 3B-parameter VLM Poutine scored 7.99 RFS on the official test set—comfortably ahead of every other entry (see figure).
3
11
21
@emilianopp_
Emiliano Penaloza
5 months
Excited that our paper "Addressing Concept Mislabeling in Concept Bottleneck Models Through Preference Optimization" was accepted to ICML 2025! We show how Preference Optimization can reduce the impact of noisy concept labels in CBMs. 🧵/9
1
24
36
@majdi_has
Majdi Hassan
5 months
(1/n)🚨You can train a model that solves DFT for any geometry almost without training data!🚨 Introducing Self-Refining Training for Amortized Density Functional Theory — a variational framework for learning a DFT solver that predicts the ground-state solutions for different
3
41
156
@QuentinAnthon15
Quentin Anthony
6 months
Inspired by “minimal implementation” projects in AI such as @karpathy’s nanoGPT, I worked to bring this concept to the HPC world! I’ve built a minimal implementation of an MPI library called nanoMPI, which focuses on clarity, simplicity, and easy installation.
12
36
306
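For readers who haven't used MPI, the point-to-point pattern that any MPI implementation provides (and that a minimal library like nanoMPI targets) looks like the snippet below. It uses mpi4py, an established Python binding, purely as an illustration; it is not nanoMPI and says nothing about nanoMPI's own API.

```python
# Minimal MPI point-to-point example, shown with mpi4py (not nanoMPI).
# Run with: mpiexec -n 2 python hello_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send("hello from rank 0", dest=1, tag=11)   # blocking send
elif rank == 1:
    msg = comm.recv(source=0, tag=11)                # blocking receive
    print(f"rank 1 received: {msg!r}")
```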
@benjamintherien
Benjamin Thérien
6 months
In a much smaller-scale setup (3 layers, width 128, 2 heads) not reported in the paper, we swept the Outer LR, Local Max LR, and EF beta for DiLoCo and MuLoCo. 12/N
0
0
1
@benjamintherien
Benjamin Thérien
6 months
@KyleLiang5
Kaizhao Liang
6 months
Tried Muon with DiLoCo on 16 GPUs for a 100M Llama. Takeaway🧐: You get 2x peak throughput, but worse loss at the beginning. Disclaimer: this is the one with SGD as the outer optimizer, the only one that doesn't require extra local memory. DiLoCo: https://t.co/l9oGYmLBTV
1
0
3
@benjamintherien
Benjamin Thérien
6 months
Here are some cool existing threads on X investigating DiLoCo with Muon as the inner optimizer: Thread 1. 10/N https://t.co/RdHT6fbASL
@hi_tysam
Fern
6 months
DiLoCo on modded-nanoGPT: A tiny change seems to reduce run-to-run std by a surprising 3-5x (!!!!!!) and improve performance over the baseline from ~31.1% -> ~32.6%. More predictable runs == ↑↑↑ cheaper experimentation! Also yields a new DiLoCo interpretation! Brief details below.
1
0
1