
Arthur Douillard
@Ar_Douillard
Followers
7K
Following
17K
Media
567
Statuses
4K
distributed learning @ deepmind | DiLoCo, DiPaCo | world-wide compute arbitrage
Joined January 2016
from @JeffDean on the @dwarkesh_sp podcast: "asynchronous training where each copy of the model does local computation [...] it makes people uncomfortable [...] but it actually works". yep, i can confirm, it does work for real
16
53
796
I am excited to share that, after my PhD, I will join @DeepMind this summer as a Research Scientist in the Continual Learning team led by Marc'Aurelio Ranzato!
28
6
459
Distributed learning? Recently, @PrimeIntellect announced their 10B distributed training run (, but what is it exactly? Going back to the origin, Federated Learning (FL) aims to train a model across a fleet of phones (see the sketch below). To handle the
Announcing INTELLECT-1: the first-ever decentralized training of a 10B model. Scaling decentralized training 10x beyond prior efforts. Anyone can join us to build open-source AGI
17
68
450
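A minimal sketch of the federated-averaging idea referenced above (plain PyTorch; function and argument names are illustrative, not taken from any of the cited codebases): each client copies the global weights, takes a few local steps on its own data, and the server averages the results.

```python
import copy
import itertools
import torch
from torch import nn


def local_update(global_model, data_loader, steps=10, lr=0.01):
    """One client: copy the global weights and take a few local SGD steps."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    batches = itertools.cycle(data_loader)
    for _ in range(steps):
        x, y = next(batches)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model.state_dict()


def federated_round(global_model, client_loaders):
    """Server: average the clients' locally trained weights (FedAvg-style)."""
    states = [local_update(global_model, dl) for dl in client_loaders]
    avg = {k: torch.stack([s[k].float() for s in states]).mean(0) for k in states[0]}
    global_model.load_state_dict(avg)
    return global_model
```

Full FedAvg would weight each client's contribution by its dataset size; the uniform average above keeps the sketch minimal.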
We released our work on data parallelism for language models *distributed* across the entire world! Thread below.
DiLoCo: Distributed Low-Communication Training of Language Models. paper page: Large language models (LLM) have become a critical component in many applications of machine learning. However, standard approaches to training LLM require a large number of
17
66
373
I'm super excited to release DiPaCo, a new kind of mixture of experts that can scale, engineering-wise, to data centers across the entire world! A few words about it in this thread.
Google presents DiPaCo. Distributed Path Composition. Progress in machine learning (ML) has been fueled by scaling neural network models. This scaling has been enabled by ever more heroic feats of engineering, necessary for accommodating ML approaches that require high
12
49
296
My team at @DeepMind is looking for a Research Engineer in Efficient Large-Scale Learning! Unprecedented scale + efficient adaptation to new tasks. Distributed large-scale learning and continual learning!
5
37
246
Don't waste flops, upcycle. 1. Upcycling ( @arankomatsuzaki et al.) proposes to transform a pretrained dense network into an MoE by duplicating its MLP layers (see the sketch below). You still have to train the (relatively small) routers from scratch, but this saves lots of
5
35
187
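A rough sketch of the upcycling recipe described above, assuming simple top-1 routing (module and argument names are hypothetical): every expert starts as a copy of the pretrained dense MLP, and only the router is initialized from scratch.

```python
import copy
import torch
from torch import nn


class UpcycledMoE(nn.Module):
    """Turn one pretrained dense MLP into a top-1 mixture of experts."""

    def __init__(self, dense_mlp: nn.Module, d_model: int, num_experts: int = 8):
        super().__init__()
        # Each expert begins as an exact copy of the pretrained dense MLP.
        self.experts = nn.ModuleList(
            [copy.deepcopy(dense_mlp) for _ in range(num_experts)]
        )
        # Only the router is trained from scratch.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):  # x: [batch, tokens, d_model]
        scores = self.router(x).softmax(dim=-1)   # [batch, tokens, num_experts]
        top_p, top_idx = scores.max(dim=-1)       # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                   # tokens routed to expert e
            if mask.any():
                out[mask] = expert(x[mask]) * top_p[mask].unsqueeze(-1)
        return out
```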
one more step towards decentralized learning: Eager Updates. can we overlap communication with computation over hundreds of steps (sketch below)? -- yes we can. in this work led by @SatyenKale, we improve DiLoCo and use 1177x less bandwidth than data-parallel
3
21
173
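A heavily simplified sketch of the overlap idea, not the exact Eager Updates algorithm: the all-reduce of the outer delta is launched asynchronously and local inner steps keep running while the bytes are in flight. It assumes an initialized torch.distributed process group and treats the weights as flat tensors; names and the outer learning rate are illustrative.

```python
import torch
import torch.distributed as dist


def outer_step_overlapped(params, synced_params, run_inner_steps, outer_lr=0.7):
    # Outer delta: how far this replica drifted since the last synchronization.
    outer_grad = synced_params - params

    # Non-blocking all-reduce: the communication proceeds in the background.
    handle = dist.all_reduce(outer_grad, op=dist.ReduceOp.SUM, async_op=True)

    # Local inner optimizer steps overlap the communication above.
    run_inner_steps(params)

    # Block only when the averaged outer delta is actually needed.
    handle.wait()
    outer_grad /= dist.get_world_size()
    synced_params -= outer_lr * outer_grad
    params.copy_(synced_params)
```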
Very cool to see @JeffDean highlighting our DiPaCo ( in the MoE history at Google! (see last line)
oh wow, @JeffDean dropping incredible amounts of lore on Gemini 1.5 Pro and 2.0 Flash this morning. the references alone can fill up the @latentspacepod paper club for a year. (this is only part of the "everything we do better than you at google" talk, he is talking about ai
3
13
154
"Some sort of Federated Learning, async distributed computing will have to work." - Jensen. on it.
BG2. Ep 17. Double $NVDA! System Level Comp Moat, "Insane Demand", Inference Explosion 1 B x, Memphis Supercluster, OpenAI, & more. @altcap @_clarktang @bgurley. (00:00) Intro. (1:50) The Evolution of AGI and Personal Assistants. (06:03) NVIDIA's
8
25
140
Very interesting. DeMo syncs only the fast components of the signal, while the slower components are accumulated in the momentum (crude sketch below). Two things that come to my mind are this intuition of keeping the slow gradients (grokfast: , and reducing
So about that Nous DisTrO project, of which many very cracked people are skeptical. Now there's a paper on arXiv and a PyTorch implementation on GitHub. Also it's called DeMo (Decoupled Momentum Optimization). These are results. This is the idea. I'm interested in how this ends.
5
16
113
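A crude caricature of that fast/slow split, not DeMo's actual algorithm (which extracts the fast components with a DCT): keep a local momentum buffer, ship only its largest-magnitude entries, and leave the slow residual on the worker. The top-k heuristic, hyperparameters, and names here are purely illustrative.

```python
import torch
import torch.distributed as dist


def demo_like_step(param, grad, momentum, lr=1e-3, beta=0.9, k_frac=0.01):
    with torch.no_grad():
        # Accumulate everything into a local momentum buffer.
        momentum.mul_(beta).add_(grad)

        # "Fast" signal: the k largest-magnitude momentum entries.
        flat = momentum.view(-1)
        k = max(1, int(k_frac * flat.numel()))
        _, idx = flat.abs().topk(k)
        fast = torch.zeros_like(flat)
        fast[idx] = flat[idx]

        # Only the sparse fast component crosses the network (a real system
        # would transmit just the (index, value) pairs, not a dense tensor).
        dist.all_reduce(fast, op=dist.ReduceOp.SUM)
        fast /= dist.get_world_size()

        # The slow residual never leaves the worker: it stays in the momentum.
        flat[idx] = 0.0

        param.view(-1).add_(fast, alpha=-lr)
```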
In 2023, we showed that with our particular federated/distributed configuration DiLoCo (Adam + Nesterov), you could train a relatively large (0.5B) LLM distributed across the world while being as good as centralized training (sketch below). @PrimeIntellect released
3
15
105
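A single-replica sketch of that DiLoCo configuration: AdamW as the inner optimizer, and SGD with Nesterov momentum as the outer optimizer applied to the delta between the last synced weights and the locally trained weights. Hyperparameters are illustrative; in the multi-replica setting the delta is averaged across workers before the outer step.

```python
import itertools
import torch


def diloco_train(model, data_loader, rounds=10, inner_steps=500,
                 inner_lr=4e-4, outer_lr=0.7):
    # Inner optimizer: plain AdamW on the local replica.
    inner_opt = torch.optim.AdamW(model.parameters(), lr=inner_lr)
    # Outer optimizer: Nesterov SGD acting on a copy of the last synced weights.
    global_params = [p.detach().clone() for p in model.parameters()]
    outer_opt = torch.optim.SGD(global_params, lr=outer_lr,
                                momentum=0.9, nesterov=True)
    loss_fn = torch.nn.CrossEntropyLoss()
    batches = itertools.cycle(data_loader)

    for _ in range(rounds):
        # Local computation: H inner steps without any communication.
        for _ in range(inner_steps):
            x, y = next(batches)
            inner_opt.zero_grad()
            loss_fn(model(x), y).backward()
            inner_opt.step()
        # Outer "gradient" = last synced weights minus locally trained weights.
        # (With several replicas this delta would be all-reduced here.)
        for gp, p in zip(global_params, model.parameters()):
            gp.grad = gp - p.detach()
        outer_opt.step()
        # The replica starts the next round from the updated synced weights.
        with torch.no_grad():
            for gp, p in zip(global_params, model.parameters()):
                p.copy_(gp)
    return model
```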
30+ accepted papers. 6 oral papers. 6 guest speakers. join us at @iclr_conf on the 27th in Hall 4 #3 for a full-day workshop on Modularity for Collaborative, Decentralized, and Continual Learning. @derylucio, Fengyuan Liu, and I will be organizing
Workshop alert. We'll host a workshop on modularity at ICLR 2025, encompassing collaborative + decentralized + continual learning. Those topics are on the critical path to building better AIs. Interested? Submit a paper and join us in Singapore!
3
28
102
Two of my favorite AI-literature-review blogs are @lilianweng's and @cwolferesearch's. What other blogs should I read? Broad scope probably preferred over a single deep topic.
9
7
93
this infra framework ( + using SWARM ( on the inference nodes to fit ultra-large models is going to be the future. one step closer to the GitTheta ( dream
Releasing INTELLECT-2: We're open-sourcing the first 32B parameter model trained via globally distributed reinforcement learning: • Detailed Technical Report • INTELLECT-2 model checkpoint.
6
15
94
ok @iclr_conf, im logging off - thanks for the week!
1
2
84
Very very cool, you can now train your LLM across the world with a dynamically sized compute pool! Awesome work by the PrimeIntellect team to make DiLoCo ( available to everyone, with lots of engineering tricks to be so efficient.
Releasing INTELLECT-1: We're open-sourcing the first decentralized trained 10B model: - INTELLECT-1 base model & intermediate checkpoints - Pre-training dataset - Post-trained instruct models by @arcee_ai - PRIME training framework - Technical paper with all details
3
22
80
really like this point from @jackclarkSF's Import AI. having a perfect super smart assistant decreases the importance of hard skills but increases the importance of agency and curiosity
3
14
83
World-wide decentralized training with an open-source DiLoCo is done. Note that date for the history books.
We did it - the first decentralized training of a 10B model is complete! Trained across the US, Europe, and Asia. Post-training with @arcee_ai is underway, and a full open-source release is coming in ~1 week, including: base model, checkpoints, post-trained model and data.
3
9
79
I submitted my first paper ever, at CVPR 2020, and it got rejected; it was hard. But I'm happy to announce that my third paper, PLOP, has been accepted to #CVPR2021! Code will be released soon!
New work from Y. Chen, A. Dapogny, @quobbe, and myself. We tackle Continual Semantic Segmentation by introducing a novel distillation loss exploiting local & global details, and an uncertainty-based pseudo-labeling handling background shift. (We are PLOP). A generic sketch of these two ingredients is below.
4
7
73
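A generic sketch of those two ingredients, distillation plus uncertainty-based pseudo-labeling, not PLOP's exact losses (which distill intermediate features at local and global scales); class 0 is assumed to be the background class, and all names are illustrative.

```python
import torch
import torch.nn.functional as F


def continual_seg_losses(new_logits, old_logits, labels, conf_thresh=0.9):
    # new_logits: [B, C_new, H, W], old_logits: [B, C_old, H, W], labels: [B, H, W]
    # Distillation: keep the new model close to the frozen old model on old classes.
    distill = F.kl_div(
        F.log_softmax(new_logits[:, : old_logits.shape[1]], dim=1),
        F.softmax(old_logits, dim=1),
        reduction="batchmean",
    )
    # Pseudo-labeling: background pixels (class 0 assumed) on which the old
    # model is confident are relabeled with the old model's predicted class.
    conf, old_pred = F.softmax(old_logits, dim=1).max(dim=1)
    pseudo = labels.clone()
    mask = (labels == 0) & (conf > conf_thresh)
    pseudo[mask] = old_pred[mask]
    ce = F.cross_entropy(new_logits, pseudo)
    return ce + distill
```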
four challenges could prevent scaling according to @EpochAIResearch. 3 of them may be alleviated by DiLoCo: 1. Localized power constraints: distributed training could use power plants anywhere on earth. 2. Chip production capacity: async communication allows using heterogeneous
3
5
73
The first transformer designed for Continual Learning in Computer Vision has been accepted to #CVPR2022! Using a dynamic approach, it forgets less than previous ensembling methods while using fewer parameters.
4
16
70
Run DiLoCo distributed training on Apple Silicon! They also provide a codebase to simulate DiLoCo on your local MacBook: that's great for quick experimentation, like with NanoGPT.
Distributed training on M4 Mac Mini cluster. We implemented @GoogleDeepMind DiLoCo on Apple Silicon to train large models with 100-1000x less bandwidth compared to DDP baseline. AI is entering a new era where a distributed network of consumer devices can train large models.
0
6
71
what kind of method could enable @huggingface to train a model across the world?
4
6
70
one more implementation of DiLoCo to do distributed training! @PyTorch's TorchFT fault-tolerance package has an implementation of DiLoCo. hopefully soon a Streaming DiLoCo too?
2
7
70
@JeffDean @dwarkesh_sp some people actually came to me in SF and told me "but DiLoCo is actually working!", being very surprised that it wasn't just another paper misleading with outlandish claims. learn more:
We release today the next step for distributed training --> Streaming DiLoCo with Overlapping Communication (sketch of the streaming idea below). TL;DR: train data-parallel across the world with low bandwidth for the same performance: 400x fewer bits exchanged & huge latency tolerance
2
1
68
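A sketch of the streaming idea only, not the released algorithm (which also overlaps the communication with compute and quantizes the outer gradients): the parameters are split into fragments and each outer round synchronizes a single fragment, cutting peak bandwidth roughly by the number of fragments. Assumes an initialized torch.distributed process group; names and the outer learning rate are illustrative.

```python
import torch
import torch.distributed as dist


def stream_sync(fragments, round_idx, outer_lr=0.7):
    """fragments: list of fragments, each a list of (synced_copy, live_param)
    tensor pairs, e.g. one fragment per group of transformer blocks."""
    frag = fragments[round_idx % len(fragments)]   # only this fragment is synced now
    with torch.no_grad():
        for synced, live in frag:
            outer_grad = synced - live
            dist.all_reduce(outer_grad, op=dist.ReduceOp.SUM)
            outer_grad /= dist.get_world_size()
            synced -= outer_lr * outer_grad        # outer step on the synced copy
            live.copy_(synced)                     # local replica rejoins the synced point
```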
My last paper as a PhD student.
Pending #CVPR2023 in June, we are pleased to share our 4 accepted papers. (3/4) "CoMFormer: Continual Learning in Semantic and Panoptic Segmentation" by @fcdl94, @quobbe, @Ar_Douillard. Preprint: Collab w/ Politecnico di Torino and @heuritechlab
0
2
64
Great! I'm finishing my PhD in June, and CVPR 2022 will be my only opportunity to attend an in-person conference in my whooole PhD.
Message from our #CVPR2022 Program Chairs: Unless the epidemiological situation changes drastically, CVPR 2022 will be in person, with an online option for those who cannot travel. Information on visa letters will be sent to authors in the next few days.
3
1
65
Something I didn't fully realize during my PhD but now see: the extended bitter lesson is that hyperparameters are sometimes more important than a new architecture. So many research papers proposing new archi/losses/optimizers would get crushed by a well-tuned baseline.
The DeepSeek-V2 paper was full of pretty amazing nuggets of wisdom. I spent the afternoon copying lots of their training setup into our model. Orange is previous and Blue is new with DeepSeek hyperparameters. Things that mattered most: 1. Warm-up LR ratio. 2. Batch ramp
3
5
63
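For concreteness, here is what those two knobs typically look like; the numbers are illustrative defaults, not DeepSeek's actual settings: a linear learning-rate warmup followed by cosine decay to a fraction of the peak, and a linear ramp of the global batch size.

```python
import math


def lr_with_warmup(step, max_lr=3e-4, warmup_steps=2000,
                   total_steps=100_000, min_ratio=0.1):
    # Linear warmup to max_lr, then cosine decay down to min_ratio * max_lr.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return max_lr * (min_ratio + (1 - min_ratio) * cosine)


def batch_size_ramp(step, start=256, end=2048, ramp_steps=20_000):
    # Linearly grow the global batch size from `start` to `end`.
    if step >= ramp_steps:
        return end
    return int(start + (end - start) * step / ramp_steps)
```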
Tired of implementing the many data settings of Continual Learning? @TLesort & I present Continuum! A PyTorch library that gives you a continual dataset in a few lines: MNIST, PermutedMNIST, CIFAR10/CIFAR100, ImageNet, CORe50, and many more! (usage sketch below)
1
26
65
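A usage sketch of the kind of API the library exposes; the class names, import paths, and the (image, label, task id) triplet convention are recalled from Continuum's documentation and may differ between versions.

```python
from torch.utils.data import DataLoader
from continuum import ClassIncremental          # assumed import path
from continuum.datasets import MNIST            # assumed import path

# Split MNIST into 5 tasks of 2 classes each.
scenario = ClassIncremental(MNIST("data", download=True, train=True), increment=2)

for task_id, task_set in enumerate(scenario):
    loader = DataLoader(task_set, batch_size=64, shuffle=True)
    for x, y, t in loader:   # assumed: Continuum yields (image, label, task id)
        pass                 # train on the current task here
```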
Wow! DiLoCo ( and OpenDiLoCo ( were recognized by @nathanbenaich's @stateofaireport. Lots of research existed before, but i believe that in 2024, and even more in 2025, we'll switch from an exploration mode to an exploitation mode. Scaling has
5
5
63
I think people don't realize the progress in AI that happened in the last 5 years. Being close to the level of a junior dev isn't impressive anymore?
@itsandrewgao Sounds a bit disappointing honestly. The requests were a bit hard, but a good AI should be able to solve these, they aren't exactly rocket science. I think most jr devs would be able to solve them. One request it couldn't even complete and the other just deployed a buggy solution.
7
1
59
It's crazy that model merging works as well as it does (minimal sketch below). Strongly recommend following @Mitchnw @ramealexandre @prateeky2806 who have done a lot in that field.
Model merging is a popular research topic with applications to LLM alignment and specialization. But, did you know this technique has been studied since the 90s? Here's a brief timeline… (Stage 0) Original work on model merging dates back to the 90s [1], where authors showed
6
6
58
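The simplest instance of model merging is plain weight averaging across models that share an architecture and initialization; the quoted thread covers far more elaborate schemes. A minimal sketch (names are illustrative):

```python
import torch


def merge_state_dicts(state_dicts, weights=None):
    # Uniform weights by default; all models must share the same keys and shapes.
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    return {
        key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }
```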