Arthur Douillard

@Ar_Douillard

Followers: 7K
Following: 17K
Media: 567
Statuses: 4K

distributed learning @ deepmind | DiLoCo, DiPaCo | world-wide compute arbitrage

Joined January 2016
@Ar_Douillard
Arthur Douillard
4 months
We release today the next step for distributed training: Streaming DiLoCo with Overlapping Communication. TL;DR: train data-parallel across the world over low bandwidth at the same performance: 400x fewer bits exchanged & huge latency tolerance
Tweet media one
19
106
555
@Ar_Douillard
Arthur Douillard
8 months
The Nobel Prize website has a pretty well-made one-pager on how AlphaFold works
Tweet media one
16
1K
6K
@Ar_Douillard
Arthur Douillard
3 months
from @JeffDean on the @dwarkesh_sp podcast: "asynchronous training where each copy of the model does local computation […] it makes people uncomfortable […] but it actually works". yep, I can confirm, it does work for real
16
53
796
@Ar_Douillard
Arthur Douillard
4 years
I've released my course on deep learning for computer vision! It includes slides, Google Colab notebooks, and Anki decks for all 6 topics I'm covering. We code from the basics (backprop from scratch) to the SotA (transformers & MLP-Mixer). Feedback appreciated!
Tweet media one
14
158
663
@Ar_Douillard
Arthur Douillard
4 years
Vision transformers are more biased towards shapes (as humans are) than Convolutional Networks:
Tweet media one
7
138
650
@Ar_Douillard
Arthur Douillard
6 months
Excellent explanation of RoPE embeddings, from scratch with all the math needed: . And with a beautiful 3blue1brown style of animation: . Original RoPE paper: . (A rough sketch of the rotation itself is below.)
Tweet media one
7
106
654
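A minimal NumPy sketch of the rotation trick behind RoPE, assuming the usual formulation (one frequency per pair of dimensions, rotate-half layout); the function name and shapes are illustrative and not taken from the linked post:

```python
# Minimal sketch of rotary position embeddings (RoPE): rotate each pair of
# dimensions by a position-dependent angle before the attention dot product.
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply RoPE to x of shape (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # one frequency per pair
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = rope(np.random.randn(8, 64))   # queries and keys get the same treatment
```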
@Ar_Douillard
Arthur Douillard
3 years
I am excited to share that, after my PhD šŸ‘Øā€šŸŽ“, I will join @DeepMind this summer as a Research Scientist in the Continual Learning team led by Marc'Aurelio Ranzato! šŸŽ‰.
28
6
459
@Ar_Douillard
Arthur Douillard
7 months
Distributed learning? Recently, @PrimeIntellect announced their 10B distributed training run (, but what is it exactly? Going back to the origin, Federated Learning (FL) aims to train a model across a fleet of phones (a toy FedAvg sketch is below). To handle the
Tweet media one
Tweet media two
Tweet media three
Tweet media four
@PrimeIntellect
Prime Intellect
8 months
Announcing INTELLECT-1: the first-ever decentralized training of a 10B model. Scaling decentralized training 10x beyond prior efforts. Anyone can join us to build open-source AGI šŸ¦‹
17
68
450
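Since the tweet above is cut off, here is only a toy sketch of the FedAvg outer loop it refers to, with made-up models and data callables (nothing here comes from the PrimeIntellect codebase): each worker takes a few local SGD steps on its own data, then the server averages the resulting parameters.

```python
# Toy FedAvg round: local SGD on each worker, then a plain parameter average.
import copy
import torch

def local_sgd(model, data, inner_steps=10, lr=1e-2):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(inner_steps):
        x, y = data()                                    # sample a local batch
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return model.state_dict()

def fedavg_round(global_model, workers):
    states = [local_sgd(copy.deepcopy(global_model), data) for data in workers]
    avg = {k: torch.stack([s[k].detach() for s in states]).mean(0) for k in states[0]}
    global_model.load_state_dict(avg)                    # parameter average = new model
    return global_model

model = torch.nn.Linear(4, 1)
workers = [lambda: (torch.randn(8, 4), torch.randn(8, 1)) for _ in range(4)]
model = fedavg_round(model, workers)
```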
@Ar_Douillard
Arthur Douillard
10 months
"White House says no need to restrict ā€˜open-source’ artificial intelligence — at least for now".
19
67
399
@Ar_Douillard
Arthur Douillard
2 years
🚨 We released our work on data parallelism for language models *distributed* across the entire world! 🧵 Thread below šŸ‘‡
@_akhaliq
AK
2 years
DiLoCo: Distributed Low-Communication Training of Language Models. paper page: Large language models (LLM) have become a critical component in many applications of machine learning. However, standard approaches to training LLM require a large number of
Tweet media one
17
66
373
@Ar_Douillard
Arthur Douillard
4 years
GitHub + VS Code in your browser = 🤯. Just add "1s" before the ".com", and tada! Here is an example with our Continual Learning library Continuum:
Tweet media one
9
83
372
@Ar_Douillard
Arthur Douillard
8 months
Dynamic Compute. Matformer ( proposes to jointly train multiple models, each a subset of the other, like matryoshka dolls (a toy sketch is below). Flextron ( extends it with a learned router, to dynamically choose which subset to take according to more
Tweet media one
Tweet media two
5
55
365
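A toy sketch of the matryoshka idea described above, in its simplest possible form: one FFN whose smaller sub-models are prefixes of its hidden dimension, so a learned router (as in Flextron) could pick a compute budget per input. All names are illustrative.

```python
# Nested ("matryoshka") FFN: the same weights serve several compute budgets,
# each smaller model being a prefix of the hidden dimension.
import torch
import torch.nn as nn

class NestedFFN(nn.Module):
    def __init__(self, d_model=256, d_hidden=1024):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)

    def forward(self, x, frac=1.0):
        h = self.w_in(x)
        k = max(1, int(frac * h.shape[-1]))      # keep only a prefix of the hidden units
        h = torch.relu(h[..., :k])
        return h @ self.w_out.weight[:, :k].T + self.w_out.bias

ffn = NestedFFN()
x = torch.randn(2, 16, 256)
full, half = ffn(x, frac=1.0), ffn(x, frac=0.5)   # same weights, two compute budgets
```

During joint training one would sample `frac` per step so that every prefix stays a usable model; at inference a router can then pick `frac` per input.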
@Ar_Douillard
Arthur Douillard
9 months
AlphaZero search to explore new circuit designs!
Tweet media one
7
56
299
@Ar_Douillard
Arthur Douillard
9 months
Obsidian Research Workflow. TL;DR: to do literature reviews faster and link things together. I keep it simple: 3 kinds of notes, designated by hashtags and colored differently on the graph: papers (green), notes (yellow), and Maps-of-Content (red).
Tweet media one
10
23
301
@Ar_Douillard
Arthur Douillard
1 year
I'm super excited to release DiPaCo, a new kind of mixture of experts that can scale, engineering-wise, to data centers across the entire world! A few words about it in this thread 🧵
@_akhaliq
AK
1 year
Google presents DiPaCo. Distributed Path Composition. Progress in machine learning (ML) has been fueled by scaling neural network models. This scaling has been enabled by ever more heroic feats of engineering, necessary for accommodating ML approaches that require high
Tweet media one
12
49
296
@Ar_Douillard
Arthur Douillard
3 months
really enjoyed "Old Optimizer, New Norm: An Anthology": they look at three optimizers (Adam, Shampoo, Prodigy) and remark that they are all doing steepest descent under a particular norm (see the equation sketch below). Classical SGD is doing steepest descent according to the l2 norm, Adam &
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
32
304
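A compressed sketch of that steepest-descent framing (notation mine, not the anthology's):

```latex
% Steepest descent: the update solves a linearized loss plus a norm penalty
\Delta^{\star} \;=\; \arg\min_{\Delta}\; g^{\top}\Delta \;+\; \frac{1}{2\eta}\,\lVert\Delta\rVert^{2}
% \ell_2 norm            ->  \Delta^{\star} = -\eta\, g                       (SGD)
% \ell_\infty norm       ->  \Delta^{\star} = -\eta\,\operatorname{sign}(g)   (Adam without EMAs, i.e. signSGD)
% spectral norm on weight matrices  ->  Shampoo-style whitened updates
```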
@Ar_Douillard
Arthur Douillard
2 years
Main topic in NeurIPS parties is GPT-4. The rumors are wild.
11
17
269
@Ar_Douillard
Arthur Douillard
2 years
🚨 My team at @DeepMind is looking for a Research Engineer in Efficient Large-Scale Learning! šŸ‘‰ ā“ Unprecedented scale + efficient adaptation to new tasks. šŸ“š Distributed large-scale learning and continual learning!
5
37
246
@Ar_Douillard
Arthur Douillard
3 years
Google released Minerva a few days ago (, a language model (PaLM) that solves high-school math problems. Funny part: the prompt includes "I hope it is correct"!
Tweet media one
9
33
243
@Ar_Douillard
Arthur Douillard
4 months
re: OpenAI's DeepResearch on Humanity's Last Exam. on one hand, it's unfair to compare an AI with web access vs others without, especially for knowledge-intensive tasks. on the other hand, we need to stop cramming random facts into a model, and invest more in tool use.
16
19
219
@Ar_Douillard
Arthur Douillard
1 year
Something is cooking šŸ•µļøā€ā™‚ļø
Tweet media one
9
7
207
@Ar_Douillard
Arthur Douillard
8 months
Lots of papers recently about having more dynamic compute in MoEs:
* different-sized experts:
* different # of activated experts:
5
38
209
@Ar_Douillard
Arthur Douillard
7 months
KV Prediction for Improved Time to First Token. LLM inference can be split into two phases: prefilling and decoding. The decoding phase is autoregressive: tokens are generated one by one, re-using previous key/value tensors in the KV cache (a toy decode loop is sketched below). To speed up that
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
47
206
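A toy greedy-decode loop illustrating the prefill/decode split and the KV cache described above; single head, no real transformer block, and all names are mine:

```python
# Prefill once, then decode token by token while appending to the K/V cache.
import torch

def attention(q, K, V):
    w = torch.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)
    return w @ V

def decode(prompt_h, steps, wq, wk, wv):
    # Prefill: process the whole prompt once and cache its keys/values.
    K, V = prompt_h @ wk, prompt_h @ wv
    h = prompt_h[-1]
    outputs = []
    for _ in range(steps):
        # Decode: one new position at a time, reusing the cached K/V.
        q = h @ wq
        h = attention(q, K, V)
        K = torch.cat([K, (h @ wk).unsqueeze(0)])   # append this step's key
        V = torch.cat([V, (h @ wv).unsqueeze(0)])   # ...and value
        outputs.append(h)
    return torch.stack(outputs)

d = 16
out = decode(torch.randn(10, d), 5, *(torch.randn(d, d) for _ in range(3)))
```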
@Ar_Douillard
Arthur Douillard
5 months
The NeurIPS best paper is worth $1.1 million (?)
Tweet media one
5
19
200
@Ar_Douillard
Arthur Douillard
2 months
naming is fun in ML. DiLoCo is basically FedOpt with inner Adam and outer Nesterov and without client sampling, and that's the best combo known yet (rough outer-step sketch below). FedOpt is FedAvg/LocalSGD with adaptive optimization magic. FedAvg/LocalSGD is… this NeurIPS paper from 1993 🤯
Tweet media one
3
22
202
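A rough sketch of that recipe, assuming the usual DiLoCo formulation (inner Adam on each replica, outer SGD with Nesterov momentum applied to the "outer gradient" = old global params minus the averaged replicas); toy models and data, not the paper's code:

```python
# DiLoCo-style outer step: inner AdamW per replica, Nesterov on the outer gradient.
import copy
import torch

def inner_steps(model, data, steps=50):
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    for _ in range(steps):
        x, y = data()
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

def outer_step(global_model, workers, outer_opt):
    old = {k: v.detach().clone() for k, v in global_model.state_dict().items()}
    replicas = [inner_steps(copy.deepcopy(global_model), d) for d in workers]
    avg = {k: torch.stack([r.state_dict()[k].detach() for r in replicas]).mean(0)
           for k in old}
    # The "outer gradient" is old minus averaged-new; Nesterov momentum applies it.
    for name, p in global_model.named_parameters():
        p.grad = old[name] - avg[name]
    outer_opt.step(); outer_opt.zero_grad()

model = torch.nn.Linear(4, 1)
outer_opt = torch.optim.SGD(model.parameters(), lr=0.7, momentum=0.9, nesterov=True)
workers = [lambda: (torch.randn(8, 4), torch.randn(8, 1)) for _ in range(4)]
outer_step(model, workers, outer_opt)
```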
@Ar_Douillard
Arthur Douillard
7 months
Don't waste flops, upcycle ā™»ļø. 1. Upcycling ( @arankomatsuzaki et al.) proposes to transform a pretrained dense network into an MoE by duplicating its MLP layers (toy sketch below). You still have to train the (relatively small) routers from scratch, but this saves lots of
Tweet media one
Tweet media two
Tweet media three
Tweet media four
5
35
187
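A toy sketch of the upcycling step as summarized above: copy a pretrained dense FFN into every expert of a fresh MoE layer, and train only a new, randomly initialized router from scratch. The top-k routing below is the simplest loop-based version, purely illustrative:

```python
import copy
import torch
import torch.nn as nn

class UpcycledMoE(nn.Module):
    def __init__(self, dense_ffn, d_model, n_experts=8, top_k=2):
        super().__init__()
        # Every expert starts as an exact copy of the pretrained dense FFN.
        self.experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(n_experts))
        # The router is the only part trained from scratch.
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        scores = torch.softmax(self.router(x), dim=-1)
        topv, topi = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):         # naive routing loop, clarity over speed
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

dense = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
moe = UpcycledMoE(dense, d_model=64)
y = moe(torch.randn(10, 64))
```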
@Ar_Douillard
Arthur Douillard
3 months
one more step towards decentralized learning: Eager Updates. can we overlap communication with computation over hundreds of steps? yes we can. in this work led by @SatyenKale, we improve DiLoCo and use 1177x less bandwidth than data-parallel
Tweet media one
3
21
173
@Ar_Douillard
Arthur Douillard
10 months
Running out of human-generated data by 2028?
Tweet media one
9
11
159
@Ar_Douillard
Arthur Douillard
6 months
Adaptive Decoding via Latent Preference Optimization (toy sketch below):
* Add a small MLP + classifier which predicts a temperature per token.
* They train the MLP with a variant of DPO ( with the temperatures as latents.
* predicted temp for
Tweet media one
0
35
162
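A toy sketch of the inference-time part described above: a small MLP head predicts a per-token temperature from the hidden state and the logits are rescaled before sampling. The DPO-style training of that head is not shown; all names and ranges are assumptions:

```python
import torch
import torch.nn as nn

class TemperatureHead(nn.Module):
    def __init__(self, d_model: int, t_min: float = 0.1, t_max: float = 2.0):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))
        self.t_min, self.t_max = t_min, t_max

    def forward(self, hidden):                 # hidden: (batch, d_model)
        # Squash the prediction into a sane temperature range.
        t = torch.sigmoid(self.mlp(hidden))
        return self.t_min + (self.t_max - self.t_min) * t

head = TemperatureHead(d_model=512)
hidden, logits = torch.randn(4, 512), torch.randn(4, 32000)
temp = head(hidden)                            # (4, 1): one temperature per token
probs = torch.softmax(logits / temp, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```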
@Ar_Douillard
Arthur Douillard
3 months
updated my website laying out my dream: compute arbitrage across the world. no device should be left idle.
Tweet media one
7
13
163
@Ar_Douillard
Arthur Douillard
3 years
From being a computer lover, to being a Doctor in computer science
Tweet media one
Tweet media two
9
2
159
@Ar_Douillard
Arthur Douillard
5 months
Very cool to see @JeffDean highlighting our DiPaCo ( in the MoE history at Google! (see last line šŸ‘€)
Tweet media one
@swyx
swyx šŸ”œ @aiDotEngineer (Jun 3-5)
5 months
oh wow, @JeffDean dropping incredibles amounts of lore on Gemini 1.5 Pro and 2.0 Flash this morning. the references alone can fill up the @latentspacepod paper club for a year. (this is only part of the "everything we do better than you at google" talk, he is talking about ai
Tweet media one
Tweet media two
3
13
154
@Ar_Douillard
Arthur Douillard
4 months
the first job of the ASI will be to write proper software for AMD.
7
5
151
@Ar_Douillard
Arthur Douillard
4 years
Ok, I've learned today that there is an 'inference_mode' context manager that does the 'no_grad' job, but with added speed (usage example below). Seen in Grokking PyTorch:
Tweet media one
6
26
148
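A small usage example of the context manager in question; `torch.inference_mode()` skips some extra autograd bookkeeping (view and version tracking) on top of what `no_grad` already does, which is where the added speed comes from:

```python
import torch

model = torch.nn.Linear(16, 4)
x = torch.randn(8, 16)

with torch.no_grad():          # classic: no graph is built
    y1 = model(x)

with torch.inference_mode():   # same idea, with extra autograd bookkeeping skipped
    y2 = model(x)

print(y1.requires_grad, y2.requires_grad)   # False False
```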
@Ar_Douillard
Arthur Douillard
3 years
šŸŽ„ It's Christmas time, so we recently added plenty of datasets for continual learning! 50 datasets for classification & segmentation. 7 different continual scenarios. šŸ‘‰ And surprise, we now support HuggingFace's NLP datasets! šŸ‘‡šŸ§µ
Tweet media one
1
21
144
@Ar_Douillard
Arthur Douillard
8 months
Federated Learning methods (FedAvg, FedOpt, DiLoCo, OpenDiLoCo, etc.) reduce communication between cross-region datacenters by only exchanging an ~outer gradient every once in a while.
Tweet media one
Tweet media two
1
26
145
@Ar_Douillard
Arthur Douillard
4 months
Workshop alert 🚨. We'll host a workshop on modularity at ICLR 2025, encompassing collaborative + decentralized + continual learning. Those topics are on the critical path to building better AIs. Interested? Submit a paper and join us in Singapore!
Tweet media one
4
39
138
@Ar_Douillard
Arthur Douillard
7 months
ā€œSome sort of Federated Learning, async distributed computing will have to work.ā€ — Jensen. on it 🫔
@BG2Pod
Bg2 Pod
7 months
BG2. Ep 17. Double $NVDA! System Level Comp Moat, ā€œInsane Demandā€, Inference Explosion 1 B x, Memphis Supercluster, OpenAI, & more. @altcap @_clarktang @bgurley. (00:00) Intro. (1:50) The Evolution of AGI and Personal Assistants. (06:03) NVIDIA's
8
25
140
@Ar_Douillard
Arthur Douillard
7 months
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data. I strongly believe that synthetic data will be needed for these kinds of high-quality domains where it's hard to get tons of annotations
Tweet media one
6
33
137
@Ar_Douillard
Arthur Douillard
1 month
in singapore for ICLR. who should i chat to about distributed learning?
Tweet media one
12
6
140
@Ar_Douillard
Arthur Douillard
10 months
Mixture of Nested Experts: route tokens to encoder transformer blocks of different sizes. Harder tokens should ideally be routed to larger blocks. Communication (non-masked, it's an encoder) still happens between tokens as they are projected to the same
Tweet media one
3
29
132
@Ar_Douillard
Arthur Douillard
8 months
Exploring Scaling Laws for Local SGD in Large Language Model Training. They propose scaling laws for DiLoCo ( ~FedOpt/LocalSGD with inner Adam + outer Nesterov) from 5M to 3B parameters. Afaik this is the largest publicly trained
Tweet media one
3
18
132
@Ar_Douillard
Arthur Douillard
11 months
No code too.
@LeopolisDream
Alex Yanko šŸ‡ŗšŸ‡¦
11 months
Welcome the new architecture: Terminator. No residuals, no dot-product attention, no normalization.
Tweet media one
5
3
126
@Ar_Douillard
Arthur Douillard
6 months
LLMs Know More Than They Show (probe sketch below):
* Adding a truth-seeking classifier probe on the token embeddings can have better performance than the actual generation.
* Is something going wrong in the decoding part?
* Those error detectors don't generalize across
Tweet media one
5
19
126
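A minimal sketch of the probing setup described above: a small classifier trained on top of frozen token embeddings to predict whether the generation was correct. Dimensions, data, and labels below are random placeholders:

```python
import torch
import torch.nn as nn

probe = nn.Sequential(nn.Linear(4096, 256), nn.ReLU(), nn.Linear(256, 2))
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

embeddings = torch.randn(512, 4096)           # e.g. hidden state of the answer token
is_correct = torch.randint(0, 2, (512,))      # 1 = the generation was correct

for _ in range(100):
    logits = probe(embeddings)                # the LLM itself stays frozen
    loss = nn.functional.cross_entropy(logits, is_correct)
    opt.zero_grad(); loss.backward(); opt.step()
```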
@Ar_Douillard
Arthur Douillard
6 months
Very interesting. DeMo syncs only the fast components of the signal, while the slower components are accumulated in the momentum. Two things that come to mind: the intuition of keeping the slow gradients (grokfast: ), and reducing
Tweet media one
@teortaxesTex
Teortaxesā–¶ļø (DeepSeek ęŽØē‰¹šŸ‹é“ē²‰ 2023 – āˆž)
6 months
So about that Nous DisTrO project, of which many very cracked people are skeptical. Now there's paper on arxiv and PyTorch implementation on github. Also it's called DeMo (Decoupled Momentum Optimization). These are results. This is the idea. I'm interested in how this ends.
Tweet media one
Tweet media two
5
16
113
@Ar_Douillard
Arthur Douillard
6 months
Distributed Decentralized Training of Neural Networks: A Primer. Covers DP's AllReduce, variants thereof + advanced methods such as SWARM ( and DiLoCo ().
2
20
111
@Ar_Douillard
Arthur Douillard
3 months
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU. what: proposes to extend context to 1M tokens without additional training [1]. 1. pruning:
* do three pruning stages
* in each, group contiguous tokens into chunks; in each chunk, select a
Tweet media one
Tweet media two
Tweet media three
5
22
108
@Ar_Douillard
Arthur Douillard
10 months
In 2023, we showed that with our particular federated/distributed configuration DiLoCo (Adam + Nesterov = ā¤ļø), you could train a relatively large (0.5B) LLM distributed across the world while being as good as centralized training. @PrimeIntellect released
Tweet media one
3
15
105
@Ar_Douillard
Arthur Douillard
8 months
OpenDiLoCo ( with 7B and trained across the world?
@PrimeIntellect
Prime Intellect
8 months
Contribute compute and power the future of decentralized training. Coming soon šŸ¦‹
Tweet media one
2
19
103
@Ar_Douillard
Arthur Douillard
3 years
PixMix: merging images with fractals. Leads to models more robust to corruption and adversarial attacks, with better calibration, etc., than the other baselines (MixUp, CutMix, CutOut).
Tweet media one
2
24
100
@Ar_Douillard
Arthur Douillard
9 months
HMoE - Heterogeneous Mixture of Experts for Language Modeling. TL;DR: MoE with experts of various capacities. 1. Thanks to MegaBlocks ( we can run an MoE layer with different expert sizes. 2. the classical balancing loss isn't enough as the
Tweet media one
3
16
102
@Ar_Douillard
Arthur Douillard
1 month
30+ accepted papers. 6 oral papers. 6 guest speakers. join us at @iclr_conf on the 27th, Hall 4 #3, for a full-day workshop on Modularity for Collaborative, Decentralized, and Continual Learning. @derylucio, Fengyuan Liu, and myself will be organizing
Tweet media one
@Ar_Douillard
Arthur Douillard
4 months
Workshop alert 🚨. We'll host a workshop on modularity at ICLR 2025, encompassing collaborative + decentralized + continual learning. Those topics are on the critical path to building better AIs. Interested? Submit a paper and join us in Singapore!
Tweet media one
3
28
102
@Ar_Douillard
Arthur Douillard
4 years
PoolFormer: replacing self-attention / spatial MLP / Fourier transform with a simple average pooling (toy token mixer below).
- Fewer operations (each pooling reducing the number of tokens by 2x).
- As good as other "metaformers".
Are we going to reinvent convnets?
Tweet media one
1
27
97
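A rough sketch of a pooling token mixer in the PoolFormer spirit: a plain average pool replaces self-attention, and the paper subtracts the input so the mixer contributes only neighbourhood information. Pool size and shapes below are illustrative:

```python
import torch
import torch.nn as nn

class PoolMixer(nn.Module):
    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):            # x: (batch, channels, height, width) token grid
        return self.pool(x) - x      # mixing only; the residual is added outside

mixer = PoolMixer()
tokens = torch.randn(2, 64, 14, 14)
mixed = tokens + mixer(tokens)       # metaformer block: residual + token mixing
```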
@Ar_Douillard
Arthur Douillard
2 years
0
15
94
@Ar_Douillard
Arthur Douillard
10 months
Two of my favorite AI-literature-review blogs are @lilianweng's and @cwolferesearch's. What other blogs should I read? Broad scope probably preferred to a single deep topic.
9
7
93
@Ar_Douillard
Arthur Douillard
8 months
So DisTrO is kind of an ElasticSGD ( / PAPA (?
@nearcyan
near
8 months
DisTrO formula released
Tweet media one
4
17
87
@Ar_Douillard
Arthur Douillard
12 days
this infra framework ( + using SWARM ( on the inference node to fit ultra-large models is going to be the future. one step closer to the GitTheta ( dream
Tweet media one
@PrimeIntellect
Prime Intellect
13 days
Releasing INTELLECT-2: We’re open-sourcing the first 32B parameter model trained via globally distributed reinforcement learning:
• Detailed Technical Report
• INTELLECT-2 model checkpoint
6
15
94
@Ar_Douillard
Arthur Douillard
8 months
Really cool to see more efforts leveraging available compute as a commodity, here for inference: To go from inference to training, new algorithms ( and new models will have to be invented (
Tweet media one
2
10
88
@Ar_Douillard
Arthur Douillard
10 months
DeepSeek (MoE paper: ) and Ant (Matformer-like for vision: ) are both Chinese finance companies publishing AI research. Do we have an equivalent in the West?
10
18
89
@Ar_Douillard
Arthur Douillard
1 year
The Future of Large Language Model Pre-training is Federated, amen to that!.
4
20
85
@Ar_Douillard
Arthur Douillard
2 months
i’m going to repeat myself: DiLoCo just works.
@MatharyCharles
Zachary Charles
2 months
We just put out a key step for making distributed training work at larger and larger models: Scaling Laws for DiLoCo. TL;DR: We can do LLM training across datacenters in a way that scales incredibly well to larger and larger models!
Tweet media one
3
7
85
@Ar_Douillard
Arthur Douillard
4 years
I love when fixing a bug in my neural network degrades its performance.
0
7
79
@Ar_Douillard
Arthur Douillard
3 years
I've got my Christmas present early: more than 11k unique visitors on my deep learning for computer vision course! šŸ¤— I'm so happy! šŸ‘‰ 2022's goal: recording a video for each lesson
Tweet media one
0
14
79
@Ar_Douillard
Arthur Douillard
11 months
I really wish we could have open-sourced DiLoCo. But it’s even better to see it reproduced by others. It’s crazy that distributed computing works so well despite communicating orders of magnitude less. Accelerate! šŸš€
4
8
82
@Ar_Douillard
Arthur Douillard
27 days
ok @iclr_conf, im logging off — thanks for the week!
Tweet media one
@Ar_Douillard
Arthur Douillard
28 days
starting soon, in hall 4 #3. see you all there and follow that thread :)
Tweet media one
1
2
84
@Ar_Douillard
Arthur Douillard
6 months
Very, very cool: you can now train your LLM across the world with a dynamically sized compute pool! Awesome work by the PrimeIntellect team to make DiLoCo ( available to everyone, with lots of engineering tricks to be so efficient.
@PrimeIntellect
Prime Intellect
6 months
Releasing INTELLECT-1: We’re open-sourcing the first decentralized trained 10B model:
- INTELLECT-1 base model & intermediate checkpoints
- Pre-training dataset
- Post-trained instruct models by @arcee_ai
- PRIME training framework
- Technical paper with all details
3
22
80
@Ar_Douillard
Arthur Douillard
4 months
really like this point from @jackclarkSF's Import AI: having a perfect, super-smart assistant decreases the importance of hard skills but increases the importance of agency and curiosity
Tweet media one
3
14
83
@Ar_Douillard
Arthur Douillard
9 months
FocusLLM: Scaling LLMs' Context by Parallel Decoding. TL;DR: train an LM with an 8k context that can extrapolate to 400k by compressing the long context into a few tokens. 1. split the context (green) into chunks, and to each append the full prompt (yellow). 2.
Tweet media one
3
12
81
@Ar_Douillard
Arthur Douillard
6 months
World-wide decentralized training with an open-source DiLoCo is done. Note that date for history books.
@PrimeIntellect
Prime Intellect
6 months
We did it — the first decentralized training of a 10B model is complete!. Trained across the US, Europe, and Asia 🌐. Post-training with @arcee_ai is underway, and a full open-source release is coming in ~1 week, including: base model, checkpoints, post-trained model and data.
Tweet media one
3
9
79
@Ar_Douillard
Arthur Douillard
9 months
Open Mixture-of-Experts Language Models. Very cool paper explaining all the design choices on how they made a powerful small MoE.
Tweet media one
2
8
76
@Ar_Douillard
Arthur Douillard
3 months
don't forget
Tweet media one
4
0
77
@Ar_Douillard
Arthur Douillard
4 years
I submitted my first paper ever to CVPR2020 and got rejected; it was hard. But I'm happy to announce that my third paper, PLOP, has been accepted to #CVPR2021! Code will be released soon!
@Ar_Douillard
Arthur Douillard
4 years
New work from Y.Chen, A.Dapogny, @quobbe, and myself. We tackle Continual Semantic Segmentation by introducing a novel distillation loss exploiting local & global details, and an uncertainty-based pseudo-labeling handling background shift. (We are PLOP).
Tweet media one
4
7
73
@Ar_Douillard
Arthur Douillard
4 months
four challenges could prevent scaling according to @EpochAIResearch. 3 of them may be alleviated by DiLoCo:
1. Localized power constraints: distributed training could use power plants anywhere on earth.
2. Chip production capacity: async communication allows using heterogeneous
Tweet media one
3
5
73
@Ar_Douillard
Arthur Douillard
9 months
The AdEMAMix optimizer ( keeps two EMAs for the numerator of Adam, a fast one (low \beta) and a slow one (high \beta); see the rough sketch below. It could explain the good performance of FedOpt variants, like DiLoCo (, with 1 optimizer for the gradient, and 1
Tweet media one
2
18
71
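A back-of-the-envelope sketch of the AdEMAMix update as I understand it: Adam's numerator becomes a mix of a fast, bias-corrected EMA and a slow EMA of the gradient. The hyperparameters below are illustrative, not the paper's:

```python
import torch

def ademamix_step(p, g, state, lr=1e-3, b1=0.9, b2=0.999, b3=0.9999, alpha=5.0, eps=1e-8):
    state["t"] += 1
    state["m1"] = b1 * state["m1"] + (1 - b1) * g        # fast EMA of the gradient
    state["m2"] = b3 * state["m2"] + (1 - b3) * g        # slow EMA of the gradient
    state["v"]  = b2 * state["v"]  + (1 - b2) * g * g    # usual second moment
    m1_hat = state["m1"] / (1 - b1 ** state["t"])
    v_hat  = state["v"]  / (1 - b2 ** state["t"])
    p -= lr * (m1_hat + alpha * state["m2"]) / (v_hat.sqrt() + eps)

p, g = torch.zeros(10), torch.randn(10)
state = {"t": 0, "m1": torch.zeros(10), "m2": torch.zeros(10), "v": torch.zeros(10)}
ademamix_step(p, g, state)
```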
@Ar_Douillard
Arthur Douillard
4 months
with DeepSeek-R1-Zero, 2025 may be the year when "Reward is enough" becomes trendy again.
7
8
71
@Ar_Douillard
Arthur Douillard
7 months
"Open models have lagged on benchmarks by 5 to 22 months". As long as we don't have yet a recursive intelligence explosion, this is quite a bullish news for open models. 1-2 years delay isn't such a big deal for actual applications
Tweet media one
3
8
72
@Ar_Douillard
Arthur Douillard
1 year
My team is looking for a research engineer in New York! Our recent efforts include DiLoCo (distributed learning) and DiPaCo (distributed mixture of experts). Those projects, which I've co-led, were the most exciting projects I've contributed to, and I can tell you one thing: there.
4
5
71
@Ar_Douillard
Arthur Douillard
3 years
The first transformer designed for Continual Learning in Computer Vision has been accepted to #CVPR2022! 🄳. Using a dynamic approach, it forgets less than previous ensembling methods while using fewer parameters. šŸ’»: šŸ“•: šŸ§µšŸ‘‡
Tweet media one
4
16
70
@Ar_Douillard
Arthur Douillard
6 months
Min-p Sampling (toy implementation below):
1. Get the max prob.
2. Find the min prob as a threshold \in [0, 1] \times that max prob.
3. Gather only token probs above that min prob.
4. Sample in that pool, according to renormalized probs.
More robust to changes in temperature!
Tweet media one
1
11
70
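A direct toy translation of the four steps above into code; the threshold value is illustrative:

```python
import torch

def min_p_sample(logits: torch.Tensor, p_base: float = 0.1) -> torch.Tensor:
    probs = torch.softmax(logits, dim=-1)
    p_max = probs.max(dim=-1, keepdim=True).values                           # 1. max prob
    threshold = p_base * p_max                                               # 2. scaled min prob
    probs = torch.where(probs >= threshold, probs, torch.zeros_like(probs))  # 3. keep the pool
    probs = probs / probs.sum(dim=-1, keepdim=True)                          # 4. renormalize...
    return torch.multinomial(probs, num_samples=1)                           #    ...and sample

token = min_p_sample(torch.randn(1, 32000) / 0.7)   # behaves sensibly across temperatures
```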
@Ar_Douillard
Arthur Douillard
5 months
Run DiLoCo distributed training on Apple Silicon! They also provide a codebase to simulate DiLoCo on your local MacBook: That’s great for quick experimentation, like with nanoGPT.
@exolabs
EXO Labs
5 months
Distributed training on M4 Mac Mini cluster. We implemented @GoogleDeepMind DiLoCo on Apple Silicon to train large models with 100-1000x less bandwidth compared to DDP baseline. AI is entering a new era where a distributed network of consumer devices can train large models.
0
6
71
@Ar_Douillard
Arthur Douillard
2 months
what kind of method could enable @huggingface to train a model across the world? šŸ¤”.
@johannes_hage
Johannes Hagemann
2 months
.@Thom_Wolf on the Boom project training a 70-100B parameter model decentralized
Tweet media one
4
6
70
@Ar_Douillard
Arthur Douillard
7 years
@ncremins GDPR is coming.
0
24
65
@Ar_Douillard
Arthur Douillard
3 months
one more implementation of DiLoCo to do distributed training! @PyTorch's TorchFT fault-tolerance package has an implementation of DiLoCo. hopefully soon a Streaming DiLoCo too?
Tweet media one
2
7
70
@Ar_Douillard
Arthur Douillard
3 months
@JeffDean @dwarkesh_sp some people actually came to me in SF and told me "but DiLoCo is actually working!", very surprised that it wasn't just another paper misleading with outlandish claims. learn more:
@Ar_Douillard
Arthur Douillard
4 months
We release today the next step for distributed training: Streaming DiLoCo with Overlapping Communication. TL;DR: train data-parallel across the world over low bandwidth at the same performance: 400x fewer bits exchanged & huge latency tolerance
Tweet media one
2
1
68
@Ar_Douillard
Arthur Douillard
2 years
My last paper as a PhD student 🤩.
@mlia_isir
MLIA
2 years
Pending #CVPR2023 in June, we are pleased to share our 4 accepted papers. (3/4) "CoMFormer: Continual Learning in Semantic and Panoptic Segmentation".by @fcdl94, @quobbe, @Ar_Douillard . preprint: Collab w/ Politecnico di Torino and @heuritechlab
Tweet media one
0
2
64
@Ar_Douillard
Arthur Douillard
3 years
Great! I'm finishing my PhD in June, and CVPR 2022 will be my only opportunity to attend an in-person conference during my whooole PhD.
@CVPR
#CVPR2025
3 years
Message from our #CVPR2022 Program Chairs:. Unless the epidemiologicalĀ situation changes drastically, CVPR 2022 will be in person, with an onlineĀ option for those who cannot travel. Information on visaĀ letters will be sent to authors in the nextĀ few days.
3
1
65
@Ar_Douillard
Arthur Douillard
6 months
šŸ‘€.
3
12
67
@Ar_Douillard
Arthur Douillard
1 year
Something I didn't fully realize during my PhD but now see: the extended bitter lesson is that hyperparameters are sometimes more important than a new architecture. So many research papers proposing new archi/losses/optimizers would get crushed by a well-tuned baseline.
@andrew_n_carr
Andrew Carr (e/🤸)
1 year
The DeepSeek-V2 paper was full of pretty amazing nuggets of wisdom. I spent the afternoon copying lots of their training setup into our model. Orange is previous and Blue is new with DeepSeek hyper parameters. Things that mattered most:. 1. Warm up LR ratio.2. Batch ramp
Tweet media one
3
5
63
@Ar_Douillard
Arthur Douillard
5 years
I'm proud to present my first ever paper, recently uploaded on arXiv: "Small Task Incremental Learning". We design a novel distillation loss that outperforms the previous SotA by a large margin, especially on 50 tasks of only 1 class!
Tweet media one
7
20
65
@Ar_Douillard
Arthur Douillard
5 years
Tired of implementing the many data settings of Continual Learning? @TLesort & I present Continuum! A PyTorch library that enables you, in a few lines, to have a continual dataset: MNIST, PermutedMNIST, CIFAR10/CIFAR100, ImageNet, CORe50, and many more!
Tweet media one
1
26
65
@Ar_Douillard
Arthur Douillard
8 months
Delayed Parameter Update (DPU; ) splits the computation into 2 streams, CPU and GPU, to perform the param update in parallel with the forward pass. Its delay brings instability, thus instead of using the stale params, ACCO ( proposes to estimate
Tweet media one
Tweet media two
3
11
65
@Ar_Douillard
Arthur Douillard
1 month
so much research in distributed/federated learning is on 1) CIFAR-level datasets and 2) CNN archis. there, the dataset size, model size, and required flops are ridiculously small; this is not the setting in which to study distributed learning!
6
3
65
@Ar_Douillard
Arthur Douillard
6 months
DeepSeek released DeepSeek-R1, an "equivalent" to OpenAI's o1: Given that DeepSeek has been very open in the past (e.g. ), I'm very hopeful they will disclose more details about R1 too
Tweet media one
0
13
62
@Ar_Douillard
Arthur Douillard
8 months
Wow! DiLoCo ( and OpenDiLoCo ( were recognized by @nathanbenaich's @stateofaireport. Lots of research existed before, but I believe in 2024, and even more in 2025, we'll switch from an exploration mode to an exploitation mode. Scaling has
Tweet media one
5
5
63
@Ar_Douillard
Arthur Douillard
3 years
Transformers for Small-Scale Datasets (see the masking sketch below):
- Tokenization with overlap between patches.
- Add pooling to reduce the number of tokens.
- Mask the diagonal attention logits with -āˆž to avoid tokens attending to themselves.
- Add a learned temperature.
- Improve all archi
Tweet media one
Tweet media two
1
6
59
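A minimal sketch of one trick from the list above: masking the diagonal of the attention logits so tokens cannot attend to themselves. Shapes are illustrative:

```python
import torch

def mask_self_attention(attn_logits: torch.Tensor) -> torch.Tensor:
    # attn_logits: (..., n_tokens, n_tokens)
    n = attn_logits.shape[-1]
    eye = torch.eye(n, dtype=torch.bool, device=attn_logits.device)
    return attn_logits.masked_fill(eye, float("-inf"))   # zero attention to self after softmax

logits = torch.randn(2, 4, 16, 16)                        # (batch, heads, tokens, tokens)
weights = torch.softmax(mask_self_attention(logits), dim=-1)
```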
@Ar_Douillard
Arthur Douillard
3 months
super intelligence *will* be distributed.
@ClementDelangue
clem šŸ¤—
3 months
Distributed intelligence > super intelligence!.
4
5
61
@Ar_Douillard
Arthur Douillard
1 year
I think people don’t realize the progress in AI that happened in the last 5 years. Being close to the level of a junior dev isn’t impressive anymore?
@gonza_nardini
Gonza Nardini
1 year
@itsandrewgao Sounds a bit disappointing honestly. The requests were a bit hard, but a good AI should be able to solve these, they aren't exactly rocket science. I think most jr devs would be able to solve them. One request it couldn't even complete and the other just deployed a buggy solution.
7
1
59
@Ar_Douillard
Arthur Douillard
7 months
New life goal.
Tweet media one
3
6
56
@Ar_Douillard
Arthur Douillard
1 year
TPUs are pretty great tbh. One of the best moves Google ever made.
@a__tomala
Alex Tomala
1 year
Just use TPUs.
3
1
57
@Ar_Douillard
Arthur Douillard
9 months
It’s crazy that model merging works as well as it does. Strongly recommend following @Mitchnw @ramealexandre @prateeky2806 who have done a lot in that field.
@cwolferesearch
Cameron R. Wolfe, Ph.D.
9 months
Model merging is a popular research topic with applications to LLM alignment and specialization. But, did you know this technique has been studied since the 90s? Here’s a brief timeline…. (Stage 0) Original work on model merging dates back to the 90s [1], where authors showed
Tweet media one
6
6
58