Max Ryabinin
@m_ryabinin
2K Followers · 301 Following · 46 Media · 184 Statuses
Large-scale deep learning & research @togethercompute. Learning@home/Hivemind author (DMoE, DeDLOC, SWARM, Petals). PhD in decentralized DL, 2023.
Joined October 2020
In our new #ACL2024 paper, we show that LLMs remain sensitive to prompt formats even with improved few-shot techniques. Our findings suggest that careful evaluation needs to take this lack of robustness into account 📜: https://t.co/05fi54Q5MC 🖥️: https://t.co/US73vNQTlM
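As a minimal illustration of the formatting sensitivity studied here, the sketch below renders one and the same few-shot example under several equally reasonable formats; the separators and field names are arbitrary choices for illustration, not the paper's templates.

```python
# Toy illustration of prompt-format sensitivity: the same few-shot example
# rendered under different (equally reasonable) formats. Evaluating a model
# on only one of these can over- or under-state its accuracy.
from itertools import product

example = {"text": "The movie was great", "label": "positive"}
query = "The plot made no sense"

separators = [": ", " = ", ":\n"]
field_names = [("Input", "Label"), ("Text", "Sentiment"), ("Review", "Answer")]

for sep, (inp, out) in product(separators, field_names):
    prompt = (
        f"{inp}{sep}{example['text']}\n{out}{sep}{example['label']}\n\n"
        f"{inp}{sep}{query}\n{out}{sep}"
    )
    print(repr(prompt))
    print("-" * 40)
```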
I found a cool tech report on combining DiLoCo with @m_ryabinin's fault-tolerant SWARM pipelining and checked what the author is doing now. I should have guessed: he's at @PrimeIntellect now.
📣 LLM-as-a-qualitative-judge: automating error analysis in natural language generation
TL;DR: our approach outputs a summary of error types and their counts in an NLP system
📜 Paper: https://t.co/i5ZRe9ttvd
💻 Code to try it on your task: https://t.co/4ahLrSjbnP
#NLProc #LLM
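As a rough sketch of the shape of that output (per-error-type counts aggregated over instances) using a placeholder judging function, not the paper's actual pipeline:

```python
# Rough sketch of the *shape* of the output described above: a per-error-type
# count aggregated over instances. `judge_one` is a stand-in for an LLM call
# that returns an error-type label (or None) for one system output.
from collections import Counter
from typing import Optional

def judge_one(source: str, output: str) -> Optional[str]:
    # Placeholder heuristic instead of an LLM judge: flag empty or very short outputs.
    if not output.strip():
        return "empty output"
    if len(output.split()) < 3:
        return "output too short"
    return None

data = [
    ("Summarize: ...", ""),             # empty
    ("Translate: ...", "ok"),           # too short
    ("Answer: ...", "A full answer."),  # no error
]

error_report = Counter()
for src, out in data:
    err = judge_one(src, out)
    if err is not None:
        error_report[err] += 1

print(error_report)  # e.g. Counter({'empty output': 1, 'output too short': 1})
```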
The screenshot is from Google's SentencePiece ( https://t.co/bYe9Lkr2Ug), proposed for machine translation at EMNLP'18. I'd be curious to learn why some approaches to tokenization (e.g., byte-level à la GPT-2) got way more popular than others. Is it mainly due to their
github.com
Unsupervised text tokenizer for Neural Network-based text generation. - google/sentencepiece
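For readers who haven't used SentencePiece, a minimal usage sketch of its Python bindings; the corpus path, vocab size, and model type below are arbitrary placeholders.

```python
# Minimal SentencePiece usage sketch: train a small model on a text file and
# tokenize without relying on whitespace as a hard boundary.
# Requires `pip install sentencepiece`; "corpus.txt" is a placeholder path.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # one sentence per line
    model_prefix="toy_sp",     # writes toy_sp.model / toy_sp.vocab
    vocab_size=1000,
    model_type="bpe",          # "unigram" is the default alternative
)

sp = spm.SentencePieceProcessor(model_file="toy_sp.model")
print(sp.encode("Tokenization without spaces as hard boundaries", out_type=str))
print(sp.encode("Tokenization without spaces as hard boundaries", out_type=int))
```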
Very insightful blogpost! IMO tokenization is a part of NLP pipelines that receives way less attention than it should. As an aside, while reading the summary of SuperBPE, I realized that space-agnostic tokenization and other recent improvements go way back to pre-LLM times:
Today I'm publishing my first blog post: Tokenization from first principles. I built a Byte-level BPE tokenizer with Rust pre-tokenization and achieved encoding speed on par with huggingface tokenizers. I show ideas and algorithms including nuances of implementation, such as
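As a toy companion to the blog post above, here is the core byte-level BPE merge loop only, with none of the blog's Rust pre-tokenization or performance work.

```python
# Toy byte-level BPE: start from raw bytes and greedily merge the most
# frequent adjacent pair, a few times. Real implementations add
# pre-tokenization, caching, and much faster data structures.
from collections import Counter

def get_pair_counts(seqs):
    counts = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts

def merge_pair(seq, pair, new_token):
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

corpus = ["low lower lowest", "new newer newest"]
seqs = [list(text.encode("utf-8")) for text in corpus]  # token ids start as bytes
next_id = 256                                           # first merged-token id

for _ in range(10):                                     # 10 merges for the toy example
    pair, freq = get_pair_counts(seqs).most_common(1)[0]
    if freq < 2:
        break
    seqs = [merge_pair(s, pair, next_id) for s in seqs]
    next_id += 1

print(seqs[0])
```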
Decentralized DL projects built on top of Hivemind ( https://t.co/9x9RmCg2A0):
* node0 by @PluralisHQ
* OpenDiloco by @PrimeIntellect
* rl-swarm by @gensynai
If you know any others, please share them with me! Would love to help more researchers interested in this field
github.com
Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world. - learning-at-home/hivemind
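For anyone who wants to try Hivemind itself, a rough sketch of its collaborative-optimizer pattern adapted from the project's quickstart; exact argument names can differ between hivemind versions, so treat this as a sketch rather than a drop-in script.

```python
# Sketch of Hivemind's collaborative training loop (adapted from its quickstart;
# argument names may differ across hivemind versions). Each participant runs
# this script; the DHT handles peer discovery and state averaging.
import torch
import hivemind

model = torch.nn.Linear(64, 2)
base_opt = torch.optim.SGD(model.parameters(), lr=0.1)

dht = hivemind.DHT(start=True)   # pass initial_peers=[...] to join an existing run
opt = hivemind.Optimizer(
    dht=dht,
    run_id="toy_run",            # peers with the same run_id train together
    optimizer=base_opt,
    batch_size_per_step=32,      # local samples contributed per step
    target_batch_size=4096,      # global batch size that triggers averaging
    use_local_updates=True,
    verbose=True,
)

for _ in range(100):
    x = torch.randn(32, 64)
    loss = model(x).pow(2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
```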
With SWARM, we showed back in 2023 that pipeline-parallel training over the Internet is feasible: https://t.co/0pLvJ05ea3 IMO pure data parallelism isn't enough if we want to train truly big models outside of clusters. To the best of my knowledge, @PluralisHQ are the only ones
We present SWARM, an efficient algorithm for model-parallel training across the Internet (e.g. with volunteers). Key advantages:
💎 Fault-tolerant
⚖️ Self-balancing on slow GPUs/networks
🐌 Works in low-bandwidth setups
📜 https://t.co/wCnf6vDCv4
🖥️ https://t.co/pVe0GDmfrK
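A toy, single-process illustration of the fault-tolerance idea (re-routing a micro-batch to another peer serving the same pipeline stage); this is not the SWARM implementation, just the intuition.

```python
# Toy illustration of fault-tolerant pipelining: each pipeline stage is served
# by several interchangeable peers, and a failed forward pass is simply retried
# on another peer of the same stage. Not the actual SWARM code.
import random
import torch

class FlakyPeer:
    """A stage replica that sometimes 'disconnects' mid-request."""
    def __init__(self, module, failure_prob=0.3):
        self.module, self.failure_prob = module, failure_prob

    def forward(self, x):
        if random.random() < self.failure_prob:
            raise ConnectionError("peer dropped out")
        return self.module(x)

def run_stage(peers, x):
    # Try peers of this stage in random order until one succeeds.
    for peer in random.sample(peers, len(peers)):
        try:
            return peer.forward(x)
        except ConnectionError:
            continue
    raise RuntimeError("all peers of this stage are unavailable")

stages = [
    [FlakyPeer(torch.nn.Linear(16, 16)) for _ in range(3)],  # stage 0 replicas
    [FlakyPeer(torch.nn.Linear(16, 4)) for _ in range(3)],   # stage 1 replicas
]

x = torch.randn(8, 16)
for peers in stages:      # forward pass through the pipeline
    x = run_stage(peers, x)
print(x.shape)            # torch.Size([8, 4])
```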
Fault-tolerant pipeline parallelism (SWARM/Petals-style) with compression, scaled to an open pretraining run! On a personal note, I'm really glad that Hivemind powers a few parts of this project — both for pipelines and for robust data parallelism
Node-0-7.5B is live. It is a permissionless, multi-participant, model-parallel pretraining run over the open internet. Anyone with a 16GB+ GPU can join. Node-0 allows participants to collaboratively train a model far larger than could be done as individuals.
On Saturday we're hosting the ES-FoMo workshop with @tri_dao, @dan_biderman, @simran_s_arora, @m_ryabinin and others. We've got a great slate of papers and invited talks, come join us! (More on the speakers soon) https://t.co/w2nhjqNxPb 2/
ES-FoMo is back for round three at #ICML2025! Join us in Vancouver on Saturday July 19 for a day dedicated to Efficient Systems for Foundation Models: from 💬reasoning models to🖼️scalable multimodality, 🧱efficient architectures, and more! Submissions due May 26! More below 👇
Sadly, I won't be at the conference, but come meet @jackminong who led this project!
TOPLOC poster session tomorrow (Wed) at 4:30 PM, East Hall E-1106. I'll be around through Saturday; if you're into decentralized training & inference, let's chat!
Here's the original thread about TOPLOC that explains the ideas behind it:
Today, we release TOPLOC: A Locality Sensitive Hashing Scheme for Verifiable Inference
- Detects modifications to models, prompts, or precision
- Robust across GPU types, tensor parallel configurations and attention kernels
- Up to 100× faster validation than generation
-
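The actual construction is in the paper linked above; as a simplified illustration of why hashing activations can flag a swapped model or changed precision while tolerating benign numerical noise, here is a toy fingerprint built from top-k activation positions and coarsely rounded values (not TOPLOC itself).

```python
# Simplified illustration (not the actual TOPLOC construction): fingerprint a
# hidden-state tensor by its top-k positions and coarsely rounded values, so
# tiny kernel-level numerical noise usually keeps the fingerprint stable while
# a different model/prompt/precision changes it.
import hashlib
import torch

def fingerprint(hidden: torch.Tensor, k: int = 32, decimals: int = 1) -> str:
    flat = hidden.flatten().float()
    _, indices = torch.topk(flat.abs(), k)
    rounded = torch.round(flat[indices] * 10**decimals) / 10**decimals
    payload = indices.tolist() + rounded.tolist()
    return hashlib.sha256(str(payload).encode()).hexdigest()[:16]

torch.manual_seed(0)
h = torch.randn(4, 256)

print(fingerprint(h))                                # reference fingerprint
print(fingerprint(h + 1e-6 * torch.randn_like(h)))   # tiny noise: usually identical
print(fingerprint(torch.randn(4, 256)))              # different activations: different hash
```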
If you're at ICML and interested in verifiable inference, make sure to stop by our poster! We will present TOPLOC, an efficient activation hashing method that works across a variety of settings, e.g. switching inference setups or even models. July 16, 4:30pm, E-1106
In my experience, getting a paper on decentralized DL accepted to top-level conferences can be quite tough. The motivation is unfamiliar to many reviewers, and standard experimental settings don't account for the problems you aim to solve. Hence, I'm very excited to see
For people not familiar with AI publishing: there are 3 main conferences every year, ICML, ICLR, and NeurIPS. These are technical conferences and the equivalent of journals in other disciplines - they are the main publishing venue for AI. The competition to have papers at these
@gowthami_s @JangLawrenceK @IAmTimNguyen @ishapuri101 Distributed Training in Machine Learning🌍 Join us on July 12th as @Ar_Douillard explores key methods like FSDP, Pipeline & Expert Parallelism, plus emerging approaches like DiLoCo and SWARM—pushing the limits of global, distributed training. Learn more: https://t.co/bmxbv7AwBZ
Very grateful to have an opportunity to meet researchers from @CaMLSys/@flwrlabs and share some current thoughts on decentralized and communication-efficient deep learning. Thanks to @niclane7 for the invitation!
Looking forward to spending the day with @m_ryabinin, one of the leading figures in decentralized AI. Amazing talk for those nearby. Thanks for visiting @CaMLSys, Max!
Thanks a lot to Ferdinand for hosting this conversation! It was a great opportunity to give an overview of all parts of SWARM and discuss the motivation behind them in depth. I hope this video will make decentralized DL more accessible: many ideas in the field are simpler than they seem!
The research paper video review on "Swarm Parallelism", recorded together with the author @m_ryabinin, Distinguished Research Scientist @togethercompute, is now out! Link below 👇 For context, most decentralized training today follows DDP-style approaches requiring full model replication on
@lucasmaes_ There is a lot to dig into; the latest Prime Intellect papers are very up to date in terms of scale / SOTA. To get deep into the field, I suggest reading papers from @m_ryabinin, @Ar_Douillard, and Martin Jaggi. Some papers: https://t.co/wU07ju07du https://t.co/ujjfsfVzXN
arxiv.org
Large language models (LLM) have become a critical component in many applications of machine learning. However, standard approaches to training LLM require a large number of tightly interconnected...
We are introducing Quartet, a fully FP4-native training method for Large Language Models, achieving optimal accuracy-efficiency trade-offs on NVIDIA Blackwell GPUs! Quartet can be used to train billion-scale models in FP4 faster than FP8 or FP16, at matching accuracy. [1/4]
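As background for readers unfamiliar with FP4: a small simulated-quantization sketch that snaps a tensor onto the FP4 (E2M1) value grid with a per-tensor scale; this illustrates the number format only, not Quartet's training method.

```python
# Simulated FP4 (E2M1) quantization for intuition only; Quartet itself is a
# full training recipe on Blackwell hardware, not this snippet.
import torch

# Non-negative representable magnitudes of FP4 E2M1; the format is symmetric.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4(x: torch.Tensor) -> torch.Tensor:
    scale = x.abs().max() / FP4_GRID.max()          # per-tensor scale
    scaled = (x / scale).clamp(-6.0, 6.0)
    # Snap each value to the nearest representable magnitude, keeping the sign.
    dists = (scaled.abs().unsqueeze(-1) - FP4_GRID).abs()
    snapped = FP4_GRID[dists.argmin(dim=-1)] * scaled.sign()
    return snapped * scale

x = torch.randn(4, 4)
xq = fake_quantize_fp4(x)
print(x)
print(xq)
print("mean abs error:", (x - xq).abs().mean().item())
```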
Looking forward to discussing SWARM next Monday, thanks to @FerdinandMom for the invite! Many works about Internet-scale DL target communication savings, but once you want to train large models over random GPUs, other challenges arise. Turns out that pipelining can help here!
Most decentralized training today follows DDP-style approaches requiring full model replication on each node. While practical for those with H100 clusters at their disposal, this remains out of reach for the vast majority of potential contributors. Delving back into the
ES-FoMo is back for round three at #ICML2025! Join us in Vancouver on Saturday July 19 for a day dedicated to Efficient Systems for Foundation Models: from 💬reasoning models to🖼️scalable multimodality, 🧱efficient architectures, and more! Submissions due May 26! More below 👇
@Ar_Douillard There are also a lot of relevant ideas in earlier work on async/distributed RL, e.g. A3C ( https://t.co/1ABeddNUhK) or IMPALA ( https://t.co/oYj3yguHWA). I wonder if some methods or learnings from that era could find novel use for RL+LLMs: certain challenges could be quite similar