Max Ryabinin

@m_ryabinin

Followers: 2K · Following: 301 · Media: 46 · Statuses: 184

Large-scale deep learning & research @togethercompute · Learning@home/Hivemind author (DMoE, DeDLOC, SWARM, Petals) · PhD in decentralized DL, 2023

Joined October 2020
@m_ryabinin
Max Ryabinin
1 year
In our new #ACL2024 paper, we show that LLMs remain sensitive to prompt formats even with improved few-shot techniques. Our findings suggest that careful evaluation needs to take this lack of robustness into account 📜: https://t.co/05fi54Q5MC 🖥️: https://t.co/US73vNQTlM
2
11
69
@Ar_Douillard
Arthur Douillard
1 month
I found a cool tech report on combining DiLoCo with @m_ryabinin's SWARM pipelining with fault tolerance and checked what the author is doing now. I should have guessed: he's at @PrimeIntellect now.
2
4
67
@nadiinchi
Nadia Chirkova
1 month
📣LLM-as-a-qualitative-judge: automating error analysis in natural language generation TLDR: our approach outputs a summary of error types and their counts in an NLP system 📜 Paper: https://t.co/i5ZRe9ttvd 💻 Code to try it on your task: https://t.co/4ahLrSjbnP #NLProc #LLM
1
2
5
@m_ryabinin
Max Ryabinin
2 months
The screenshot is from Google's SentencePiece ( https://t.co/bYe9Lkr2Ug), proposed for machine translation at EMNLP'18. I'd be curious to learn why some approaches to tokenization (e.g., byte-level à la GPT-2) got way more popular than others. Is it mainly due to their
github.com
Unsupervised text tokenizer for Neural Network-based text generation. - google/sentencepiece
0
0
1
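For readers who haven't used the library, here is a minimal usage sketch of SentencePiece's Python bindings; the corpus file, model prefix, and vocabulary size are placeholders, not anything from the tweet:

```python
import sentencepiece as spm

# Train a BPE model directly on raw text (no external pre-tokenization needed);
# "corpus.txt", "spm_demo", and vocab_size=8000 are placeholder values.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm_demo",
    vocab_size=8000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
print(sp.encode("Decentralized deep learning", out_type=str))
# Whitespace is kept as the '▁' meta symbol, so detokenization is lossless:
print(sp.decode(sp.encode("Decentralized deep learning")))
```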
@m_ryabinin
Max Ryabinin
2 months
Very insightful blogpost! IMO tokenization is a part of NLP pipelines that receives way less attention than it should. As an aside, while reading the summary of SuperBPE, I realized that space-agnostic tokenization and other recent improvements go way back to pre-LLM times:
@iamgrigorev
George Grigorev
2 months
Today I'm publishing my first blog post: Tokenization from first principles. I built a Byte-level BPE tokenizer with Rust pre-tokenization and achieved encoding speed on par with huggingface tokenizers. I show ideas and algorithms including nuances of implementation, such as
1
0
3
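As a toy illustration of the byte-level BPE training loop the post describes (a generic sketch, not George's Rust implementation): start from raw UTF-8 bytes and repeatedly merge the most frequent adjacent pair into a new token id.

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Count adjacent token pairs across all byte sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair, new_id):
    """Replace every occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Start from raw UTF-8 bytes (ids 0-255) and learn a handful of merges.
texts = ["low lower lowest", "new newer newest"]
seqs = [list(t.encode("utf-8")) for t in texts]
merges = {}
for new_id in range(256, 256 + 10):
    pair = most_frequent_pair(seqs)
    if pair is None:
        break
    merges[pair] = new_id
    seqs = [merge_pair(s, pair, new_id) for s in seqs]
print(merges)
```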
@m_ryabinin
Max Ryabinin
2 months
Decentralized DL projects built on top of Hivemind ( https://t.co/9x9RmCg2A0): * node0 by @PluralisHQ * OpenDiloco by @PrimeIntellect * rl-swarm by @gensynai If you know any others, please share them with me! Would love to help more researchers interested in this field
github.com
Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world. - learning-at-home/hivemind
2
7
17
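For context, collaborative training on top of Hivemind roughly follows the library's quickstart pattern: wrap a standard PyTorch optimizer in hivemind.Optimizer and let peers discover each other through the DHT. A minimal sketch, where the run_id, batch sizes, and toy model are placeholders:

```python
import torch
import hivemind

model = torch.nn.Linear(512, 512)            # any torch.nn.Module
opt = torch.optim.Adam(model.parameters())

# Join (or start) a DHT; real runs pass initial_peers=["/ip4/.../p2p/..."].
dht = hivemind.DHT(start=True)

# Wrap the local optimizer: peers accumulate gradients towards a shared
# target batch size and average their updates collaboratively.
opt = hivemind.Optimizer(
    dht=dht,
    run_id="demo_run",             # peers with the same run_id train together
    optimizer=opt,
    batch_size_per_step=32,        # samples processed per local step
    target_batch_size=4096,        # global batch size before a collaborative update
    use_local_updates=True,
    verbose=True,
)
```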
@m_ryabinin
Max Ryabinin
2 months
With SWARM, we showed back in 2023 that pipeline-parallel training over the Internet is feasible: https://t.co/0pLvJ05ea3 IMO pure data parallelism isn't enough if we want to train truly big models outside of clusters. To the best of my knowledge, @PluralisHQ are the only ones
@m_ryabinin
Max Ryabinin
3 years
We present SWARM, an efficient algorithm for model-parallel training across the Internet (e.g. with volunteers). Key advantages: 💎 Fault-tolerant ⚖️ Self-balancing on slow GPUs/networks 🐌 Works in low-bandwidth setups 📜 https://t.co/wCnf6vDCv4 🖥️ https://t.co/pVe0GDmfrK
1
3
19
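Purely as a conceptual illustration of the fault-tolerance idea (not the actual SWARM implementation): each pipeline stage is served by several replicas, requests are routed stochastically in proportion to measured throughput, and failures simply trigger rerouting to another replica.

```python
import random

class StageReplica:
    """Hypothetical handle to one peer serving a pipeline stage."""
    def __init__(self, name, throughput):
        self.name, self.throughput = name, throughput

    def forward(self, activations):
        if random.random() < 0.1:                 # peers may drop out at any time
            raise ConnectionError(f"{self.name} disconnected")
        return activations                        # placeholder for the real compute

def route(replicas, activations, max_retries=3):
    """Pick a replica proportionally to throughput; retry elsewhere on failure."""
    for _ in range(max_retries):
        weights = [r.throughput for r in replicas]
        replica = random.choices(replicas, weights=weights)[0]
        try:
            return replica.forward(activations)
        except ConnectionError:
            continue                              # reroute to another replica
    raise RuntimeError("all replicas for this stage failed")

stage1 = [StageReplica("peer_a", 3.0), StageReplica("peer_b", 1.0)]
print(route(stage1, activations=[0.1, 0.2]))
```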
@m_ryabinin
Max Ryabinin
2 months
Fault-tolerant pipeline parallelism (SWARM/Petals-style) with compression, scaled to an open pretraining run! On a personal note, I'm really glad that Hivemind powers a few parts of this project — both for pipelines and for robust data parallelism
@PluralisHQ
Pluralis Research
2 months
Node-0-7.5B is live. It is a permissionless, multi-participant, model-parallel pretraining run over the open internet. Anyone with a 16GB+ GPU can join. Node-0 allows participants to collaboratively train a model far larger than could be done as individuals.
1
16
64
@realDanFu
Dan Fu
4 months
On Saturday we’re hosting the ES-FoMo workshop, with @tri_dao, @dan_biderman, @simran_s_arora, @m_ryabinin and others - we’ve got a great slate of papers and invited talks, come join us! (More on the great slate of speakers soon) https://t.co/w2nhjqNxPb 2/
@ESFoMo
ES-FoMo@ICML2025
6 months
ES-FoMo is back for round three at #ICML2025! Join us in Vancouver on Saturday July 19 for a day dedicated to Efficient Systems for Foundation Models: from 💬reasoning models to🖼️scalable multimodality, 🧱efficient architectures, and more! Submissions due May 26! More below 👇
1
3
15
@m_ryabinin
Max Ryabinin
4 months
Sadly, I won't be at the conference, but come meet @jackminong who led this project!
@jackminong
Jackmin
4 months
Toploc poster session tomorrow (Wed) at 4:30 PM, East Hall E-1106. I'll be around through Saturday; if you're into decentralized training & inference, let's chat!
0
0
4
@m_ryabinin
Max Ryabinin
4 months
Here's the original thread about TOPLOC that explains the ideas behind it:
@PrimeIntellect
Prime Intellect
10 months
Today, we release TOPLOC: A Locality Sensitive Hashing Scheme for Verifiable Inference - Detects modifications to models, prompts, or precision - Robust across GPU types, tensor parallel configurations and attention kernels - Up to 100× faster validation than generation -
1
0
4
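As a simplified stand-in for the idea (not TOPLOC's actual locality-sensitive scheme): the prover commits to a compact fingerprint of intermediate activations during generation, and the verifier recomputes from the same prompt and compares within a numerical tolerance, so benign noise from different GPUs or kernels still passes while a different model or prompt does not.

```python
import torch

def activation_fingerprint(hidden_states: torch.Tensor, k: int = 128):
    """Commit to the positions and values of the k largest-magnitude activations."""
    flat = hidden_states.flatten().float()
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices]

def verify(commitment, recomputed_hidden: torch.Tensor, atol: float = 1e-2) -> bool:
    """Re-read the committed positions in the verifier's recomputation and
    compare within a tolerance to absorb benign numerical differences."""
    indices, committed_values = commitment
    flat = recomputed_hidden.flatten().float()
    return torch.allclose(flat[indices], committed_values, atol=atol)

# Prover commits during generation; verifier recomputes from the same prompt.
h = torch.randn(4, 4096)
commitment = activation_fingerprint(h)
print(verify(commitment, h + 1e-4 * torch.randn_like(h)))   # True: same computation
print(verify(commitment, torch.randn(4, 4096)))             # False: different activations
```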
@m_ryabinin
Max Ryabinin
4 months
If you're at ICML and interested in verifiable inference, make sure to stop by our poster! We will present TOPLOC, an efficient activation hashing method that works across a variety of settings, e.g. switching inference setups or even models. July 16, 4:30pm, E-1106
1
2
11
@m_ryabinin
Max Ryabinin
4 months
From my experience, getting a paper on decentralized DL accepted to top-level conferences can be quite tough. The motivation is not familiar to many reviewers, and standard experiment settings don't account for the problems you aim to solve. Hence, I'm very excited to see
@AlexanderLong
Alexander Long
4 months
For people not familiar with AI publishing; there are 3 main conferences every year. ICML, ICLR and NeurIPS. These are technical conferences and the equivalent of journals in other disciplines - they are the main publishing venue for AI. The competition to have papers at these
2
7
43
@Cohere_Labs
Cohere Labs
5 months
@gowthami_s @JangLawrenceK @IAmTimNguyen @ishapuri101 Distributed Training in Machine Learning🌍 Join us on July 12th as @Ar_Douillard explores key methods like FSDP, Pipeline & Expert Parallelism, plus emerging approaches like DiLoCo and SWARM—pushing the limits of global, distributed training. Learn more: https://t.co/bmxbv7AwBZ
1
6
30
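For readers unfamiliar with DiLoCo, a schematic sketch of the recipe as published (not the reference implementation): each worker runs many communication-free inner optimizer steps, and only the resulting parameter deltas ("pseudo-gradients") are averaged and applied by an outer optimizer. The toy model, loss, and step counts below are placeholders.

```python
import copy
import torch

def diloco_round(workers, global_model, outer_opt, inner_steps=10):
    """One communication round: local training, then an averaged pseudo-gradient."""
    start = [p.detach().clone() for p in global_model.parameters()]

    deltas = None
    for worker_model, inner_opt, get_batch in workers:
        worker_model.load_state_dict(global_model.state_dict())
        for _ in range(inner_steps):                          # communication-free inner loop
            loss = worker_model(get_batch()).pow(2).mean()    # placeholder loss
            inner_opt.zero_grad(); loss.backward(); inner_opt.step()
        # Pseudo-gradient: how far this worker moved from the shared starting point.
        local = [s - p.detach() for s, p in zip(start, worker_model.parameters())]
        deltas = local if deltas is None else [d + l for d, l in zip(deltas, local)]

    # Outer step: treat the averaged delta as a gradient (DiLoCo uses Nesterov SGD here).
    for p, d in zip(global_model.parameters(), deltas):
        p.grad = d / len(workers)
    outer_opt.step(); outer_opt.zero_grad()

# Minimal usage with two toy workers.
global_model = torch.nn.Linear(16, 16)
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7, momentum=0.9, nesterov=True)
workers = []
for _ in range(2):
    m = copy.deepcopy(global_model)
    workers.append((m, torch.optim.AdamW(m.parameters(), lr=1e-3),
                    lambda: torch.randn(8, 16)))
diloco_round(workers, global_model, outer_opt)
```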
@m_ryabinin
Max Ryabinin
5 months
Very grateful to have an opportunity to meet researchers from @CaMLSys/@flwrlabs and share some current thoughts on decentralized and communication-efficient deep learning. Thanks to @niclane7 for the invitation!
@niclane7
nic lane
5 months
Looking forward to spending the day with @m_ryabinin, one of the leading figures in decentralized AI. Amazing talk for those nearby. Thanks for visiting @CaMLSys, Max!
0
0
9
@m_ryabinin
Max Ryabinin
5 months
Thanks a lot to Ferdinand for hosting this conversation! It was a great opportunity to overview all parts of SWARM and discuss the motivation behind them in depth. I hope this video will make decentralized DL more accessible: many ideas in the field are simpler than they seem!
@FerdinandMom
Ferdinand Mom
6 months
The research paper video review of "Swarm Parallelism" with the author @m_ryabinin, Distinguished Research Scientist @togethercompute, is now out! Link below 👇 For context, most decentralized training today follows DDP-style approaches requiring full model replication on
0
5
18
@samsja19
samsja
6 months
@lucasmaes_ There is a lot to dig into; the latest Prime Intellect papers are very up to date in terms of scale/SOTA. To get deep into the field, I suggest reading papers from @m_ryabinin, @Ar_Douillard, and Martin Jaggi. Some papers: https://t.co/wU07ju07du https://t.co/ujjfsfVzXN
arxiv.org
Large language models (LLM) have become a critical component in many applications of machine learning. However, standard approaches to training LLM require a large number of tightly interconnected...
2
5
17
@DAlistarh
Dan Alistarh
6 months
We are introducing Quartet, a fully FP4-native training method for Large Language Models, achieving optimal accuracy-efficiency trade-offs on NVIDIA Blackwell GPUs! Quartet can be used to train billion-scale models in FP4 faster than FP8 or FP16, at matching accuracy. [1/4]
20
78
399
@m_ryabinin
Max Ryabinin
6 months
Looking forward to discussing SWARM next Monday, thanks to @FerdinandMom for the invite! Many works about Internet-scale DL target communication savings, but once you want to train large models over random GPUs, other challenges arise. Turns out that pipelining can help here!
@FerdinandMom
Ferdinand Mom
6 months
Most decentralized training today follows DDP-style approaches requiring full model replication on each node. While practical for those with H100 clusters at their disposal, this remains out of reach for the vast majority of potential contributors. Delving back into the
0
4
24
@ESFoMo
ES-FoMo@ICML2025
6 months
ES-FoMo is back for round three at #ICML2025! Join us in Vancouver on Saturday July 19 for a day dedicated to Efficient Systems for Foundation Models: from 💬reasoning models to🖼️scalable multimodality, 🧱efficient architectures, and more! Submissions due May 26! More below 👇
8
10
18
@m_ryabinin
Max Ryabinin
7 months
@Ar_Douillard There are also a lot of relevant ideas from earlier work in async/distributed RL, e.g. A3C ( https://t.co/1ABeddNUhK) or IMPALA ( https://t.co/oYj3yguHWA). I wonder if some methods or learnings from that era could find novel use for RL+LLMs: certain challenges could be quite similar
1
3
15
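To make the parallel concrete, the asynchronous actor-learner pattern from that era, sketched loosely below (not the exact A3C or IMPALA algorithms): actors generate trajectories with a possibly stale policy, and a learner consumes them from a queue, correcting for staleness.

```python
import queue
import random
import threading

rollouts = queue.Queue()        # (actor_id, policy_version, trajectory) tuples
policy_version = {"v": 0}       # shared counter, updated only by the learner

def actor(actor_id, num_rollouts=20):
    """Generate trajectories with whatever policy version is currently live."""
    for _ in range(num_rollouts):
        version = policy_version["v"]
        trajectory = [random.random() for _ in range(8)]   # placeholder rollout
        rollouts.put((actor_id, version, trajectory))

def learner(num_updates=50):
    """Consume rollouts asynchronously; off-policy corrections (e.g. IMPALA's
    V-trace) would down-weight trajectories from stale policy versions."""
    for _ in range(num_updates):
        actor_id, version, traj = rollouts.get()
        staleness = policy_version["v"] - version
        # ... compute a gradient update here, weighted by a staleness correction ...
        policy_version["v"] += 1

actors = [threading.Thread(target=actor, args=(i,)) for i in range(4)]
for t in actors: t.start()
learner()
for t in actors: t.join()
print("final policy version:", policy_version["v"])
```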