Max Ryabinin
@m_ryabinin
2K Followers · 301 Following · 46 Media · 184 Statuses
Large-scale deep learning & research @togethercompute. Learning@home/Hivemind author (DMoE, DeDLOC, SWARM, Petals). PhD in decentralized DL, 2023.
Joined October 2020
In our new #ACL2024 paper, we show that LLMs remain sensitive to prompt formats even with improved few-shot techniques. Our findings suggest that careful evaluation needs to take this lack of robustness into account 📜: https://t.co/05fi54Q5MC 🖥️: https://t.co/US73vNQTlM
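As a minimal illustration of the formatting sensitivity studied here, the sketch below renders one and the same few-shot example under several equally reasonable formats; the separators and field names are arbitrary choices for illustration, not the paper's templates.

```python
# Toy illustration of prompt-format sensitivity: the same few-shot example
# rendered under different (equally reasonable) formats. Evaluating a model
# on only one of these can over- or under-state its accuracy.
from itertools import product

example = {"text": "The movie was great", "label": "positive"}
query = "The plot made no sense"

separators = [": ", " = ", ":\n"]
field_names = [("Input", "Label"), ("Text", "Sentiment"), ("Review", "Answer")]

for sep, (inp, out) in product(separators, field_names):
    prompt = (
        f"{inp}{sep}{example['text']}\n{out}{sep}{example['label']}\n\n"
        f"{inp}{sep}{query}\n{out}{sep}"
    )
    print(repr(prompt))
    print("-" * 40)
```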
I found a cool tech report on combining DiLoCo with @m_ryabinin's fault-tolerant SWARM pipelining and checked what the author is doing now. I should have guessed: he's at @PrimeIntellect now.
📣 LLM-as-a-qualitative-judge: automating error analysis in natural language generation
TL;DR: our approach outputs a summary of error types and their counts in an NLP system
📜 Paper: https://t.co/i5ZRe9ttvd
💻 Code to try it on your task: https://t.co/4ahLrSjbnP
#NLProc #LLM
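As a rough sketch of the shape of that output (per-error-type counts aggregated over instances) using a placeholder judging function, not the paper's actual pipeline:

```python
# Rough sketch of the *shape* of the output described above: a per-error-type
# count aggregated over instances. `judge_one` is a stand-in for an LLM call
# that returns an error-type label (or None) for one system output.
from collections import Counter
from typing import Optional

def judge_one(source: str, output: str) -> Optional[str]:
    # Placeholder heuristic instead of an LLM judge: flag empty or very short outputs.
    if not output.strip():
        return "empty output"
    if len(output.split()) < 3:
        return "output too short"
    return None

data = [
    ("Summarize: ...", ""),             # empty
    ("Translate: ...", "ok"),           # too short
    ("Answer: ...", "A full answer."),  # no error
]

error_report = Counter()
for src, out in data:
    err = judge_one(src, out)
    if err is not None:
        error_report[err] += 1

print(error_report)  # e.g. Counter({'empty output': 1, 'output too short': 1})
```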
The screenshot is from Google's SentencePiece ( https://t.co/bYe9Lkr2Ug), proposed for machine translation at EMNLP'18. I'd be curious to learn why some approaches to tokenization (e.g., byte-level à la GPT-2) got way more popular than others. Is it mainly due to their
github.com
Unsupervised text tokenizer for Neural Network-based text generation. - google/sentencepiece
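For readers who haven't used SentencePiece, a minimal usage sketch of its Python bindings; the corpus path, vocab size, and model type below are arbitrary placeholders.

```python
# Minimal SentencePiece usage sketch: train a small model on a text file and
# tokenize without relying on whitespace as a hard boundary.
# Requires `pip install sentencepiece`; "corpus.txt" is a placeholder path.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # one sentence per line
    model_prefix="toy_sp",     # writes toy_sp.model / toy_sp.vocab
    vocab_size=1000,
    model_type="bpe",          # "unigram" is the default alternative
)

sp = spm.SentencePieceProcessor(model_file="toy_sp.model")
print(sp.encode("Tokenization without spaces as hard boundaries", out_type=str))
print(sp.encode("Tokenization without spaces as hard boundaries", out_type=int))
```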
Very insightful blogpost! IMO tokenization is a part of NLP pipelines that receives way less attention than it should. As an aside, while reading the summary of SuperBPE, I realized that space-agnostic tokenization and other recent improvements go way back to pre-LLM times:
Today I'm publishing my first blog post: Tokenization from first principles. I built a Byte-level BPE tokenizer with Rust pre-tokenization and achieved encoding speed on par with huggingface tokenizers. I show ideas and algorithms including nuances of implementation, such as
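As a toy companion to the blog post above, here is the core byte-level BPE merge loop only, with none of the blog's Rust pre-tokenization or performance work.

```python
# Toy byte-level BPE: start from raw bytes and greedily merge the most
# frequent adjacent pair, a few times. Real implementations add
# pre-tokenization, caching, and much faster data structures.
from collections import Counter

def get_pair_counts(seqs):
    counts = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts

def merge_pair(seq, pair, new_token):
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

corpus = ["low lower lowest", "new newer newest"]
seqs = [list(text.encode("utf-8")) for text in corpus]  # token ids start as bytes
next_id = 256                                           # first merged-token id

for _ in range(10):                                     # 10 merges for the toy example
    pair, freq = get_pair_counts(seqs).most_common(1)[0]
    if freq < 2:
        break
    seqs = [merge_pair(s, pair, next_id) for s in seqs]
    next_id += 1

print(seqs[0])
```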
Decentralized DL projects built on top of Hivemind ( https://t.co/9x9RmCg2A0):
* node0 by @PluralisHQ
* OpenDiloco by @PrimeIntellect
* rl-swarm by @gensynai
If you know any others, please share them with me! Would love to help more researchers interested in this field
github.com
Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world. - learning-at-home/hivemind
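For anyone who wants to try Hivemind itself, a rough sketch of its collaborative-optimizer pattern adapted from the project's quickstart; exact argument names can differ between hivemind versions, so treat this as a sketch rather than a drop-in script.

```python
# Sketch of Hivemind's collaborative training loop (adapted from its quickstart;
# argument names may differ across hivemind versions). Each participant runs
# this script; the DHT handles peer discovery and state averaging.
import torch
import hivemind

model = torch.nn.Linear(64, 2)
base_opt = torch.optim.SGD(model.parameters(), lr=0.1)

dht = hivemind.DHT(start=True)   # pass initial_peers=[...] to join an existing run
opt = hivemind.Optimizer(
    dht=dht,
    run_id="toy_run",            # peers with the same run_id train together
    optimizer=base_opt,
    batch_size_per_step=32,      # local samples contributed per step
    target_batch_size=4096,      # global batch size that triggers averaging
    use_local_updates=True,
    verbose=True,
)

for _ in range(100):
    x = torch.randn(32, 64)
    loss = model(x).pow(2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
```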
With SWARM, we showed back in 2023 that pipeline-parallel training over the Internet is feasible: https://t.co/0pLvJ05ea3 IMO pure data parallelism isn't enough if we want to train truly big models outside of clusters. To the best of my knowledge, @PluralisHQ are the only ones
We present SWARM, an efficient algorithm for model-parallel training across the Internet (e.g. with volunteers). Key advantages:
💎 Fault-tolerant
⚖️ Self-balancing on slow GPUs/networks
🐌 Works in low-bandwidth setups
📜 https://t.co/wCnf6vDCv4
🖥️ https://t.co/pVe0GDmfrK
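A toy, single-process illustration of the fault-tolerance idea (re-routing a micro-batch to another peer serving the same pipeline stage); this is not the SWARM implementation, just the intuition.

```python
# Toy illustration of fault-tolerant pipelining: each pipeline stage is served
# by several interchangeable peers, and a failed forward pass is simply retried
# on another peer of the same stage. Not the actual SWARM code.
import random
import torch

class FlakyPeer:
    """A stage replica that sometimes 'disconnects' mid-request."""
    def __init__(self, module, failure_prob=0.3):
        self.module, self.failure_prob = module, failure_prob

    def forward(self, x):
        if random.random() < self.failure_prob:
            raise ConnectionError("peer dropped out")
        return self.module(x)

def run_stage(peers, x):
    # Try peers of this stage in random order until one succeeds.
    for peer in random.sample(peers, len(peers)):
        try:
            return peer.forward(x)
        except ConnectionError:
            continue
    raise RuntimeError("all peers of this stage are unavailable")

stages = [
    [FlakyPeer(torch.nn.Linear(16, 16)) for _ in range(3)],  # stage 0 replicas
    [FlakyPeer(torch.nn.Linear(16, 4)) for _ in range(3)],   # stage 1 replicas
]

x = torch.randn(8, 16)
for peers in stages:      # forward pass through the pipeline
    x = run_stage(peers, x)
print(x.shape)            # torch.Size([8, 4])
```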
Fault-tolerant pipeline parallelism (SWARM/Petals-style) with compression, scaled to an open pretraining run! On a personal note, I'm really glad that Hivemind powers a few parts of this project — both for pipelines and for robust data parallelism
Node-0-7.5B is live. It is a permissionless, multi-participant, model-parallel pretraining run over the open internet. Anyone with a 16GB+ GPU can join. Node-0 allows participants to collaboratively train a model far larger than could be done as individuals.
On Saturday we're hosting the ES-FoMo workshop with @tri_dao, @dan_biderman, @simran_s_arora, @m_ryabinin and others. We've got a great slate of papers and invited talks, come join us! (More on the speakers soon) https://t.co/w2nhjqNxPb 2/
ES-FoMo is back for round three at #ICML2025! Join us in Vancouver on Saturday July 19 for a day dedicated to Efficient Systems for Foundation Models: from 💬reasoning models to🖼️scalable multimodality, 🧱efficient architectures, and more! Submissions due May 26! More below 👇
Sadly, I won't be at the conference, but come meet @jackminong who led this project!
TOPLOC poster session tomorrow (Wed) at 4:30 PM, East Hall E-1106. I'll be around through Saturday; if you're into decentralized training & inference, let's chat!
Here's the original thread about TOPLOC that explains the ideas behind it:
Today, we release TOPLOC: A Locality Sensitive Hashing Scheme for Verifiable Inference
- Detects modifications to models, prompts, or precision
- Robust across GPU types, tensor parallel configurations and attention kernels
- Up to 100× faster validation than generation
-
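The actual construction is in the paper linked above; as a simplified illustration of why hashing activations can flag a swapped model or changed precision while tolerating benign numerical noise, here is a toy fingerprint built from top-k activation positions and coarsely rounded values (not TOPLOC itself).

```python
# Simplified illustration (not the actual TOPLOC construction): fingerprint a
# hidden-state tensor by its top-k positions and coarsely rounded values, so
# tiny kernel-level numerical noise usually keeps the fingerprint stable while
# a different model/prompt/precision changes it.
import hashlib
import torch

def fingerprint(hidden: torch.Tensor, k: int = 32, decimals: int = 1) -> str:
    flat = hidden.flatten().float()
    _, indices = torch.topk(flat.abs(), k)
    rounded = torch.round(flat[indices] * 10**decimals) / 10**decimals
    payload = indices.tolist() + rounded.tolist()
    return hashlib.sha256(str(payload).encode()).hexdigest()[:16]

torch.manual_seed(0)
h = torch.randn(4, 256)

print(fingerprint(h))                                # reference fingerprint
print(fingerprint(h + 1e-6 * torch.randn_like(h)))   # tiny noise: usually identical
print(fingerprint(torch.randn(4, 256)))              # different activations: different hash
```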
If you're at ICML and interested in verifiable inference, make sure to stop by our poster! We will present TOPLOC, an efficient activation hashing method that works across a variety of settings, e.g. switching inference setups or even models. July 16, 4:30pm, E-1106
In my experience, getting a paper on decentralized DL accepted to top-level conferences can be quite tough. The motivation is unfamiliar to many reviewers, and standard experimental settings don't account for the problems you aim to solve. Hence, I'm very excited to see
For people not familiar with AI publishing: there are 3 main conferences every year, ICML, ICLR, and NeurIPS. These are technical conferences and the equivalent of journals in other disciplines - they are the main publishing venue for AI. The competition to have papers at these
@gowthami_s @JangLawrenceK @IAmTimNguyen @ishapuri101 Distributed Training in Machine Learning🌍 Join us on July 12th as @Ar_Douillard explores key methods like FSDP, Pipeline & Expert Parallelism, plus emerging approaches like DiLoCo and SWARM—pushing the limits of global, distributed training. Learn more: https://t.co/bmxbv7AwBZ
Very grateful to have an opportunity to meet researchers from @CaMLSys/@flwrlabs and share some current thoughts on decentralized and communication-efficient deep learning. Thanks to @niclane7 for the invitation!
Looking forward to spending the day with @m_ryabinin, one of the leading figures in decentralized AI. Amazing talk for those nearby. Thanks for visiting @CaMLSys, Max!
Thanks a lot to Ferdinand for hosting this conversation! It was a great opportunity to give an overview of all parts of SWARM and discuss the motivation behind them in depth. I hope this video will make decentralized DL more accessible: many ideas in the field are simpler than they seem!
The research paper video review on "Swarm Parallelism", recorded together with the author @m_ryabinin, Distinguished Research Scientist @togethercompute, is now out! Link below 👇 For context, most decentralized training today follows DDP-style approaches requiring full model replication on
@lucasmaes_ There is a lot to dig into; the latest Prime Intellect papers are very up to date in terms of scale / SOTA. To get deep into the field, I suggest reading papers from @m_ryabinin, @Ar_Douillard, and Martin Jaggi. Some papers: https://t.co/wU07ju07du https://t.co/ujjfsfVzXN
arxiv.org
Large language models (LLM) have become a critical component in many applications of machine learning. However, standard approaches to training LLM require a large number of tightly interconnected...
We are introducing Quartet, a fully FP4-native training method for Large Language Models, achieving optimal accuracy-efficiency trade-offs on NVIDIA Blackwell GPUs! Quartet can be used to train billion-scale models in FP4 faster than FP8 or FP16, at matching accuracy. [1/4]
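As background for readers unfamiliar with FP4: a small simulated-quantization sketch that snaps a tensor onto the FP4 (E2M1) value grid with a per-tensor scale; this illustrates the number format only, not Quartet's training method.

```python
# Simulated FP4 (E2M1) quantization for intuition only; Quartet itself is a
# full training recipe on Blackwell hardware, not this snippet.
import torch

# Non-negative representable magnitudes of FP4 E2M1; the format is symmetric.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4(x: torch.Tensor) -> torch.Tensor:
    scale = x.abs().max() / FP4_GRID.max()          # per-tensor scale
    scaled = (x / scale).clamp(-6.0, 6.0)
    # Snap each value to the nearest representable magnitude, keeping the sign.
    dists = (scaled.abs().unsqueeze(-1) - FP4_GRID).abs()
    snapped = FP4_GRID[dists.argmin(dim=-1)] * scaled.sign()
    return snapped * scale

x = torch.randn(4, 4)
xq = fake_quantize_fp4(x)
print(x)
print(xq)
print("mean abs error:", (x - xq).abs().mean().item())
```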
Looking forward to discussing SWARM next Monday, thanks to @FerdinandMom for the invite! Many works about Internet-scale DL target communication savings, but once you want to train large models over random GPUs, other challenges arise. Turns out that pipelining can help here!
Most decentralized training today follows DDP-style approaches requiring full model replication on each node. While practical for those with H100 clusters at their disposal, this remains out of reach for the vast majority of potential contributors. Delving back into the
ES-FoMo is back for round three at #ICML2025! Join us in Vancouver on Saturday July 19 for a day dedicated to Efficient Systems for Foundation Models: from 💬reasoning models to🖼️scalable multimodality, 🧱efficient architectures, and more! Submissions due May 26! More below 👇
@Ar_Douillard There are also a lot of relevant ideas in earlier work on async/distributed RL, e.g. A3C ( https://t.co/1ABeddNUhK) or IMPALA ( https://t.co/oYj3yguHWA). I wonder if some methods or learnings from that era could find novel use for RL+LLMs: certain challenges could be quite similar