Nathan Barry
@nathanbarrydev
Followers: 2K · Following: 27K · Media: 54 · Statuses: 335
Man in the Arena Allocator. Prev @Apple, CS + Math @UTAustin, @zfellows
Austin, TX
Joined June 2020
Playing around with training a tiny 11M parameter character-level text diffusion model! It's a WIP but the code is currently a heavily modified nanochat gpt implementation (to change from autoregressive decoding to diffusion) and trained on the Tiny Shakespeare dataset. The
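Roughly, the training objective for a masked text-diffusion model like this looks like the sketch below (hypothetical names, not the actual repo code): sample a masking rate per sequence, mask that fraction of characters, and train a bidirectional transformer to predict only the masked positions.

```python
# Minimal sketch of one masked-diffusion training step on character ids.
# Hypothetical names; assumes `model` is a bidirectional (non-causal)
# transformer and MASK_ID is a reserved character id.
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed reserved id for the [MASK] character

def diffusion_loss(model, tokens):
    """tokens: (batch, seq_len) character ids, e.g. from Tiny Shakespeare."""
    b, t = tokens.shape
    # Sample a masking rate per sequence -- the "noise level" of this step.
    rate = torch.rand(b, 1, device=tokens.device)
    mask = torch.rand(b, t, device=tokens.device) < rate
    noisy = tokens.masked_fill(mask, MASK_ID)
    logits = model(noisy)                          # (b, t, vocab)
    # Cross-entropy only on the positions that were masked out.
    return F.cross_entropy(logits[mask], tokens[mask])
```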
Added context to my tiny diffusion model to enable sequential generation of longer outputs! Currently the context is a quarter of the sequence length (seq_len=256, context_len=64). I have a theory that the less semantic-value-per-token, the worse the “curse of parallel decoding”
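A sketch of what context-conditioned, block-wise generation could look like (the names and the confidence-based unmasking schedule are assumptions, not the actual implementation): the last context_len generated characters stay fixed while the rest of the window is denoised in parallel.

```python
# Illustrative sketch of block-wise generation with a fixed context prefix.
import torch

SEQ_LEN, CONTEXT_LEN, MASK_ID = 256, 64, 0

@torch.no_grad()
def generate_long(model, prompt, total_len, steps=32):
    out = prompt.clone()                                  # (1, n) running output
    while out.size(1) < total_len:
        ctx = out[:, -CONTEXT_LEN:]
        window = torch.full((1, SEQ_LEN), MASK_ID, device=out.device)
        window[:, :ctx.size(1)] = ctx                     # context stays fixed
        for s in range(steps):                            # iterated parallel denoising
            logits = model(window)
            conf, pred = logits.softmax(-1).max(-1)
            masked = (window == MASK_ID)[0].nonzero(as_tuple=True)[0]
            if masked.numel() == 0:
                break
            # Unmask the most confident chunk of still-masked positions.
            k = max(1, masked.numel() // (steps - s))
            top = masked[conf[0, masked].topk(min(k, masked.numel())).indices]
            window[0, top] = pred[0, top]
        out = torch.cat([out, window[:, ctx.size(1):]], dim=1)
    return out[:, :total_len]
```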
Nice, short post illustrating how simple text (discrete) diffusion can be. Diffusion (i.e. parallel, iterated denoising, top) is the pervasive generative paradigm in image/video, but autoregression (i.e. go left to right, bottom) is the dominant paradigm in text. For audio I've
(5/n) The rest of the post covers how I fine-tuned RoBERTa to do text diffusion! Read the full post here: https://t.co/eYZagfxS35
nathan.rs
A while back, Google DeepMind unveiled Gemini Diffusion, an experimental language model that generates text using diffusion. Unlike traditional GPT-style models that generate one word at a time,...
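A rough paraphrase of the fine-tuning idea (see the linked post for the real code): keep RoBERTa's MLM head, but sample the masking rate uniformly instead of fixing it at ~15%.

```python
# Rough paraphrase, not the post's actual code: fine-tune RoBERTa's MLM head
# with a masking rate sampled uniformly in (0, 1).
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

tok = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

def diffusion_finetune_loss(batch_ids):
    """batch_ids: (batch, seq_len) token ids (special-token handling omitted)."""
    rate = torch.rand(batch_ids.size(0), 1, device=batch_ids.device)
    mask = torch.rand(batch_ids.shape, device=batch_ids.device) < rate
    labels = batch_ids.masked_fill(~mask, -100)            # ignore unmasked positions
    inputs = batch_ids.masked_fill(mask, tok.mask_token_id)
    return model(input_ids=inputs, labels=labels).loss
```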
(4/n) We can see that BERT’s masked language modeling objective is the same as text diffusion, but just for a subset of masking rates. By having variable masking rates (from 0 to 1), we can transform BERT’s masked language modeling objective into a full generative procedure.
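In code, the gap between the two objectives is essentially one line (illustrative):

```python
# Illustrative: the only change to the noising step is how the masking rate
# is chosen. BERT fixes it (~15%); the diffusion objective samples it.
import torch

def bert_rate(batch_size):
    return torch.full((batch_size, 1), 0.15)   # one fixed noise level

def diffusion_rate(batch_size):
    return torch.rand(batch_size, 1)           # every noise level in (0, 1)
```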
(3/n) Applying this idea to language means we need a way to add noise to text and then remove it in stages. The simplest way to do this is a masking‐based noise process: For the forward process, you initially have uncorrupted text. At each iteration, you randomly replace a
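A tiny sketch of that forward (noising) process; the step count and per-step masking probability are assumed values:

```python
# Sketch of the forward masking process: at each of T iterations, every
# surviving token is independently replaced by [MASK] with probability p,
# so the text ends up (nearly) fully masked.
import torch

def forward_masking(tokens, T=10, mask_id=0, p=0.3):
    noisy = tokens.clone()
    trajectory = [noisy.clone()]
    for _ in range(T):
        replace = torch.rand(noisy.shape, device=noisy.device) < p
        noisy = noisy.masked_fill(replace, mask_id)
        trajectory.append(noisy.clone())
    return trajectory   # progressively noisier copies of the input
```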
(2/n) Diffusion models were first popularized in image generation, where the forward process gradually adds Gaussian noise to an image and a neural network is trained to iteratively remove it (the reverse process).
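For comparison, the standard DDPM-style forward step for images is a single noise blend:

```python
# Standard DDPM-style forward step: blend the clean image x0 with Gaussian
# noise according to the cumulative noise schedule value alpha_bar at step t.
import torch

def q_sample(x0, alpha_bar_t):
    """x0: image tensor; alpha_bar_t: scalar tensor in (0, 1)."""
    noise = torch.randn_like(x0)
    return alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * noise
```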
BERT is just a Single Text Diffusion Step! (1/n) When I first read about language diffusion models, I was surprised to find that their training objective was just a generalization of masked language modeling (MLM), something we’ve been doing since BERT from 2018. The first
(6/n) I have a bunch of thoughts and experiments to test out. From this ablation study (and from what I've heard from other researchers), it seems that differing inner step counts don't have much of a negative impact. This sounds too good to be true, but it would be a massive free lunch
(5/n) In the version where the number of inner steps was scaled to match worker speed, I still would have imagined some impact on convergence, for a different set of reasons. Because outer-gradients (really parameter differences) behave similarly to normal gradients, let's think
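For reference, the outer-gradient in DiLoCo-style methods is just a parameter difference that the server then treats like a gradient (sketch with assumed names; the real outer optimizer is typically Nesterov momentum, not plain SGD):

```python
# Sketch (assumed names): the "outer-gradient" a worker sends is the
# difference between the parameters it started from and the parameters it
# reached after its inner steps; the server applies it like a gradient.
import torch

def outer_gradient(start_params, end_params):
    return [p0 - p1 for p0, p1 in zip(start_params, end_params)]

def server_apply(server_params, outer_grad, outer_lr=0.7):
    with torch.no_grad():
        for p, g in zip(server_params, outer_grad):
            p -= outer_lr * g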
(4/n) Earlier in the paper, they saw that the inherent staleness alone, which comes from applying individual worker updates sequentially instead of averaging them and applying them once, led to "considerable performance drops." So I would have imagined that significantly
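Purely illustrative, the two application orders being compared:

```python
# Synchronous DiLoCo averages all workers' outer-gradients and applies them
# once; the async variant applies each one as it arrives, so later arrivals
# were computed against parameters that have already moved (staleness).
import torch

def apply_averaged(params, outer_grads, lr=0.7):
    return params - lr * torch.stack(outer_grads).mean(0)

def apply_sequentially(params, outer_grads, lr=0.7):
    for g in outer_grads:      # each g was computed before the earlier applies
        params = params - lr * g
    return params
```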
(3/n) Their results showed that perplexity wasn't really affected by the level of heterogeneity. I found this surprising. In the naive version, a worker that is twice as slow will apply outer-gradients that are twice as stale, since each worker takes the same number of inner steps.
(2/n) The Async Local-SGD (aka Async DiLoCo) paper looked at how heterogeneous worker speeds affected perplexity. In the naive implementation, each worker had the same number of inner steps. In the improved version, the number of inner steps was scaled relative to worker speed. Results below
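The improved version's fix amounts to something like this (illustrative; the speeds and base step count are made up):

```python
# Illustrative: give each worker a number of inner steps proportional to its
# measured speed, so all workers finish an outer round at roughly the same time.
def scaled_inner_steps(worker_speeds, base_steps=100):
    fastest = max(worker_speeds)
    return [max(1, round(base_steps * s / fastest)) for s in worker_speeds]

# e.g. a worker running at half speed does half the inner steps per round:
print(scaled_inner_steps([1.0, 0.5, 0.25]))   # -> [100, 50, 25]
```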
(1/n) A problem worth solving is being able to train on heterogeneous hardware. Even within the same generation, NVIDIA B300 GPUs are 50% faster than B200s. Companies with many clusters (Meta, Google, etc.) would ideally be able to train a model across their clusters
Their open-source simulator currently supports DiLoCo, Async Local-SGD, and HALoS. It would not be hard to add Overlap Local-SGD or One-step-delay/Eager Update DiLoCo (although Streaming DiLoCo would require a major rewrite). I'll be building off it to run my experiments (9/n)
When an LPS communicates with the GPS, because of the lower bandwidth and increased latency, the LPS continues to apply updates from its workers. When it receives the updated parameters from the GPS, it merges them with its updated local parameters instead of replacing them. (8/n)
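A sketch of one way that merge could work (assumed form; the paper's exact rule may differ): keep the local progress made since the last sync and add it onto the fresh global copy instead of discarding it.

```python
# Assumed sketch of the LPS merge; not the paper's exact rule. When fresh
# global parameters arrive, preserve the local updates applied since the last
# sync by adding that delta onto the global copy.
import torch

def merge_global(local_params, params_at_last_sync, global_params):
    merged = []
    with torch.no_grad():
        for local, snap, glob in zip(local_params, params_at_last_sync, global_params):
            local_delta = local - snap        # worker updates applied since the sync
            merged.append(glob + local_delta) # fold local progress into the global copy
    return merged
```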
One way to think about HALoS is that we are running multiple instances of Async Local-SGD, each having multiple workers within the same region. We treat each LPS as a normal Async Local-SGD worker and have another parameter server (the GPS) which they send updates to. (7/n)
HALoS doesn’t directly address the staleness issue. Instead, it focuses on minimizing computation idle time. HALoS introduces Local Parameter Servers (LPS) within each region and a global parameter server (GPS) which merges updates across regions. (6/n)