Nathan Barry

@nathanbarrydev

Followers
2K
Following
27K
Media
54
Statuses
335

Man in the Arena Allocator. Prev @Apple, CS + Math @UTAustin, @zfellows

Austin, TX
Joined June 2020
@nathanbarrydev
Nathan Barry
5 days
Playing around with training a tiny 11M parameter character-level text diffusion model! It's a WIP but the code is currently a heavily modified nanochat gpt implementation (to change from autoregressive decoding to diffusion) and trained on the Tiny Shakespeare dataset. The
55
174
2K
@nathanbarrydev
Nathan Barry
3 days
Added context to my tiny diffusion model to enable sequential generation of longer outputs! Currently the context is a quarter of the sequence length (seq_len=256, context_len=64). I have a theory that the less semantic-value-per-token, the worse the “curse of parallel decoding”
15
50
521
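One way the context mechanism described above might be wired up, sketched with assumed names and an assumed contract (that each model call echoes its conditioning context at the start of its seq_len-long output); none of this is the author's actual code:

```python
def generate_long(generate_block, prompt, total_len, seq_len=256, context_len=64):
    # Sequential generation with context: each diffusion call produces
    # a seq_len window conditioned on the last context_len tokens, and
    # we keep only the newly generated part of each window.
    out = list(prompt)
    while len(out) < total_len:
        ctx = out[-context_len:]
        block = generate_block(ctx, seq_len)   # stand-in for a model call
        out.extend(block[len(ctx):])           # drop the echoed context
    return out[:total_len]

# Toy stand-in that echoes its context then pads with 'x', so the sketch runs:
fake_model = lambda ctx, n: list(ctx) + ["x"] * (n - len(ctx))
sample = generate_long(fake_model, list("ab"), total_len=10, seq_len=8, context_len=4)
```

With seq_len=256 and context_len=64, each call contributes 192 new tokens while staying conditioned on the previous window's tail.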
@karpathy
Andrej Karpathy
7 days
Nice, short post illustrating how simple text (discrete) diffusion can be. Diffusion (i.e. parallel, iterated denoising, top) is the pervasive generative paradigm in image/video, but autoregression (i.e. go left to right, bottom) is the dominant paradigm in text. For audio I've
@nathanbarrydev
Nathan Barry
7 days
BERT is just a Single Text Diffusion Step! (1/n) When I first read about language diffusion models, I was surprised to find that their training objective was just a generalization of masked language modeling (MLM), something we’ve been doing since BERT from 2018. The first
262
576
5K
@nathanbarrydev
Nathan Barry
7 days
(4/n) We can see that BERT’s masked language modeling objective is the same as text diffusion, but just for a subset of masking rates. By having variable masking rates (from 0 to 1), we can transform BERT’s masked language modeling objective into a full generative procedure.
2
3
51
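That full generative procedure can be sketched as iterated parallel denoising, with an oracle stand-in for the trained network; the schedule and function names here are illustrative assumptions, not the thread's code:

```python
import random

MASK = "[MASK]"

def diffusion_decode(length, denoise, steps, rng):
    # Reverse-process sketch: start fully masked, then repeatedly let
    # the model predict every masked position and commit a growing
    # fraction of those predictions (iterated parallel denoising).
    seq = [MASK] * length
    for step in range(1, steps + 1):
        preds = denoise(seq)                  # model's guess for every position
        target_done = length * step // steps  # tokens finalized after this step
        masked = [i for i, t in enumerate(seq) if t == MASK]
        rng.shuffle(masked)
        n_new = target_done - (length - len(masked))
        for i in masked[:n_new]:
            seq[i] = preds[i]
    return seq

# Oracle stand-in for the denoiser, so the sketch runs end to end:
target = list("hello world")
out = diffusion_decode(len(target), lambda seq: target, steps=4, rng=random.Random(0))
```

Note that with `steps=1` this is exactly one BERT-style fill-in of a fully masked sequence, which is the thread's point.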
@nathanbarrydev
Nathan Barry
7 days
(3/n) Applying this idea to language means we need a way to add noise to text and then remove it in stages. The simplest way to do this is a masking-based noise process: For the forward process, you initially have uncorrupted text. At each iteration, you randomly replace a
1
3
59
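The forward (noising) process described above fits in a few lines; the `mask_tokens` helper and character tokens here are assumptions for illustration, not the model's actual code:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rate, rng):
    # Forward noise process for masked text diffusion: each token is
    # independently replaced by the mask symbol with probability `rate`.
    return [MASK if rng.random() < rate else tok for tok in tokens]

# Character-level example (matching the Tiny Shakespeare setting):
rng = random.Random(0)
chars = list("to be or not to be")
noised = mask_tokens(chars, 0.5, rng)   # roughly half the characters masked
```

Sweeping `rate` from 0 to 1 covers every corruption level, whereas BERT trains at a single fixed rate.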
@nathanbarrydev
Nathan Barry
7 days
(2/n) Diffusion models were first popularized in image generation. In image generation, diffusion models gradually add Gaussian noise to an image (forward process) and then train a neural network to iteratively denoise it (reverse process)
1
2
63
@nathanbarrydev
Nathan Barry
8 days
(6/n) I have a bunch of thoughts and experiments to test out. From this ablation study (and from what I've heard from other researchers), it seems that differing inner step counts don't have much of a negative impact. This sounds too good to be true, but it would be a massive free lunch
0
0
1
@nathanbarrydev
Nathan Barry
8 days
(5/n) Even in the version where the inner step count was scaled to match worker speed, I would have still imagined some impact on convergence, for a different set of reasons. Because outer-gradients (really parameter differences) behave similarly to normal gradients, let's think
1
0
1
@nathanbarrydev
Nathan Barry
8 days
(4/n) Earlier in the paper, they saw that the inherent staleness alone, which comes from applying individual worker updates sequentially instead of averaging them and applying them once, led to "considerable performance drops." So I would have imagined that significantly
1
0
1
@nathanbarrydev
Nathan Barry
8 days
(3/n) Their results showed that perplexity wasn't really affected by the level of heterogeneity. I found this surprising. In the naive version, a worker that is twice as slow will apply outer-gradients that are twice as stale, since each worker takes the same number of inner steps.
1
0
1
@nathanbarrydev
Nathan Barry
8 days
(2/n) The Async Local-SGD paper (aka Async DiLoCo) looked at how heterogeneous worker speeds affected perplexity. In the naive implementation, each worker took the same number of inner steps. In the improved version, the number of inner steps was scaled relative to worker speed. Results below
1
0
0
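The improved version's fix can be sketched as follows, assuming speeds are given as relative throughputs; `scaled_inner_steps` is a hypothetical helper, not from the paper's code:

```python
def scaled_inner_steps(base_steps, worker_speed, fastest_speed):
    # Improved Async Local-SGD version: scale a worker's inner step
    # count by its relative speed, so slow and fast workers deliver
    # outer updates at roughly the same wall-clock rate (and hence
    # with similar staleness).
    return max(1, round(base_steps * worker_speed / fastest_speed))

# A B300 is ~50% faster than a B200, i.e. the B200 runs at 2/3 speed:
b300_steps = scaled_inner_steps(60, 3, 3)  # fastest worker keeps all 60 steps
b200_steps = scaled_inner_steps(60, 2, 3)  # slower worker takes 40
```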
@nathanbarrydev
Nathan Barry
8 days
(1/n) A desirable problem to solve is being able to train on heterogeneous hardware. Even within the same generation, NVIDIA B300 GPUs are 50% faster than B200s. Companies with many clusters (Meta, Google, etc.) would ideally be able to train a model across their clusters
1
0
6
@nathanbarrydev
Nathan Barry
10 days
Their open-source simulator currently supports DiLoCo, Async Local-SGD, and HALoS. It would not be hard to add Overlap Local-SGD or One-step-delay/Eager Update DiLoCo (although Streaming DiLoCo would require a major rewrite). I'll be building off it to run my experiments (9/n)
1
0
3
@nathanbarrydev
Nathan Barry
10 days
When an LPS communicates with the GPS, the lower bandwidth and increased latency mean the round trip takes a while, so the LPS continues to apply updates from its workers in the meantime. When it receives the updated parameters from the GPS, it merges them with its updated local parameters instead of replacing them. (8/n)
1
0
0
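A minimal sketch of that merge-instead-of-replace step; `merge_global` and the mixing weight `alpha` are assumptions for illustration, not values from the paper:

```python
def merge_global(global_new, local_now, local_at_send, alpha=0.5):
    # HALoS-style merge sketch: while the GPS round trip was in flight,
    # the LPS kept applying worker updates, moving its parameters from
    # `local_at_send` to `local_now`. Rather than overwrite that
    # progress with `global_new`, fold the local delta back in.
    # `alpha` is an assumed mixing weight, not a value from the paper.
    return [g + alpha * (now - then)
            for g, now, then in zip(global_new, local_now, local_at_send)]

merged = merge_global(global_new=[2.0], local_now=[1.0], local_at_send=[0.5])
```

If no local progress happened during the round trip, the merge reduces to simply adopting the global parameters.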
@nathanbarrydev
Nathan Barry
10 days
One way to think about HALoS is that we are running multiple instances of Async Local-SGD, each having multiple workers within the same region. We treat each LPS as a normal Async Local-SGD worker and have another parameter server (the GPS) which they send updates to. (7/n)
1
0
0
@nathanbarrydev
Nathan Barry
10 days
HALoS doesn’t directly address the staleness issue. Instead, it focuses on minimizing computation idle time. HALoS introduces Local Parameter Servers (LPS) within each region and a global parameter server (GPS) which merges updates across regions. (6/n)
1
0
0