Nathan Barry
@nathanbarrydev
Followers: 2K · Following: 27K · Media: 54 · Statuses: 335
Man in the Arena Allocator. Prev @Apple, CS + Math @UTAustin, @zfellows
Austin, TX
Joined June 2020
Playing around with training a tiny 11M parameter character-level text diffusion model! It's a WIP but the code is currently a heavily modified nanochat gpt implementation (to change from autoregressive decoding to diffusion) and trained on the Tiny Shakespeare dataset. The
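Roughly, the training objective for a masked text-diffusion model like this looks like the sketch below (hypothetical names, not the actual repo code): sample a masking rate per sequence, mask that fraction of characters, and train a bidirectional transformer to predict only the masked positions.

```python
# Minimal sketch of one masked-diffusion training step on character ids.
# Hypothetical names; assumes `model` is a bidirectional (non-causal)
# transformer and MASK_ID is a reserved character id.
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed reserved id for the [MASK] character

def diffusion_loss(model, tokens):
    """tokens: (batch, seq_len) character ids, e.g. from Tiny Shakespeare."""
    b, t = tokens.shape
    # Sample a masking rate per sequence -- the "noise level" of this step.
    rate = torch.rand(b, 1, device=tokens.device)
    mask = torch.rand(b, t, device=tokens.device) < rate
    noisy = tokens.masked_fill(mask, MASK_ID)
    logits = model(noisy)                          # (b, t, vocab)
    # Cross-entropy only on the positions that were masked out.
    return F.cross_entropy(logits[mask], tokens[mask])
```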
Added context to my tiny diffusion model to enable sequential generation of longer outputs! Currently the context is a quarter of the sequence length (seq_len=256, context_len=64). I have a theory that the less semantic-value-per-token, the worse the “curse of parallel decoding”
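A sketch of what context-conditioned, block-wise generation could look like (the names and the confidence-based unmasking schedule are assumptions, not the actual implementation): the last context_len generated characters stay fixed while the rest of the window is denoised in parallel.

```python
# Illustrative sketch of block-wise generation with a fixed context prefix.
import torch

SEQ_LEN, CONTEXT_LEN, MASK_ID = 256, 64, 0

@torch.no_grad()
def generate_long(model, prompt, total_len, steps=32):
    out = prompt.clone()                                  # (1, n) running output
    while out.size(1) < total_len:
        ctx = out[:, -CONTEXT_LEN:]
        window = torch.full((1, SEQ_LEN), MASK_ID, device=out.device)
        window[:, :ctx.size(1)] = ctx                     # context stays fixed
        for s in range(steps):                            # iterated parallel denoising
            logits = model(window)
            conf, pred = logits.softmax(-1).max(-1)
            masked = (window == MASK_ID)[0].nonzero(as_tuple=True)[0]
            if masked.numel() == 0:
                break
            # Unmask the most confident chunk of still-masked positions.
            k = max(1, masked.numel() // (steps - s))
            top = masked[conf[0, masked].topk(min(k, masked.numel())).indices]
            window[0, top] = pred[0, top]
        out = torch.cat([out, window[:, ctx.size(1):]], dim=1)
    return out[:, :total_len]
```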
Nice, short post illustrating how simple text (discrete) diffusion can be. Diffusion (i.e. parallel, iterated denoising, top) is the pervasive generative paradigm in image/video, but autoregression (i.e. go left to right, bottom) is the dominant paradigm in text. For audio I've
(5/n) The rest of the post covers how I fine-tuned RoBERTa to do text diffusion! Read the full post here: https://t.co/eYZagfxS35
nathan.rs
A while back, Google DeepMind unveiled Gemini Diffusion, an experimental language model that generates text using diffusion. Unlike traditional GPT-style models that generate one word at a time,...
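A rough paraphrase of the fine-tuning idea (see the linked post for the real code): keep RoBERTa's MLM head, but sample the masking rate uniformly instead of fixing it at ~15%.

```python
# Rough paraphrase, not the post's actual code: fine-tune RoBERTa's MLM head
# with a masking rate sampled uniformly in (0, 1).
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

tok = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

def diffusion_finetune_loss(batch_ids):
    """batch_ids: (batch, seq_len) token ids (special-token handling omitted)."""
    rate = torch.rand(batch_ids.size(0), 1, device=batch_ids.device)
    mask = torch.rand(batch_ids.shape, device=batch_ids.device) < rate
    labels = batch_ids.masked_fill(~mask, -100)            # ignore unmasked positions
    inputs = batch_ids.masked_fill(mask, tok.mask_token_id)
    return model(input_ids=inputs, labels=labels).loss
```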
(4/n) We can see that BERT’s masked language modeling objective is the same as text diffusion, but just for a subset of masking rates. By having variable masking rates (from 0 to 1), we can transform BERT’s masked language modeling objective into a full generative procedure.
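In code, the gap between the two objectives is essentially one line (illustrative):

```python
# Illustrative: the only change to the noising step is how the masking rate
# is chosen. BERT fixes it (~15%); the diffusion objective samples it.
import torch

def bert_rate(batch_size):
    return torch.full((batch_size, 1), 0.15)   # one fixed noise level

def diffusion_rate(batch_size):
    return torch.rand(batch_size, 1)           # every noise level in (0, 1)
```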
(3/n) Applying this idea to language means we need a way to add noise to text and then remove it in stages. The simplest way to do this is a masking‐based noise process: For the forward process, you initially have uncorrupted text. At each iteration, you randomly replace a
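A tiny sketch of that forward (noising) process; the step count and per-step masking probability are assumed values:

```python
# Sketch of the forward masking process: at each of T iterations, every
# surviving token is independently replaced by [MASK] with probability p,
# so the text ends up (nearly) fully masked.
import torch

def forward_masking(tokens, T=10, mask_id=0, p=0.3):
    noisy = tokens.clone()
    trajectory = [noisy.clone()]
    for _ in range(T):
        replace = torch.rand(noisy.shape, device=noisy.device) < p
        noisy = noisy.masked_fill(replace, mask_id)
        trajectory.append(noisy.clone())
    return trajectory   # progressively noisier copies of the input
```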
(2/n) Diffusion models were first popularized in image generation, where the forward process gradually adds Gaussian noise to an image and a neural network is trained to iteratively remove it (the reverse process).
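For comparison, the standard DDPM-style forward step for images is a single noise blend:

```python
# Standard DDPM-style forward step: blend the clean image x0 with Gaussian
# noise according to the cumulative noise schedule value alpha_bar at step t.
import torch

def q_sample(x0, alpha_bar_t):
    """x0: image tensor; alpha_bar_t: scalar tensor in (0, 1)."""
    noise = torch.randn_like(x0)
    return alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * noise
```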
BERT is just a Single Text Diffusion Step! (1/n) When I first read about language diffusion models, I was surprised to find that their training objective was just a generalization of masked language modeling (MLM), something we’ve been doing since BERT from 2018. The first
(6/n) I have a bunch of thoughts and experiments to test out. From this ablation study (and from what I've heard from other researchers), it seems that differing inner step counts don't have much of a negative impact. This sounds too good to be true, but it would be a massive free lunch
(5/n) In the version where the number of inner steps was scaled to match worker speed, I still would have imagined some impact on convergence, for a different set of reasons. Because outer-gradients (really parameter differences) behave similarly to normal gradients, let's think
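For reference, the outer-gradient in DiLoCo-style methods is just a parameter difference that the server then treats like a gradient (sketch with assumed names; the real outer optimizer is typically Nesterov momentum, not plain SGD):

```python
# Sketch (assumed names): the "outer-gradient" a worker sends is the
# difference between the parameters it started from and the parameters it
# reached after its inner steps; the server applies it like a gradient.
import torch

def outer_gradient(start_params, end_params):
    return [p0 - p1 for p0, p1 in zip(start_params, end_params)]

def server_apply(server_params, outer_grad, outer_lr=0.7):
    with torch.no_grad():
        for p, g in zip(server_params, outer_grad):
            p -= outer_lr * g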
(4/n) Earlier in the paper, they saw that the inherent staleness alone, which comes from applying individual worker updates sequentially instead of averaging them and applying them once, led to "considerable performance drops." So I would have imagined that significantly
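Purely illustrative, the two application orders being compared:

```python
# Synchronous DiLoCo averages all workers' outer-gradients and applies them
# once; the async variant applies each one as it arrives, so later arrivals
# were computed against parameters that have already moved (staleness).
import torch

def apply_averaged(params, outer_grads, lr=0.7):
    return params - lr * torch.stack(outer_grads).mean(0)

def apply_sequentially(params, outer_grads, lr=0.7):
    for g in outer_grads:      # each g was computed before the earlier applies
        params = params - lr * g
    return params
```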
(3/n) Their results showed that perplexity wasn't really affected by the level of heterogeneity. I found this surprising. In the naive version, a worker that is twice as slow will apply outer-gradients that are twice as stale, since each worker takes the same number of inner steps.
(2/n) The Async Local-SGD (aka Async DiLoCo) paper looked at how heterogeneous worker speeds affected perplexity. In the naive implementation, each worker had the same number of inner steps. In the improved version, the number of inner steps was scaled relative to worker speed. Results below
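The improved version's fix amounts to something like this (illustrative; the speeds and base step count are made up):

```python
# Illustrative: give each worker a number of inner steps proportional to its
# measured speed, so all workers finish an outer round at roughly the same time.
def scaled_inner_steps(worker_speeds, base_steps=100):
    fastest = max(worker_speeds)
    return [max(1, round(base_steps * s / fastest)) for s in worker_speeds]

# e.g. a worker running at half speed does half the inner steps per round:
print(scaled_inner_steps([1.0, 0.5, 0.25]))   # -> [100, 50, 25]
```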
(1/n) A problem worth solving is being able to train on heterogeneous hardware. Even within the same generation, NVIDIA B300 GPUs are 50% faster than B200s. Companies with many clusters (Meta, Google, etc.) would ideally be able to train a model across their clusters
Their open-source simulator currently supports DiLoCo, Async Local-SGD, and HALoS. It would not be hard to add Overlap Local-SGD or One-step-delay/Eager Update DiLoCo (although Streaming DiLoCo would require a major rewrite). I'll be building off it to run my experiments (9/n)
When an LPS communicates with the GPS, because of the lower bandwidth and increased latency, the LPS continues to apply updates from its workers. When it receives the updated parameters from the GPS, it merges them with its updated local parameters instead of replacing them. (8/n)
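A sketch of one way that merge could work (assumed form; the paper's exact rule may differ): keep the local progress made since the last sync and add it onto the fresh global copy instead of discarding it.

```python
# Assumed sketch of the LPS merge; not the paper's exact rule. When fresh
# global parameters arrive, preserve the local updates applied since the last
# sync by adding that delta onto the global copy.
import torch

def merge_global(local_params, params_at_last_sync, global_params):
    merged = []
    with torch.no_grad():
        for local, snap, glob in zip(local_params, params_at_last_sync, global_params):
            local_delta = local - snap        # worker updates applied since the sync
            merged.append(glob + local_delta) # fold local progress into the global copy
    return merged
```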
One way to think about HALoS is that we are running multiple instances of Async Local-SGD, each having multiple workers within the same region. We treat each LPS as a normal Async Local-SGD worker and have another parameter server (the GPS) which they send updates to. (7/n)
HALoS doesn’t directly address the staleness issue. Instead, it focuses on minimizing computation idle time. HALoS introduces Local Parameter Servers (LPS) within each region and a global parameter server (GPS) which merges updates across regions. (6/n)