Shane Bergsma

@ShaneBergsma

Followers: 302 · Following: 282 · Media: 9 · Statuses: 143

Man bites data

Toronto, Ontario
Joined February 2012
@ShaneBergsma
Shane Bergsma
28 days
Another new preprint from @CerebrasSystems 🚨📄- this time on training *re-evaluation* curves (TRECs) for data curriculums in LLMs. Everyone sticks high-quality data at the end of training… we show the sweet spot is often earlier — and we can predict it. https://t.co/C8X1C3hUWL
1
1
7
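A minimal sketch of the placement idea in Python, assuming hypothetical inputs (a pool of base and high-quality documents, and a predicted start fraction); the actual prediction of the sweet spot via TRECs is what the preprint provides.

def build_curriculum(base_docs, hq_docs, total_steps, hq_start_frac=0.7):
    """Place the high-quality (HQ) block at a predicted point in training
    rather than at the very end. hq_start_frac is a hypothetical knob
    standing in for the TREC-predicted sweet spot (0.7 = start HQ data
    at 70% of training)."""
    # Illustrative: one document consumed per step, HQ block sized by share.
    hq_steps = int(total_steps * len(hq_docs) / (len(base_docs) + len(hq_docs)))
    hq_start = min(int(total_steps * hq_start_frac), total_steps - hq_steps)
    return ["hq" if hq_start <= s < hq_start + hq_steps else "base"
            for s in range(total_steps)]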
@AtliKosson
Atli Kosson
8 days
The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work? We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵
11
48
329
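A minimal sketch of the combination described here, assuming PyTorch AdamW and Adam-style µP (hidden-layer LR shrinks with width). Note PyTorch couples weight decay to LR, so "independent" weight decay means dividing it back out; base_lr, base_width, and the per-parameter rule are illustrative, and full µP also treats embedding/output layers specially.

import torch

def mup_adamw_with_iwd(model, base_lr=1e-2, base_width=256, width=4096, iwd=1e-4):
    groups = []
    for p in model.parameters():
        # Illustrative µP rule: scale LR for 2D hidden weights only.
        lr = base_lr * (base_width / width) if p.ndim >= 2 else base_lr
        groups.append({
            "params": [p],
            "lr": lr,
            # PyTorch applies lr * weight_decay * p each step; setting
            # weight_decay = iwd / lr makes the per-step decay just iwd,
            # i.e., independent of the (width-scaled) learning rate.
            "weight_decay": iwd / lr,
        })
    return torch.optim.AdamW(groups)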
@ShaneBergsma
Shane Bergsma
28 days
Fig 1:
• Valley in TREC ≠ train-loss minimum → best spot for HQ data
• Shape tracks AdamW τ (via weight decay)
• Curves align across 1000× scaling at fixed τ (and TPP)
0
0
0
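For reference, one common definition of the AdamW timescale τ, taken from the EMA view cited later in this feed (an assumption; the paper's exact normalization may differ): with learning rate η, weight decay λ, batch size B and dataset size D (both in tokens),

\[
\tau_{\text{iter}} = \frac{1}{\eta\lambda}, \qquad \tau = \frac{B\,\tau_{\text{iter}}}{D} = \frac{B}{\eta\,\lambda\,D},
\]

so τ measures what fraction of the dataset the weights effectively average over.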
@ShikaiQiu
Shikai Qiu
1 month
Beautiful work on pretraining science using scaling collapse to precisely predict, debug, and tune LLM training from small-scale and partial runs. So many insights on going beyond μP!
@ShaneBergsma
Shane Bergsma
1 month
@ShikaiQiu @Locchiu @andrewgwils @xidulu @laurence_ai (4/4) Our Power Lines (NeurIPS 2025) showed τ* is set by TPP. 👉 Fixed TPP + optimal τ ⇒ collapse emerges naturally. With Claire Zhang, @DeyNolan, Shaheer Muhammad, @gurpreetgosal_, Joel Hestness we trained Celerity on this recipe… it does collapse, and sits on the compute frontier.
0
0
2
@ShaneBergsma
Shane Bergsma
1 month
@ShikaiQiu @Locchiu @andrewgwils (3/4) Collapse in LLMs needs 3 aligned controls:
• same LR schedule
• same TPP
• same optimizer timescale τ (@xidulu @laurence_ai)
Sweep B, λ, or η → same τ ⇒ same curve. Sweep τ itself → curves peel apart.
1
1
3
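A toy numeric check of the invariance claimed here, using the τ = B/(ηλD) definition assumed above; all values are illustrative.

D = 100e9  # dataset size in tokens (illustrative)
settings = [
    # (batch size in tokens, eta, lambda)
    (1e6, 3e-3, 0.10),
    (2e6, 6e-3, 0.10),  # double B and eta together -> same tau
    (1e6, 6e-3, 0.05),  # double eta, halve lambda  -> same tau
]
for B, eta, lam in settings:
    tau = B / (eta * lam * D)
    print(f"B={B:.0e} eta={eta} lambda={lam} -> tau={tau:.4f}")
# All three settings share one tau, so per the thread their loss curves
# should coincide; sweeping tau itself would peel the curves apart.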
@ShaneBergsma
Shane Bergsma
1 month
(2/4) Earlier work by @ShikaiQiu @Locchiu @andrewgwils J. Pennington & A. Agarwala showed collapse at small scale and called for testing it in full LLM-scale ladders. ✅ We did it.
1
0
3
@ShaneBergsma
Shane Bergsma
1 month
(1/4) @CerebrasSystems Hot off the presses 🔥📄 https://t.co/ahPvKCFN9g If you're spending $1B to train an LLM, you need to know it’s on track—every step of the way. With optimal AdamW τ + fixed TPP, loss curves collapse to a universal path → an early-warning signal for training.
2
8
25
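A minimal sketch of what such an early-warning check might look like, assuming a reference collapsed curve obtained from small runs at the same TPP and optimal τ; the paper's actual curve normalization is not reproduced here, and check_on_track and tol are hypothetical names.

import numpy as np

def check_on_track(frac_done, losses, ref_frac, ref_loss, tol=0.01):
    """frac_done: fractions of the schedule completed at each logged step.
    losses: the big run's losses at those points. ref_frac/ref_loss: the
    reference curve. Returns (on_track?, worst relative deviation)."""
    expected = np.interp(frac_done, ref_frac, ref_loss)
    rel_dev = np.abs(np.asarray(losses) - expected) / expected
    return bool(np.all(rel_dev < tol)), float(rel_dev.max())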
@dmsobol
Daria Soboleva
5 months
Major finding #1: λ=0.1 used in the majority of LLMs is suboptimal! Our work shows that optimal weight decay (λ) scales linearly with batch size. Most researchers use the same λ regardless of batch size, leaving performance on the table.
1
1
5
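A sketch of the stated rule, assuming λ* is linear in batch size around a tuned reference point; the full scaling law in the paper also involves dataset and model size.

def scale_weight_decay(B, B_ref=1e6, lam_ref=0.1):
    """Illustrative: optimal weight decay grows linearly with batch size,
    anchored at a tuned reference (lam_ref at B_ref tokens per batch)."""
    return lam_ref * (B / B_ref)

print(scale_weight_decay(2e6))  # 0.2: doubling B doubles lambda*
print(scale_weight_decay(5e5))  # 0.05: halving B halves it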
@ShaneBergsma
Shane Bergsma
5 months
Shoutout EMA view of AdamW @xidulu @laurence_ai 😆
0
1
5
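The EMA view in one line, assuming decoupled decay applied as ηλ per step and writing u_t for the Adam update direction:

\[
w_{t+1} = (1-\eta\lambda)\,w_t - \eta\,u_t \;\Longrightarrow\; w_T = (1-\eta\lambda)^T w_0 - \eta \sum_{k=0}^{T-1} (1-\eta\lambda)^{T-1-k}\,u_k,
\]

i.e., the weights are an exponential moving average of recent updates with timescale 1/(ηλ) iterations.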
@ShaneBergsma
Shane Bergsma
5 months
Power Lines paper now out: https://t.co/AwAgxyM735 TL;DR - we identify how AdamW's weight decay should scale with batch size, dataset size, and model size in LLM pre-training. We also investigate the scaling of both "optimal" and "critical" batch size.
arxiv.org
Efficient LLM pre-training requires well-tuned hyperparameters (HPs), including learning rate η and weight decay λ. We study scaling laws for HPs: formulas for how to scale HPs as we...
1
19
96
@DeyNolan
Nolan Dey
6 months
(1/7) @CerebrasSystems Paper drop: https://t.co/dCATF7nMCp TLDR: We introduce CompleteP, which offers depth-wise hyperparameter (HP) transfer (Left), FLOP savings when training deep models (Middle), and a larger range of compute-efficient width/depth ratios (Right).  🧵 👇
12
67
409
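A minimal sketch of the general mechanism behind depth-wise transfer (a depth-dependent rescaling so HPs tuned on a shallow proxy stay near-optimal at depth); CompleteP's exact prescription (α = 1 in the paper's notation, with matching LR and init scalings) is in the paper and not reproduced here. The block structure and base_depth are illustrative.

import torch.nn as nn

class DepthScaledBlock(nn.Module):
    """Residual block whose branch contribution shrinks as depth grows."""
    def __init__(self, d_model, depth, base_depth=8, alpha=1.0):
        super().__init__()
        self.branch = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Depth-dependent multiplier; alpha=1.0 here only echoes the
        # paper's naming, not its full recipe.
        self.scale = (base_depth / depth) ** alpha

    def forward(self, x):
        return x + self.scale * self.branch(x)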
@cerebras
Cerebras
6 months
It’s #ICLR2025 week, and we’re proud to share that Team Cerebras will be presenting their paper: "Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs" at @iclr_conf! Big congrats to the authors, your work is powering the future of AI compute.
2
5
32
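The schedule named in the title is simple to write down; the warmup handling below is an illustrative choice, not necessarily the paper's.

def linear_to_zero_lr(step, total_steps, peak_lr, warmup_steps=0):
    """Linear warmup (optional) followed by linear decay to exactly zero."""
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    frac = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return peak_lr * max(0.0, 1.0 - frac)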
@ArtificialAnlys
Artificial Analysis
1 year
Cerebras has set a new record for AI inference speed, serving Llama 3.1 8B at 1,850 output tokens/s and 70B at 446 output tokens/s. @CerebrasSystems has just launched their API inference offering, powered by their custom wafer-scale AI accelerator chips. Cerebras Inference is
12
67
308
@ShaneBergsma
Shane Bergsma
7 years
It's never a bad idea to check,
1
0
1
@ShaneBergsma
Shane Bergsma
7 years
OMG, now food trucks are even part of the A.I. bandwagon!
1
0
2
@ShaneBergsma
Shane Bergsma
8 years
My son, after reading half the books: "J.R.R. Tolkien is a man? I had no idea." Thank you, @jk_rowling
0
1
8
@vyedin
@vyedin
8 years
In an effort to foster a more cooperative spirit between different parts of my code, I no longer pass *arguments* to a function. Instead when one function calls another, it passes along some *gentle feedback*.
39
515
2K
@ShaneBergsma
Shane Bergsma
8 years
The whole group? Wow, this migration of academics to industry is getting out of control.
0
9
26
@ShaneBergsma
Shane Bergsma
9 years
Wikipedia (one of the supreme achievements of humanity) doesn't get enough love, so just let me say, "thank you, Wikipedia."
0
0
1