 
            
Shane Bergsma
@ShaneBergsma
Followers: 302 · Following: 282 · Media: 9 · Statuses: 143
Man bites data
Toronto, Ontario
Joined February 2012
            
           Another new preprint from @CerebrasSystems 🚨📄- this time on training *re-evaluation* curves (TRECs) for data curriculums in LLMs. Everyone sticks high-quality data at the end of training… we show the sweet spot is often earlier — and we can predict it.  https://t.co/C8X1C3hUWL 
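For concreteness, here is a minimal sketch of what "training re-evaluation" could look like, assuming (my reading of the tweet, not a statement of the paper's method) that a TREC is built by re-scoring each training batch with the final checkpoint; every name in the snippet is a hypothetical placeholder.

```python
# Hypothetical sketch (my reading of the tweet, not the paper's protocol):
# score every training batch with the *final* checkpoint, in the order the
# batches were seen, and look for the valley of the resulting curve.
# `final_model`, `batches_in_training_order`, and `loss_fn` are placeholders.
import torch

@torch.no_grad()
def training_reevaluation_curve(final_model, batches_in_training_order, loss_fn):
    final_model.eval()
    curve = []
    for step, (inputs, targets) in enumerate(batches_in_training_order):
        loss = loss_fn(final_model(inputs), targets)
        curve.append((step, float(loss)))
    return curve

def valley_step(curve):
    # The step whose data the final model fits best -- per the tweet, a
    # candidate position for high-quality data, often *before* the end.
    return min(curve, key=lambda pair: pair[1])[0]
```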
          
          
                
             The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work? We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵 
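A hedged sketch of one common way to get LR-independent weight decay from a stock AdamW; this is an illustration under my own assumptions, not necessarily the recipe in the paper. PyTorch's AdamW applies decay as p ← p − lr·weight_decay·p, so dividing a desired independent decay λ by the learning rate decouples the two (for a constant lr).

```python
# Sketch: "independent weight decay" (IWD) via torch.optim.AdamW.
# torch's AdamW decays as  p <- p - lr * weight_decay * p,  i.e. decay strength
# is coupled to lr.  Passing weight_decay = lambda_indep / lr makes the
# per-step shrinkage equal to lambda_indep regardless of the (muP-scaled) lr.
# Illustrative assumption only, not necessarily the paper's recipe.
import torch

def adamw_with_independent_wd(params, lr, lambda_indep=1e-4, betas=(0.9, 0.95)):
    # Caveat: with an lr schedule, weight_decay would need re-scaling whenever
    # lr changes for the decay to stay truly lr-independent.
    return torch.optim.AdamW(params, lr=lr, betas=betas,
                             weight_decay=lambda_indep / lr)
```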
          
                
Fig 1:
• Valley in TREC ≠ train-loss minimum → best spot for HQ data
• Shape tracks AdamW τ (via weight decay)
• Curves align across 1000× scaling at fixed τ (and TPP)
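For readers tracking the τ that keeps appearing in these threads: a sketch of the usual AdamW weight-decay timescale, in notation I am assuming rather than quoting (η = learning rate, λ = weight decay, B = batch size in tokens, D = training tokens, T = D/B optimizer steps).

```latex
% AdamW's decoupled decay multiplies the weights by (1 - \eta\lambda) each
% step, so they behave like an EMA of recent updates with timescale
\tau_{\mathrm{iter}} \;=\; \frac{1}{\eta\lambda} \ \text{steps},
\qquad
\tau \;=\; \frac{\tau_{\mathrm{iter}}}{T} \;=\; \frac{1}{\eta\lambda T} \;=\; \frac{B}{\eta\lambda D}.
```

Under this reading, fixing τ means the optimizer's effective memory spans the same fraction of the run at every scale.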
          
                
Beautiful work on pretraining science using scaling collapse to precisely predict, debug, and tune LLM training from small-scale and partial runs. So many insights on going beyond μP!
@ShikaiQiu @Locchiu @andrewgwils @xidulu @laurence_ai (4/4) Our Power Lines (NeurIPS 2025) showed τ* is set by TPP. 👉 Fixed TPP + optimal τ ⇒ collapse emerges naturally. With Claire Zhang, @DeyNolan, Shaheer Muhammad, @gurpreetgosal_, and Joel Hestness, we trained Celerity on this recipe…it does collapse, and sits on the compute frontier.
          
          
                
@ShikaiQiu @Locchiu @andrewgwils (3/4) Collapse in LLMs needs 3 aligned controls:
• same LR schedule
• same TPP
• same optimizer timescale τ (@xidulu @laurence_ai)
Sweep B, λ, or η → same τ ⇒ same curve. Sweep τ itself → curves peel apart.
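To make the sweep claim concrete under the timescale definition sketched above (still my assumed notation, not the paper's): settings that leave B/(ηλD) unchanged share one τ, while changing τ itself moves the curve.

```python
# Tiny worked example with the assumed definition tau = B / (eta * lam * D).
D = 100e9  # total training tokens (arbitrary illustrative number)

def tau(B, eta, lam, D=D):
    return B / (eta * lam * D)

print(tau(B=1e6, eta=1e-2, lam=1e-1))  # baseline            -> ~0.01
print(tau(B=2e6, eta=2e-2, lam=1e-1))  # doubled B and eta   -> ~0.01 (same curve)
print(tau(B=1e6, eta=2e-2, lam=5e-2))  # traded eta for lam  -> ~0.01 (same curve)
print(tau(B=1e6, eta=1e-2, lam=2e-1))  # swept tau itself    -> ~0.005 (peels apart)
```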
          
          
                
(2/4) Earlier work by @ShikaiQiu @Locchiu @andrewgwils, J. Pennington & A. Agarwala showed collapse at small scale and called for testing it in full LLM-scale ladders. ✅ We did it.
          
                
             (1/4) @CerebrasSystems Hot off the presses 🔥📄  https://t.co/ahPvKCFN9g  If you're spending $1B to train an LLM, you need to know it’s on track—every step of the way. With optimal AdamW τ + fixed TPP, loss curves collapse to a universal path → an early-warning signal for training. 
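A heavily hedged sketch of what "early-warning signal" could mean operationally: compare a partial run's (suitably normalized) loss curve, indexed by fraction of training, against a reference "collapsed" curve from small-scale runs, and flag large deviations. Every name, the normalization, and the threshold below are placeholders of mine, not the paper's tooling.

```python
# Hypothetical early-warning check (placeholder names and threshold).
import numpy as np

def off_track(ref_frac, ref_loss, run_frac, run_loss, rel_tol=0.02):
    """True if the partial run has peeled away from the reference curve."""
    mask = ref_frac <= run_frac[-1]          # only the portion completed so far
    if not np.any(mask):
        return False
    observed = np.interp(ref_frac[mask], run_frac, run_loss)
    rel_gap = np.abs(observed - ref_loss[mask]) / np.abs(ref_loss[mask])
    return bool(np.max(rel_gap) > rel_tol)
```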
          
                
             Power Lines paper now out:  https://t.co/AwAgxyM735  TL;DR - we identify how AdamW's weight decay should scale with batch size, dataset size, and model size in LLM pre-training. We also investigate the scaling of both "optimal" and "critical" batch size. 
          
            
            arxiv.org
              Efficient LLM pre-training requires well-tuned hyperparameters (HPs), including learning rate η and weight decay λ. We study scaling laws for HPs: formulas for how to scale HPs as we...
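One hedged way to read the batch-size/dataset-size part of that claim, using the timescale definition sketched earlier and the thread's statement that the optimal τ* is set by TPP (so it is fixed once TPP is fixed); this is a reconstruction, not a quote of the paper's fitted law.

```latex
% Holding \tau^{*} = \frac{B}{\eta \lambda D} fixed (at fixed TPP) forces
\lambda^{*} \;\propto\; \frac{B}{\eta\,D},
% i.e. optimal weight decay grows with batch size and shrinks with dataset
% size; model size enters indirectly, through how \eta and TPP scale.
```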
            
                
             (1/7) @CerebrasSystems Paper drop:  https://t.co/dCATF7nMCp  TLDR: We introduce CompleteP, which offers depth-wise hyperparameter (HP) transfer (Left), FLOP savings when training deep models (Middle), and a larger range of compute-efficient width/depth ratios (Right).  🧵 👇 
          
                
             It’s #ICLR2025 week, and we’re proud to share that Team Cerebras will be presenting their paper: "Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs" at @iclr_conf! Big congrats to the authors, your work is powering the future of AI compute. 
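For anyone curious what the schedule in that title looks like in code, a minimal sketch of a linear decay to exactly zero; the warmup is my own illustrative addition, not something the title specifies.

```python
# Minimal sketch of a linear-to-zero LR schedule (warmup length is my own
# illustrative choice; the paper's title only commits to the linear decay).
import torch

def linear_to_zero_schedule(optimizer, total_steps, warmup_steps=100):
    def lr_scale(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps                        # linear warmup
        remaining = total_steps - step
        return max(remaining / (total_steps - warmup_steps), 0.0)   # decay to 0
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)
```

Usage follows the usual pattern: build it once, then call scheduler.step() after each optimizer step so the rate hits exactly zero on the final step.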
          
                
             Cerebras has set a new record for AI inference speed, serving Llama 3.1 8B at 1,850 output tokens/s and 70B at 446 output tokens/s. @CerebrasSystems has just launched their API inference offering, powered by their custom wafer-scale AI accelerator chips. Cerebras Inference is 
          
                
My son, after reading half the books: "J.R.R. Tolkien is a man? I had no idea." Thank you, @jk_rowling
          
          
                
             In an effort to foster a more cooperative spirit between different parts of my code, I no longer pass *arguments* to a function. Instead when one function calls another, it passes along some *gentle feedback*. 
          
                
             The whole group? Wow, this migration of academics to industry is getting out of control. 
          
                
             Wikipedia (one of the supreme achievements of humanity) doesn't get enough love, so just let me say, "thank you, Wikipedia." 
          
                