Nolan Dey
@DeyNolan
Followers: 459
Following: 24
Media: 14
Statuses: 34
Research Scientist @ Cerebras Systems
Toronto
Joined March 2022
Another new preprint from @CerebrasSystems 🚨📄- this time on training *re-evaluation* curves (TRECs) for data curriculums in LLMs. Everyone sticks high-quality data at the end of training… we show the sweet spot is often earlier — and we can predict it. https://t.co/C8X1C3hUWL
1
1
7
(1/4) @CerebrasSystems Hot off the presses 🔥📄 https://t.co/ahPvKCFN9g If you're spending $1B to train an LLM, you need to know it’s on track—every step of the way. With optimal AdamW τ + fixed TPP, loss curves collapse to a universal path → an early-warning signal for training.
2
8
25
Power Lines paper now out: https://t.co/AwAgxyM735 TL;DR - we identify how AdamW's weight decay should scale with batch size, dataset size, and model size in LLM pre-training. We also investigate the scaling of both "optimal" and "critical" batch size.
arxiv.org
Efficient LLM pre-training requires well-tuned hyperparameters (HPs), including learning rate η and weight decay λ. We study scaling laws for HPs: formulas for how to scale HPs as we...
1
19
96
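A rough sketch of the kind of rule the Power Lines tweet describes, not the paper's actual result: assume the AdamW timescale in tokens, B/(η·λ), should be a fixed fraction of the total training tokens D, which makes λ grow with batch size and shrink with dataset size at a fixed learning rate. The fraction and function name below are placeholders; the paper derives the real scaling laws and constants.

```python
# Illustrative sketch only: assumes the AdamW EMA timescale in tokens,
# tau_tokens = batch_size_tokens / (lr * weight_decay), should be a fixed
# fraction of the total training tokens D. The fraction below is a made-up
# placeholder; the paper derives the actual scaling laws and constants.

TAU_FRACTION_OF_DATASET = 0.1  # hypothetical placeholder, not from the paper


def suggested_weight_decay(batch_size_tokens: float,
                           learning_rate: float,
                           dataset_tokens: float) -> float:
    """Pick lambda so that B / (eta * lambda) == TAU_FRACTION_OF_DATASET * D.

    Rearranging: lambda = B / (eta * TAU_FRACTION_OF_DATASET * D), i.e.
    lambda grows linearly with batch size and shrinks with dataset size
    at a fixed learning rate.
    """
    tau_tokens = TAU_FRACTION_OF_DATASET * dataset_tokens
    return batch_size_tokens / (learning_rate * tau_tokens)


if __name__ == "__main__":
    # Example: 2M-token batches, lr 1e-3, 100B training tokens.
    lam = suggested_weight_decay(2e6, 1e-3, 100e9)
    print(f"suggested weight decay: {lam:.3g}")
```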
(7/7) If you are looking to conduct research into deep models, contact us to collaborate! We are also hiring research scientists ( https://t.co/QG7zTeH5Uc) and research engineers ( https://t.co/Y0RKb6KpK6)!
job-boards.greenhouse.io
0
4
28
(6/7) Implementing CompleteP is very simple, only requiring two lines of code. We provide a minimal implementation here: https://t.co/hgFMJBATH8
4
2
31
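The linked gist is the official minimal implementation; the sketch below only illustrates the depth-dependent residual scaling idea the thread refers to, with α = 1 (CompleteP) versus α = 0.5. The module structure is hypothetical, and the full parameterization also prescribes depth-dependent learning-rate and initialization rules described in the paper.

```python
import torch
import torch.nn as nn


class DepthScaledResidualBlock(nn.Module):
    """Toy residual block with an L^{-alpha} branch multiplier.

    alpha = 1.0 corresponds to the CompleteP-style depth scaling discussed in
    the thread; alpha = 0.5 is the weaker scaling it is compared against.
    Illustration only: the official minimal implementation is in the gist
    linked above, and the full parameterization also adjusts optimizer
    hyperparameters with depth.
    """

    def __init__(self, width: int, num_layers: int, alpha: float = 1.0):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(width, 4 * width), nn.GELU(), nn.Linear(4 * width, width)
        )
        self.branch_scale = num_layers ** (-alpha)  # depth-dependent factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual update is down-weighted as the network gets deeper.
        return x + self.branch_scale * self.branch(x)


if __name__ == "__main__":
    x = torch.randn(2, 64)
    block = DepthScaledResidualBlock(width=64, num_layers=48, alpha=1.0)
    print(block(x).shape)  # torch.Size([2, 64])
```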
(5/7) We propose a novel criterion called complete feature learning, which states that as depth increases, a model should not collapse to its linearization. Only CompleteP (α = 1) ensures both stable training and complete feature learning.
1
3
25
(4/7) CompleteP enables a wider range of model width/depth ratios to remain compute-efficient, unlocking shapes better suited for different hardware settings and operational contexts.
1
1
20
(3/7) The deeper the model, the greater the FLOP savings CompleteP (α = 1) delivers over µP. In 1.5B models, CompleteP saves 11.8% of FLOPs at the optimal N:L and 34.4% of FLOPs in the deepest models.
1
1
23
(2/7) CompleteP keeps optimal HPs stable at any depth, unlike popular parameterizations such as SP, μP, and α = 0.5. This dramatically reduces HP tuning budgets for deep models.
1
1
29
(1/7) @CerebrasSystems Paper drop: https://t.co/dCATF7nMCp TLDR: We introduce CompleteP, which offers depth-wise hyperparameter (HP) transfer (Left), FLOP savings when training deep models (Middle), and a larger range of compute-efficient width/depth ratios (Right). 🧵 👇
12
67
409
Published "Neuron-based explanations of neural networks sacrifice completeness and interpretability" in TMLR 2025! TL;DR: The most important principal components provide more complete and interpretable explanations than the most important neurons. https://t.co/T4Gokt5aAk
0
0
3
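A minimal sketch of the comparison the paper makes, on synthetic activations: an explanation built from the top-k principal components versus one built from the top-k individual neurons, scored by how much of the activation signal each reconstruction preserves. The variance-based neuron ranking and the random data are illustrative choices, not the paper's setup.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a (num_examples, num_neurons) matrix of hidden activations.
rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 512))

k = 8  # number of "explanation directions" to keep

# Neuron-based explanation: keep the k individual neurons with the highest
# activation variance (a common importance heuristic, used here only for
# illustration).
top_neurons = np.argsort(activations.var(axis=0))[-k:]
neuron_recon = np.zeros_like(activations)
neuron_recon[:, top_neurons] = activations[:, top_neurons]

# PCA-based explanation: keep the k most important principal components.
pca = PCA(n_components=k).fit(activations)
pc_recon = pca.inverse_transform(pca.transform(activations))


def explained_fraction(recon, full):
    """Completeness proxy: fraction of activation variance the reconstruction keeps."""
    return 1.0 - np.sum((full - recon) ** 2) / np.sum((full - full.mean(0)) ** 2)


print("top-k neurons :", explained_fraction(neuron_recon, activations))
print("top-k PCs     :", explained_fraction(pc_recon, activations))
```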
🎉We're excited to announce our joint work with @Cerebras on a new guide to Maximal Update Parameterization (μP) and μTransfer!🎉 This practitioner's guide (and implementation) aims to make μP more accessible and easier to implement for the broader training community. 🧵
2
28
183
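A tiny sketch of the μTransfer workflow the guide covers, under the standard μP rule for Adam that hidden-weight learning rates scale as base_width / width relative to a small proxy model. Embedding, output-layer, and initialization rules are part of the full recipe in the guide and are omitted here; the numbers are hypothetical.

```python
# Sketch of one muP/muTransfer rule for Adam: tune the learning rate on a
# small proxy model of width `base_width`, then transfer it to a larger model
# by scaling the LR of hidden (matrix-like) weights by base_width / width.
# Embedding, bias, and output-layer rules (plus init scaling) are part of the
# full recipe in the guide and are not shown here.

def hidden_lr_for_width(base_lr: float, base_width: int, width: int) -> float:
    """Adam LR for hidden weights under muP-style width scaling."""
    return base_lr * base_width / width


if __name__ == "__main__":
    base_lr = 6e-3  # tuned on a width-256 proxy (hypothetical numbers)
    for width in (256, 1024, 4096):
        print(width, hidden_lr_for_width(base_lr, base_width=256, width=width))
```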
(1/n) Paper drop: https://t.co/fcr3Jr2ckD TLDR: We introduce the sparse maximal update parameterization (SμPar), which ensures optimal HPs remain the same for any width or sparsity level. This dramatically reduces HP tuning costs, allowing SμPar to achieve superior losses. 🧵 👇
4
36
177
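A heavily simplified illustration of the SμPar idea, under the assumption that a sparse layer with density d and fan-in n behaves like a dense layer with effective fan-in d·n, so μP-style corrections are applied with that effective fan-in. The exact SμPar rules for each weight type are in the paper; the functions and numbers below are placeholders.

```python
# Illustration only: assumes a sparse layer with density d and fan_in n acts
# like a dense layer with effective fan-in d * n, so muP-style corrections use
# that effective fan-in. The exact SmuPar rules are in the paper; names and
# numbers here are hypothetical.

def effective_fan_in(fan_in: int, density: float) -> float:
    return density * fan_in


def hidden_adam_lr(base_lr: float, base_fan_in: int, fan_in: int,
                   density: float = 1.0) -> float:
    """Adam LR for a (possibly sparse) hidden layer, relative to a dense base."""
    return base_lr * base_fan_in / effective_fan_in(fan_in, density)


def hidden_init_std(fan_in: int, density: float = 1.0) -> float:
    """Init std that keeps activation scale roughly constant under sparsity."""
    return (1.0 / effective_fan_in(fan_in, density)) ** 0.5


if __name__ == "__main__":
    for density in (1.0, 0.5, 0.125):
        print(density,
              hidden_adam_lr(6e-3, base_fan_in=256, fan_in=1024, density=density),
              hidden_init_std(1024, density))
```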
Successfully ported @karpathy's nanoGPT to the new @Apple MLX framework, possibly enabling quick prototyping of GPT-style model training on Mac GPUs. Check out the project: https://t.co/D6YJuwgaT2. Got a new M3 Pro and wanted to learn about MLX over the holidays lol.
github.com
Port of Andrej Karpathy's nanoGPT to Apple MLX framework. - vithursant/nanoGPT_mlx
3
16
94
📣 Paper drop: Position Interpolation Improves ALiBi Extrapolation. We found a simple method to 2x the context length of models that use ALiBi. This lets models like BTLM-3B-8K and MPT-7B-8K run high-quality inference at up to 16K context with no additional fine-tuning. 👇
2
16
75
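A sketch of the general mechanism, not necessarily the paper's exact formulation: ALiBi biases attention logits by a per-head slope times the query-key distance, and position interpolation rescales those distances by train_length / sequence_length so that a longer context maps back into the distance range seen during training. Shapes and slope handling below are illustrative.

```python
import numpy as np


def alibi_bias(seq_len, n_heads, train_len=None):
    """Causal ALiBi bias, optionally with position-interpolated distances.

    Sketch only: bias[h, i, j] = -slope_h * (i - j) for j <= i. With
    interpolation, distances are scaled by train_len / seq_len so the largest
    distance matches what the model saw during training. Whether this matches
    the paper's exact formulation is an assumption.
    """
    # Standard ALiBi slopes for a power-of-two head count.
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])

    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]       # (i - j), negative above the diagonal
    dist = np.tril(dist).astype(np.float64)  # causal: only j <= i contributes

    if train_len is not None and seq_len > train_len:
        dist = dist * (train_len / seq_len)  # position interpolation

    return -slopes[:, None, None] * dist     # (n_heads, seq_len, seq_len)


if __name__ == "__main__":
    bias = alibi_bias(seq_len=16, n_heads=8, train_len=8)
    print(bias.shape, bias.min())
```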
We just dropped the BTLM-3B-8K paper on arXiv! It distills our recipe for training SOTA LLMs: - Extensively deduplicated dataset (SlimPajama) - Hyperparameter search using muP - Variable sequence length training + ALiBi - Aggressive LR decay https://t.co/cc4ANJwTrZ
2
27
112
Cerebras BTLM-3B-8K model crosses 1M downloads🤯 It's the #1 ranked 3B language model on @huggingface! A big thanks to all the devs out there building on top of open source models 🙌
4
52
196
📣 New dataset drop! Introducing SlimPajama-627B: the largest extensively deduplicated, multi-corpora, open-source dataset for training large language models. 🧵 https://t.co/bwsSz4d9hs
13
181
658
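SlimPajama ships with its own preprocessing code; the snippet below is only a generic illustration of MinHash-LSH near-duplicate detection of the sort used for large-scale deduplication, written with the datasketch library. The n-gram size, threshold, and preprocessing are placeholders rather than the actual SlimPajama settings.

```python
from datasketch import MinHash, MinHashLSH


def minhash(text, num_perm=128, ngram=13):
    """MinHash over lowercased character n-grams (placeholder preprocessing)."""
    m = MinHash(num_perm=num_perm)
    text = text.lower()
    for i in range(max(1, len(text) - ngram + 1)):
        m.update(text[i:i + ngram].encode("utf-8"))
    return m


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog!",  # near-duplicate of "a"
    "c": "large language models are trained on web text",
}

# LSH index at a placeholder Jaccard threshold; a document is dropped if any
# already-kept document is flagged as a near-duplicate.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = []
for key, text in docs.items():
    m = minhash(text)
    if lsh.query(m):  # an already-kept doc is a near-duplicate
        continue
    lsh.insert(key, m)
    kept.append(key)

print("kept:", kept)  # typically ['a', 'c'] (MinHash is approximate)
```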
🚨 New podcast: how we made Cerebras-GPT with @DeyNolan and @QuentinAnthon15. A deep look at what it's like to train on Cerebras and the tradeoffs between compute-optimal and inference-optimal training. https://t.co/unHgSK8m2s
0
5
18