Nolan Dey
@DeyNolan
Followers: 459
Following: 24
Media: 14
Statuses: 34
Research Scientist @ Cerebras Systems
Toronto
Joined March 2022
Another new preprint from @CerebrasSystems 🚨📄- this time on training *re-evaluation* curves (TRECs) for data curriculums in LLMs. Everyone sticks high-quality data at the end of training… we show the sweet spot is often earlier — and we can predict it. https://t.co/C8X1C3hUWL
1
1
7
(1/4) @CerebrasSystems Hot off the presses 🔥📄 https://t.co/ahPvKCFN9g If you're spending $1B to train an LLM, you need to know it’s on track—every step of the way. With optimal AdamW τ + fixed TPP, loss curves collapse to a universal path → an early-warning signal for training.
2
8
25
Power Lines paper now out: https://t.co/AwAgxyM735 TL;DR - we identify how AdamW's weight decay should scale with batch size, dataset size, and model size in LLM pre-training. We also investigate the scaling of both "optimal" and "critical" batch size.
arxiv.org
Efficient LLM pre-training requires well-tuned hyperparameters (HPs), including learning rate η and weight decay λ. We study scaling laws for HPs: formulas for how to scale HPs as we...
1
19
96
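A rough sketch of the kind of rule the Power Lines tweet describes, not the paper's actual result: assume the AdamW timescale in tokens, B/(η·λ), should be a fixed fraction of the total training tokens D, which makes λ grow with batch size and shrink with dataset size at a fixed learning rate. The fraction and function name below are placeholders; the paper derives the real scaling laws and constants.

```python
# Illustrative sketch only: assumes the AdamW EMA timescale in tokens,
# tau_tokens = batch_size_tokens / (lr * weight_decay), should be a fixed
# fraction of the total training tokens D. The fraction below is a made-up
# placeholder; the paper derives the actual scaling laws and constants.

TAU_FRACTION_OF_DATASET = 0.1  # hypothetical placeholder, not from the paper


def suggested_weight_decay(batch_size_tokens: float,
                           learning_rate: float,
                           dataset_tokens: float) -> float:
    """Pick lambda so that B / (eta * lambda) == TAU_FRACTION_OF_DATASET * D.

    Rearranging: lambda = B / (eta * TAU_FRACTION_OF_DATASET * D), i.e.
    lambda grows linearly with batch size and shrinks with dataset size
    at a fixed learning rate.
    """
    tau_tokens = TAU_FRACTION_OF_DATASET * dataset_tokens
    return batch_size_tokens / (learning_rate * tau_tokens)


if __name__ == "__main__":
    # Example: 2M-token batches, lr 1e-3, 100B training tokens.
    lam = suggested_weight_decay(2e6, 1e-3, 100e9)
    print(f"suggested weight decay: {lam:.3g}")
```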
(7/7) If you are looking to conduct research into deep models, contact us to collaborate! We are also hiring research scientists ( https://t.co/QG7zTeH5Uc) and research engineers ( https://t.co/Y0RKb6KpK6)!
job-boards.greenhouse.io
0
4
28
(6/7) Implementing CompleteP is very simple, only requiring two lines of code. We provide a minimal implementation here: https://t.co/hgFMJBATH8
4
2
31
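The linked gist is the official minimal implementation; the sketch below only illustrates the depth-dependent residual scaling idea the thread refers to, with α = 1 (CompleteP) versus α = 0.5. The module structure is hypothetical, and the full parameterization also prescribes depth-dependent learning-rate and initialization rules described in the paper.

```python
import torch
import torch.nn as nn


class DepthScaledResidualBlock(nn.Module):
    """Toy residual block with an L^{-alpha} branch multiplier.

    alpha = 1.0 corresponds to the CompleteP-style depth scaling discussed in
    the thread; alpha = 0.5 is the weaker scaling it is compared against.
    Illustration only: the official minimal implementation is in the gist
    linked above, and the full parameterization also adjusts optimizer
    hyperparameters with depth.
    """

    def __init__(self, width: int, num_layers: int, alpha: float = 1.0):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(width, 4 * width), nn.GELU(), nn.Linear(4 * width, width)
        )
        self.branch_scale = num_layers ** (-alpha)  # depth-dependent factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual update is down-weighted as the network gets deeper.
        return x + self.branch_scale * self.branch(x)


if __name__ == "__main__":
    x = torch.randn(2, 64)
    block = DepthScaledResidualBlock(width=64, num_layers=48, alpha=1.0)
    print(block(x).shape)  # torch.Size([2, 64])
```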
(5/7) We propose a novel criterion called complete feature learning, which states that as depth increases, a model should not collapse to its linearization. Only CompleteP (α = 1) ensures both stable training and complete feature learning.
1
3
25
(4/7) CompleteP enables a wider range of model width/depth ratios to remain compute-efficient, unlocking shapes better suited for different hardware settings and operational contexts.
1
1
20
(3/7) The deeper the model, the greater the FLOP savings CompleteP (α = 1) delivers over µP. In 1.5B models, CompleteP saves 11.8% of FLOPs at the optimal N:L and 34.4% of FLOPs in the deepest models.
1
1
23
(2/7) CompleteP keeps optimal HPs stable at any depth, unlike popular parameterizations such as SP, μP, and α = 0.5. This dramatically reduces HP tuning budgets for deep models.
1
1
29
(1/7) @CerebrasSystems Paper drop: https://t.co/dCATF7nMCp TLDR: We introduce CompleteP, which offers depth-wise hyperparameter (HP) transfer (Left), FLOP savings when training deep models (Middle), and a larger range of compute-efficient width/depth ratios (Right). 🧵 👇
12
67
409
Published "Neuron-based explanations of neural networks sacrifice completeness and interpretability" in TMLR 2025! TL;DR: The most important principal components provide more complete and interpretable explanations than the most important neurons. https://t.co/T4Gokt5aAk
0
0
3
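A minimal sketch of the comparison the paper makes, on synthetic activations: an explanation built from the top-k principal components versus one built from the top-k individual neurons, scored by how much of the activation signal each reconstruction preserves. The variance-based neuron ranking and the random data are illustrative choices, not the paper's setup.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a (num_examples, num_neurons) matrix of hidden activations.
rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 512))

k = 8  # number of "explanation directions" to keep

# Neuron-based explanation: keep the k individual neurons with the highest
# activation variance (a common importance heuristic, used here only for
# illustration).
top_neurons = np.argsort(activations.var(axis=0))[-k:]
neuron_recon = np.zeros_like(activations)
neuron_recon[:, top_neurons] = activations[:, top_neurons]

# PCA-based explanation: keep the k most important principal components.
pca = PCA(n_components=k).fit(activations)
pc_recon = pca.inverse_transform(pca.transform(activations))


def explained_fraction(recon, full):
    """Completeness proxy: fraction of activation variance the reconstruction keeps."""
    return 1.0 - np.sum((full - recon) ** 2) / np.sum((full - full.mean(0)) ** 2)


print("top-k neurons :", explained_fraction(neuron_recon, activations))
print("top-k PCs     :", explained_fraction(pc_recon, activations))
```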
🎉We're excited to announce our joint work with @Cerebras on a new guide to Maximal Update Parameterization (μP) and μTransfer!🎉 This practitioner's guide (and implementation) aims to make μP more accessible and easier to implement for the broader training community. 🧵
2
28
183
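A tiny sketch of the μTransfer workflow the guide covers, under the standard μP rule for Adam that hidden-weight learning rates scale as base_width / width relative to a small proxy model. Embedding, output-layer, and initialization rules are part of the full recipe in the guide and are omitted here; the numbers are hypothetical.

```python
# Sketch of one muP/muTransfer rule for Adam: tune the learning rate on a
# small proxy model of width `base_width`, then transfer it to a larger model
# by scaling the LR of hidden (matrix-like) weights by base_width / width.
# Embedding, bias, and output-layer rules (plus init scaling) are part of the
# full recipe in the guide and are not shown here.

def hidden_lr_for_width(base_lr: float, base_width: int, width: int) -> float:
    """Adam LR for hidden weights under muP-style width scaling."""
    return base_lr * base_width / width


if __name__ == "__main__":
    base_lr = 6e-3  # tuned on a width-256 proxy (hypothetical numbers)
    for width in (256, 1024, 4096):
        print(width, hidden_lr_for_width(base_lr, base_width=256, width=width))
```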
(1/n) Paper drop: https://t.co/fcr3Jr2ckD TLDR: We introduce the sparse maximal update parameterization (SμPar), which ensures optimal HPs remain the same for any width or sparsity level. This dramatically reduces HP tuning costs, allowing SμPar to achieve superior losses. 🧵 👇
4
36
177
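A heavily simplified illustration of the SμPar idea, under the assumption that a sparse layer with density d and fan-in n behaves like a dense layer with effective fan-in d·n, so μP-style corrections are applied with that effective fan-in. The exact SμPar rules for each weight type are in the paper; the functions and numbers below are placeholders.

```python
# Illustration only: assumes a sparse layer with density d and fan_in n acts
# like a dense layer with effective fan-in d * n, so muP-style corrections use
# that effective fan-in. The exact SmuPar rules are in the paper; names and
# numbers here are hypothetical.

def effective_fan_in(fan_in: int, density: float) -> float:
    return density * fan_in


def hidden_adam_lr(base_lr: float, base_fan_in: int, fan_in: int,
                   density: float = 1.0) -> float:
    """Adam LR for a (possibly sparse) hidden layer, relative to a dense base."""
    return base_lr * base_fan_in / effective_fan_in(fan_in, density)


def hidden_init_std(fan_in: int, density: float = 1.0) -> float:
    """Init std that keeps activation scale roughly constant under sparsity."""
    return (1.0 / effective_fan_in(fan_in, density)) ** 0.5


if __name__ == "__main__":
    for density in (1.0, 0.5, 0.125):
        print(density,
              hidden_adam_lr(6e-3, base_fan_in=256, fan_in=1024, density=density),
              hidden_init_std(1024, density))
```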
Successfully ported @karpathy's nanoGPT to the new @Apple MLX framework, possibly enabling quick prototyping of GPT-style model training on Mac GPUs. Check out the project: https://t.co/D6YJuwgaT2. Got a new M3 Pro and wanted to learn about MLX over the holidays lol.
github.com
Port of Andrej Karpathy's nanoGPT to Apple MLX framework. - vithursant/nanoGPT_mlx
3
16
94
📣 Paper drop: Position Interpolation Improves ALiBi Extrapolation. We found a simple method to 2x the context length of models that use ALiBi. This lets models like BTLM-3B-8K and MPT-7B-8K run high-quality inference at up to 16K context with no additional fine-tuning. 👇
2
16
75
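A sketch of the general mechanism, not necessarily the paper's exact formulation: ALiBi biases attention logits by a per-head slope times the query-key distance, and position interpolation rescales those distances by train_length / sequence_length so that a longer context maps back into the distance range seen during training. Shapes and slope handling below are illustrative.

```python
import numpy as np


def alibi_bias(seq_len, n_heads, train_len=None):
    """Causal ALiBi bias, optionally with position-interpolated distances.

    Sketch only: bias[h, i, j] = -slope_h * (i - j) for j <= i. With
    interpolation, distances are scaled by train_len / seq_len so the largest
    distance matches what the model saw during training. Whether this matches
    the paper's exact formulation is an assumption.
    """
    # Standard ALiBi slopes for a power-of-two head count.
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])

    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]       # (i - j), negative above the diagonal
    dist = np.tril(dist).astype(np.float64)  # causal: only j <= i contributes

    if train_len is not None and seq_len > train_len:
        dist = dist * (train_len / seq_len)  # position interpolation

    return -slopes[:, None, None] * dist     # (n_heads, seq_len, seq_len)


if __name__ == "__main__":
    bias = alibi_bias(seq_len=16, n_heads=8, train_len=8)
    print(bias.shape, bias.min())
```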
We just dropped the BTLM-3B-8K paper on arXiv! It distills our recipe for training SOTA LLMs: - Extensively deduplicated dataset (SlimPajama) - Hyperparameter search using muP - Variable sequence length training + ALiBi - Aggressive LR decay https://t.co/cc4ANJwTrZ
2
27
112
Cerebras BTLM-3B-8K model crosses 1M downloads🤯 It's the #1 ranked 3B language model on @huggingface! A big thanks to all the devs out there building on top of open source models 🙌
4
52
196
📣 New dataset drop! Introducing SlimPajama-627B: the largest extensively deduplicated, multi-corpora, open-source dataset for training large language models. 🧵 https://t.co/bwsSz4d9hs
13
181
658
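SlimPajama ships with its own preprocessing code; the snippet below is only a generic illustration of MinHash-LSH near-duplicate detection of the sort used for large-scale deduplication, written with the datasketch library. The n-gram size, threshold, and preprocessing are placeholders rather than the actual SlimPajama settings.

```python
from datasketch import MinHash, MinHashLSH


def minhash(text, num_perm=128, ngram=13):
    """MinHash over lowercased character n-grams (placeholder preprocessing)."""
    m = MinHash(num_perm=num_perm)
    text = text.lower()
    for i in range(max(1, len(text) - ngram + 1)):
        m.update(text[i:i + ngram].encode("utf-8"))
    return m


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog!",  # near-duplicate of "a"
    "c": "large language models are trained on web text",
}

# LSH index at a placeholder Jaccard threshold; a document is dropped if any
# already-kept document is flagged as a near-duplicate.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = []
for key, text in docs.items():
    m = minhash(text)
    if lsh.query(m):  # an already-kept doc is a near-duplicate
        continue
    lsh.insert(key, m)
    kept.append(key)

print("kept:", kept)  # typically ['a', 'c'] (MinHash is approximate)
```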
🚨 New podcast: how we made Cerebras-GPT with @DeyNolan and @QuentinAnthon15. A deep look at what it's like to train on Cerebras and the tradeoffs between compute-optimal and inference-optimal training. https://t.co/unHgSK8m2s
0
5
18