Nolan Dey Profile
Nolan Dey

@DeyNolan

Followers: 438 · Following: 22 · Media: 14 · Statuses: 32

Research Scientist @ Cerebras Systems

Toronto
Joined March 2022
@DeyNolan
Nolan Dey
3 months
(7/7) If you are looking to conduct research into deep models, contact us to collaborate! We are also hiring research scientists and research engineers!
job-boards.greenhouse.io
0 · 4 · 28
@DeyNolan
Nolan Dey
3 months
(6/7) Implementing CompleteP is very simple, requiring only two lines of code. We provide a minimal implementation here:
4 · 2 · 31
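The linked implementation is not preserved in this capture, so below is a minimal hedged sketch of the α = 1 residual-branch scaling the thread highlights, written as an illustrative pre-LN PyTorch block. The class name, `branch_scale`, and the block structure are assumptions, not the authors' code, and the sketch shows only the residual scaling, not the full parameterization.

```python
# Hedged sketch of the alpha = 1 depth scaling described in the thread:
# each residual branch's output is downweighted by 1 / L, where L is the
# number of blocks. Names and block structure are illustrative only.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # alpha = 1 depth scaling: branch multiplier 1 / n_layers.
        self.branch_scale = 1.0 / n_layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-LN residual update with the depth-dependent multiplier.
        return x + self.branch_scale * self.mlp(self.norm(x))

# Example: a 48-block stack sharing the same depth-dependent scale.
blocks = nn.ModuleList(ResidualBlock(768, n_layers=48) for _ in range(48))
```

Intuitively, the 1/L factor keeps the summed contribution of all residual branches bounded as depth grows, which is the intuition behind the depth-wise behavior described in the thread.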
@DeyNolan
Nolan Dey
3 months
(5/7) We propose a novel criterion called complete feature learning, which states that as depth increases, a model should not collapse to its linearization. Only CompleteP (α = 1) ensures both stable training and complete feature learning.
1 · 3 · 25
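For readers unfamiliar with the term, "collapsing to its linearization" refers to the lazy-training regime, where the network stays close to its first-order Taylor expansion around initialization. The sketch below writes out that approximation; the notation is assumed here, not taken from the thread.

```latex
% Lazy-training / linearization regime (notation assumed, not from the thread).
% A model "collapses to its linearization" when, throughout training,
\[
  f(x;\theta_t) \;\approx\; f(x;\theta_0)
  + \nabla_\theta f(x;\theta_0)^{\top}\,(\theta_t - \theta_0),
\]
% i.e. the features defined by \nabla_\theta f(x;\theta_0) stop evolving.
% Complete feature learning, as described above, requires that this
% approximation does not become exact as depth grows.
```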
@DeyNolan
Nolan Dey
3 months
(4/7) CompleteP enables a wider range of model width/depth ratios to remain compute-efficient, unlocking shapes better suited for different hardware settings and operational contexts.
1 · 1 · 20
@DeyNolan
Nolan Dey
3 months
(3/7) The deeper the model, the more FLOP savings CompleteP (α = 1) has over µP. In 1.5B models, CompleteP saves 11.8% of FLOPs at the optimal N:L ratio and 34.4% of FLOPs in the deepest models.
1 · 1 · 23
@DeyNolan
Nolan Dey
3 months
(2/7) CompleteP enables optimal HPs to remain stable at any depth, unlike popular parameterizations such as SP, μP, and α = 0.5. This dramatically reduces HP tuning budgets for deep models.
1 · 1 · 29
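To make the claim concrete, here is a hedged illustration of what depth-wise HP transfer buys in practice: sweep the learning rate on a shallow proxy, then reuse the winning value unchanged at the target depth. The toy loss curve below is a synthetic stand-in so the snippet runs; it is not the authors' tuning protocol, models, or data.

```python
# Hedged illustration of depth-wise HP transfer: tune once at small depth,
# reuse the same base LR at large depth. Everything here is synthetic.
import math

def toy_loss(depth: int, lr: float) -> float:
    # Synthetic bowl-shaped LR curve whose optimum does not move with depth,
    # mimicking what successful depth-wise transfer looks like.
    return (math.log10(lr) + 3.0) ** 2 + 1.0 / depth

lr_grid = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2]

# Cheap sweep at a shallow proxy depth...
best_lr = min(lr_grid, key=lambda lr: toy_loss(depth=8, lr=lr))

# ...then reuse the same base LR unchanged at the target depth.
print(best_lr, toy_loss(depth=64, lr=best_lr))
```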
@DeyNolan
Nolan Dey
3 months
(1/7) @CerebrasSystems Paper drop! TL;DR: We introduce CompleteP, which offers depth-wise hyperparameter (HP) transfer (Left), FLOP savings when training deep models (Middle), and a larger range of compute-efficient width/depth ratios (Right). 🧵 👇
12 · 67 · 408
@DeyNolan
Nolan Dey
4 months
Published "Neuron-based explanations of neural networks sacrifice completeness and interpretability" in TMLR 2025!. TL;DR: The most important principal components provide more complete and interpretable explanations than the most important neurons.
0 · 0 · 3
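As a rough illustration of the completeness argument, the sketch below compares reconstructing a layer's activations from its k highest-variance neurons versus its top-k principal components. The data is synthetic, and the value of k and the variance-based selection rule are illustrative assumptions, not the paper's experiments.

```python
# Hedged sketch: top-k principal components vs. k individual neurons as
# explanations of a layer's activations. All data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
acts = rng.standard_normal((1000, 512))      # (examples, hidden units)
centered = acts - acts.mean(axis=0)
k = 16

# Neuron-based explanation: keep only the k highest-variance units.
neuron_idx = np.argsort(centered.var(axis=0))[-k:]
neuron_recon = np.zeros_like(centered)
neuron_recon[:, neuron_idx] = centered[:, neuron_idx]

# PC-based explanation: rank-k PCA reconstruction of the same activations.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc_recon = centered @ vt[:k].T @ vt[:k]

def frac_unexplained(recon: np.ndarray) -> float:
    # Fraction of activation variance the explanation fails to capture.
    return float(np.square(centered - recon).sum() / np.square(centered).sum())

# Top-k PCs capture at least as much variance as any k individual units,
# which is the "more complete" direction of the tweet's claim.
print(frac_unexplained(neuron_recon), frac_unexplained(pc_recon))
```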
@DeyNolan
Nolan Dey
11 months
RT @AiEleuther: 🎉We're excited to announce our joint work with @Cerebras on a new guide to Maximal Update Parameterization (μP) and μTransf….
0 · 28 · 0
@DeyNolan
Nolan Dey
1 year
RT @davisblalock: So, uh, it turns out that 30+ years of neural net sparsity research have been confounded by optimal hyperparameters varyi….
0 · 9 · 0
@DeyNolan
Nolan Dey
1 year
RT @CerebrasSystems: (1/n) Paper drop: TLDR: We introduce the sparse maximal update parameterization (SμPar), whic….
0 · 36 · 0
@DeyNolan
Nolan Dey
2 years
RT @vithursant19: Successfully ported @karpathy's nanoGPT to the new @Apple MLX framework, possibly enabling quick prototyping of training….
github.com: Port of Andrej Karpathy's nanoGPT to Apple MLX framework (vithursant/nanoGPT_mlx)
0 · 16 · 0
@DeyNolan
Nolan Dey
2 years
RT @CerebrasSystems: 📣 Paper drop: Position Interpolation Improves ALiBi Extrapolation. We found a simple method to 2x the context lengt…
0 · 16 · 0
@DeyNolan
Nolan Dey
2 years
RT @CerebrasSystems: We just dropped the BTLM-3B-8K paper on arXiv! It distills our recipe for training SOTA LLMs: - Extensively deduplicat…
0 · 27 · 0
@DeyNolan
Nolan Dey
2 years
RT @CerebrasSystems: Cerebras BTLM-3B-8K model crosses 1M downloads 🤯 It's the #1 ranked 3B language model on @huggingface! A big thanks to…
0 · 52 · 0
@DeyNolan
Nolan Dey
2 years
RT @CerebrasSystems: 📣 New dataset drop! Introducing SlimPajama-627B: the largest extensively deduplicated, multi-corpora, open-source data…
0 · 183 · 0
@DeyNolan
Nolan Dey
2 years
RT @CerebrasSystems: 🚨 New podcast: how we made Cerebras-GPT with @DeyNolan and @QuentinAnthon15. A deep look on what it's like to train….
0 · 5 · 0
@DeyNolan
Nolan Dey
2 years
You can fine-tune or pre-train on a Cerebras Wafer-Scale Cluster today using the Cerebras AI Model Studio. Very proud of the @CerebrasSystems team for this work. Happy to answer any questions here!
1 · 3 · 9
@DeyNolan
Nolan Dey
2 years
By using weight-streaming, Cerebras systems can scale from 1B to 1T parameters without any alteration in the model or training workflow. Cerebras-GPT provides the first demonstration of this type of scaling.
cerebras.net: In a single keystroke, Cerebras can scale large language models from a single CS-2 system to 192 CS-2s in a Cerebras Wafer-Scale Cluster.
1 · 1 · 10