
Nolan Dey (@DeyNolan)
Research Scientist @ Cerebras Systems · Toronto · Joined March 2022
438 Followers · 22 Following · 14 Media · 32 Statuses
RT @ShaneBergsma: Power Lines paper now out: TL;DR - we identify how AdamW's weight decay should scale with batch….
arxiv.org
Efficient LLM pre-training requires well-tuned hyperparameters (HPs), including learning rate η and weight decay λ. We study scaling laws for HPs: formulas for how to scale HPs as we...
0 replies · 19 reposts · 0 likes
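The thread is cut off above, but the abstract frames the result as power-law formulas for scaling hyperparameters. Purely as an illustration of what such a rule looks like in code (the reference values and the exponent below are placeholders, not the paper's findings):

```python
def scale_weight_decay(batch_size: int,
                       base_batch_size: int = 256,
                       base_weight_decay: float = 0.1,
                       exponent: float = 1.0) -> float:
    """Generic power-law scaling rule for weight decay as batch size changes.

    Placeholder constants and exponent -- illustrative only, not the
    prescription from the Power Lines paper.
    """
    return base_weight_decay * (batch_size / base_batch_size) ** exponent

# Example: under a linear rule (exponent=1.0), doubling the batch doubles lambda.
print(scale_weight_decay(512))  # 0.2
```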
(7/7) If you are looking to conduct research into deep models, contact us to collaborate! We are also hiring research scientists and research engineers!
job-boards.greenhouse.io
0 replies · 4 reposts · 28 likes
(1/7) @CerebrasSystems Paper drop! TL;DR: We introduce CompleteP, which offers depth-wise hyperparameter (HP) transfer (Left), FLOP savings when training deep models (Middle), and a larger range of compute-efficient width/depth ratios (Right). 🧵 👇
12 replies · 67 reposts · 408 likes
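The three figure panels referenced in the tweet are not reproduced here. As a rough illustration of what depth-wise HP transfer is aiming at (not CompleteP's actual parameterization, which the paper specifies), one common depth-scaling device is to shrink each residual branch with the total layer count so that hyperparameters tuned at small depth remain reasonable as layers are added:

```python
import torch
import torch.nn as nn

class DepthScaledBlock(nn.Module):
    """Residual block whose branch output is scaled down with total depth.

    Illustrative sketch of generic depth-wise scaling (a 1/n_layers branch
    multiplier); CompleteP's actual rules for initialization, learning rate,
    and multipliers are given in the paper.
    """
    def __init__(self, width: int, n_layers: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(width, width), nn.GELU(), nn.Linear(width, width))
        self.branch_scale = 1.0 / n_layers           # weaker branch contribution as depth grows

    def forward(self, x):
        return x + self.branch_scale * self.ff(x)    # identity path left untouched

# The intent: a learning rate tuned on a shallow proxy stays near-optimal when
# n_layers is increased, because each block's update to the residual stream
# shrinks in proportion.
model = nn.Sequential(*[DepthScaledBlock(width=128, n_layers=24) for _ in range(24)])
print(model(torch.randn(4, 128)).shape)  # torch.Size([4, 128])
```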
RT @AiEleuther: 🎉We're excited to announce our joint work with @Cerebras on a new guide to Maximal Update Parameterization (μP) and μTransf….
0 replies · 28 reposts · 0 likes
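The retweet is truncated, but the mechanics of μTransfer are well documented: hyperparameters are tuned on a narrow proxy model and carried to a wide model by rescaling per-layer learning rates, initializations, and output multipliers with the width ratio. A minimal sketch of the commonly cited Adam-style rules (the constants are invented; the guide and the mup library give the complete per-layer prescription):

```python
# Sketch of muTransfer-style rescaling for hidden (matrix-like) layers under Adam.
# Base values below are made up; only the scaling relations matter here.

base_width = 256        # width at which HPs were tuned on the proxy model
target_width = 2048     # width of the model we actually want to train
m = target_width / base_width   # width multiplier

base_lr = 3e-3          # learning rate found on the proxy
base_init_std = 0.02    # init std found on the proxy

hidden_lr = base_lr / m                     # Adam LR for hidden weights shrinks ~1/m
hidden_init_std = base_init_std / m ** 0.5  # init std shrinks ~1/sqrt(m) (variance ~1/fan_in)
output_multiplier = 1.0 / m                 # output logits get damped by the width multiplier

print(hidden_lr, hidden_init_std, output_multiplier)
```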
RT @davisblalock: So, uh, it turns out that 30+ years of neural net sparsity research have been confounded by optimal hyperparameters varyi….
0 replies · 9 reposts · 0 likes
RT @CerebrasSystems: (1/n) Paper drop: TLDR: We introduce the sparse maximal update parameterization (SμPar), whic….
0 replies · 36 reposts · 0 likes
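This retweet is also cut off mid-sentence. By analogy with μP, one natural way to make hyperparameters robust to sparsity is to reason about a sparse layer's effective fan-in (density × fan_in) rather than its nominal width; the sketch below shows that generic idea, which may or may not match SμPar's exact parameterization:

```python
import math

def sparse_layer_hps(fan_in: int, density: float, dense_lr: float) -> tuple[float, float]:
    """Rescale init std and Adam LR using a sparse layer's effective fan-in.

    Illustrative only: a muP-style correction where density * fan_in plays the
    role of width. The precise SμPar rules are defined in the paper.
    """
    eff_fan_in = max(1.0, density * fan_in)   # weights that actually feed each unit
    init_std = 1.0 / math.sqrt(eff_fan_in)    # keep pre-activation variance roughly O(1)
    lr = dense_lr / density                   # larger steps compensate for fewer weights
    return init_std, lr

# Example: a 75%-sparse layer (density 0.25) with nominal fan_in 1024.
print(sparse_layer_hps(fan_in=1024, density=0.25, dense_lr=1e-3))
```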
RT @vithursant19: Successfully ported @karpathy's nanoGPT to the new @Apple MLX framework, possibly enabling quick prototyping of training….
github.com
Port of Andrej Karpathy's nanoGPT to Apple MLX framework. - vithursant/nanoGPT_mlx
0 replies · 16 reposts · 0 likes
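For a sense of what quick training prototyping looks like in MLX: the sketch below is a generic MLX training step on a toy model, not code from the linked repo (the model, shapes, and hyperparameters are invented).

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

class TinyMLP(nn.Module):
    """Stand-in model; the linked repo defines a full GPT instead."""
    def __init__(self, dim: int, vocab: int):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.out = nn.Linear(dim, vocab)

    def __call__(self, tokens):
        return self.out(self.embed(tokens))

def loss_fn(model, tokens, targets):
    return nn.losses.cross_entropy(model(tokens), targets, reduction="mean")

model = TinyMLP(dim=64, vocab=256)
optimizer = optim.Adam(learning_rate=3e-4)
step = nn.value_and_grad(model, loss_fn)       # loss + grads w.r.t. model parameters

tokens = mx.random.randint(0, 256, (8, 16))    # toy batch of token ids
targets = mx.random.randint(0, 256, (8, 16))

loss, grads = step(model, tokens, targets)
optimizer.update(model, grads)                 # in-place parameter update
mx.eval(model.parameters(), optimizer.state)   # force MLX's lazy computation to run
print(loss)
```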
RT @CerebrasSystems: 📣 Paper drop: Position Interpolation Improves ALiBi Extrapolation. We found a simple method to 2x the context lengt….
0 replies · 16 reposts · 0 likes
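The tweet is truncated, but the ingredients are public: ALiBi adds a per-head linear bias of -slope × distance to attention scores, and position interpolation (originally proposed for RoPE) rescales positions by train_length / eval_length so they stay in the range seen during training. The sketch below combines the two in the obvious way as an illustration; the paper's exact recipe may differ.

```python
import torch

def alibi_bias(seq_len: int, n_heads: int, train_len: int | None = None) -> torch.Tensor:
    """Causal ALiBi attention bias, optionally with interpolated positions.

    If train_len is set and seq_len exceeds it, relative distances are scaled
    by train_len / seq_len. Illustrative combination of the two ideas only.
    """
    # Standard ALiBi slopes: a geometric sequence per head, 2^(-8h/n_heads).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    dist = (pos[:, None] - pos[None, :]).float()   # dist[i, j] = i - j (query i, key j)
    if train_len is not None and seq_len > train_len:
        dist = dist * (train_len / seq_len)        # position interpolation: shrink distances
    bias = -slopes[:, None, None] * dist           # (n_heads, seq_len, seq_len)
    return bias                                    # add to scores; apply the causal mask separately

# 2x context: biases for 4096 tokens while keeping distances in the 0-2048 range.
print(alibi_bias(seq_len=4096, n_heads=8, train_len=2048).shape)
```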
RT @CerebrasSystems: We just dropped the BTLM-3B-8K paper on arXiv! It distills our recipe for training SOTA LLMs: - Extensively deduplicat….
0 replies · 27 reposts · 0 likes
RT @CerebrasSystems: Cerebras BTLM-3B-8K model crosses 1M downloads 🤯 It's the #1 ranked 3B language model on @huggingface! A big thanks to….
0 replies · 52 reposts · 0 likes
RT @CerebrasSystems: 📣 New dataset drop! Introducing SlimPajama-627B: the largest extensively deduplicated, multi-corpora, open-source data….
0 replies · 183 reposts · 0 likes
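SlimPajama-627B is distributed through the Hugging Face Hub, so the quickest way to poke at it is to stream it rather than download the full corpus. A minimal sketch with the datasets library, assuming the published dataset id cerebras/SlimPajama-627B and its "text" field:

```python
from datasets import load_dataset

# Stream records on demand instead of downloading the full corpus locally.
ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

for i, example in enumerate(ds):
    print(example["text"][:200])   # raw text; records also carry source metadata
    if i == 2:
        break
```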
RT @CerebrasSystems: 🚨 New podcast: how we made Cerebras-GPT with @DeyNolan and @QuentinAnthon15. A deep look at what it's like to train….
0 replies · 5 reposts · 0 likes
You can fine-tune or pre-train on a Cerebras Wafer-Scale Cluster today using Cerebras AI Model Studio. Very proud of the @CerebrasSystems team for this work. Happy to answer any questions here!
1 reply · 3 reposts · 9 likes
By using weight-streaming, Cerebras systems can scale from 1B to 1T parameters without any alteration in the model or training workflow. Cerebras-GPT provides the first demonstration of this type of scaling.
cerebras.net
In a single keystroke, Cerebras can scale large language models from a single CS-2 system to 192 CS-2s in a Cerebras Wafer-Scale Cluster.
1 reply · 1 repost · 10 likes
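For readers unfamiliar with the term: weight streaming keeps activations on the wafer and streams each layer's weights in from an external memory service, layer by layer, so model size is bounded by external storage rather than on-chip memory. The toy sketch below illustrates just that execution pattern; it is not Cerebras' software stack, and WeightStore is an invented stand-in for the external weight memory.

```python
import numpy as np

class WeightStore:
    """Toy stand-in for external weight memory held off the device."""
    def __init__(self, layer_weights):
        self._layer_weights = layer_weights      # name -> np.ndarray

    def fetch(self, name):
        return self._layer_weights[name]         # stream one layer's weights on demand

def weight_streaming_forward(x, layer_names, store):
    """Forward pass that holds only one layer's weights at a time.

    Conceptual illustration: the device only ever needs room for activations
    plus a single layer, so the same loop covers small and very large models.
    """
    for name in layer_names:
        w = store.fetch(name)                    # stream in this layer's weights
        x = np.maximum(x @ w, 0.0)               # compute (toy linear + ReLU layer)
        del w                                    # release before fetching the next layer
    return x

# Toy usage: three 64x64 layers streamed one at a time.
store = WeightStore({f"layer{i}": np.random.randn(64, 64) * 0.1 for i in range(3)})
out = weight_streaming_forward(np.random.randn(8, 64), [f"layer{i}" for i in range(3)], store)
print(out.shape)  # (8, 64)
```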