 
            
Shane Bergsma
@ShaneBergsma
Followers: 302 · Following: 282 · Media: 9 · Statuses: 143
Man bites data
Toronto, Ontario
Joined February 2012
            
           Another new preprint from @CerebrasSystems 🚨📄- this time on training *re-evaluation* curves (TRECs) for data curriculums in LLMs. Everyone sticks high-quality data at the end of training… we show the sweet spot is often earlier — and we can predict it.  https://t.co/C8X1C3hUWL 
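For concreteness, here is a minimal sketch of what "training re-evaluation" could look like, assuming (my reading of the tweet, not a statement of the paper's method) that a TREC is built by re-scoring each training batch with the final checkpoint; every name in the snippet is a hypothetical placeholder.

```python
# Hypothetical sketch (my reading of the tweet, not the paper's protocol):
# score every training batch with the *final* checkpoint, in the order the
# batches were seen, and look for the valley of the resulting curve.
# `final_model`, `batches_in_training_order`, and `loss_fn` are placeholders.
import torch

@torch.no_grad()
def training_reevaluation_curve(final_model, batches_in_training_order, loss_fn):
    final_model.eval()
    curve = []
    for step, (inputs, targets) in enumerate(batches_in_training_order):
        loss = loss_fn(final_model(inputs), targets)
        curve.append((step, float(loss)))
    return curve

def valley_step(curve):
    # The step whose data the final model fits best -- per the tweet, a
    # candidate position for high-quality data, often *before* the end.
    return min(curve, key=lambda pair: pair[1])[0]
```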
          
          
                
             The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work? We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵 
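A hedged sketch of one common way to get LR-independent weight decay from a stock AdamW; this is an illustration under my own assumptions, not necessarily the recipe in the paper. PyTorch's AdamW applies decay as p ← p − lr·weight_decay·p, so dividing a desired independent decay λ by the learning rate decouples the two (for a constant lr).

```python
# Sketch: "independent weight decay" (IWD) via torch.optim.AdamW.
# torch's AdamW decays as  p <- p - lr * weight_decay * p,  i.e. decay strength
# is coupled to lr.  Passing weight_decay = lambda_indep / lr makes the
# per-step shrinkage equal to lambda_indep regardless of the (muP-scaled) lr.
# Illustrative assumption only, not necessarily the paper's recipe.
import torch

def adamw_with_independent_wd(params, lr, lambda_indep=1e-4, betas=(0.9, 0.95)):
    # Caveat: with an lr schedule, weight_decay would need re-scaling whenever
    # lr changes for the decay to stay truly lr-independent.
    return torch.optim.AdamW(params, lr=lr, betas=betas,
                             weight_decay=lambda_indep / lr)
```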
          
                
Fig 1:
• Valley in TREC ≠ train-loss minimum → best spot for HQ data
• Shape tracks AdamW τ (via weight decay)
• Curves align across 1000× scaling at fixed τ (and TPP)
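For readers tracking the τ that keeps appearing in these threads: a sketch of the usual AdamW weight-decay timescale, in notation I am assuming rather than quoting (η = learning rate, λ = weight decay, B = batch size in tokens, D = training tokens, T = D/B optimizer steps).

```latex
% AdamW's decoupled decay multiplies the weights by (1 - \eta\lambda) each
% step, so they behave like an EMA of recent updates with timescale
\tau_{\mathrm{iter}} \;=\; \frac{1}{\eta\lambda} \ \text{steps},
\qquad
\tau \;=\; \frac{\tau_{\mathrm{iter}}}{T} \;=\; \frac{1}{\eta\lambda T} \;=\; \frac{B}{\eta\lambda D}.
```

Under this reading, fixing τ means the optimizer's effective memory spans the same fraction of the run at every scale.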
          
                
Beautiful work on pretraining science using scaling collapse to precisely predict, debug, and tune LLM training from small-scale and partial runs. So many insights on going beyond μP!
@ShikaiQiu @Locchiu @andrewgwils @xidulu @laurence_ai (4/4) Our Power Lines (NeurIPS 2025) showed τ* is set by TPP. 👉 Fixed TPP + optimal τ ⇒ collapse emerges naturally. With Claire Zhang, @DeyNolan, Shaheer Muhammad, @gurpreetgosal_, and Joel Hestness, we trained Celerity on this recipe…it does collapse, and sits on the compute frontier.
          
          
                
@ShikaiQiu @Locchiu @andrewgwils (3/4) Collapse in LLMs needs 3 aligned controls:
• same LR schedule
• same TPP
• same optimizer timescale τ (@xidulu @laurence_ai)
Sweep B, λ, or η → same τ ⇒ same curve. Sweep τ itself → curves peel apart.
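To make the sweep claim concrete under the timescale definition sketched above (still my assumed notation, not the paper's): settings that leave B/(ηλD) unchanged share one τ, while changing τ itself moves the curve.

```python
# Tiny worked example with the assumed definition tau = B / (eta * lam * D).
D = 100e9  # total training tokens (arbitrary illustrative number)

def tau(B, eta, lam, D=D):
    return B / (eta * lam * D)

print(tau(B=1e6, eta=1e-2, lam=1e-1))  # baseline            -> ~0.01
print(tau(B=2e6, eta=2e-2, lam=1e-1))  # doubled B and eta   -> ~0.01 (same curve)
print(tau(B=1e6, eta=2e-2, lam=5e-2))  # traded eta for lam  -> ~0.01 (same curve)
print(tau(B=1e6, eta=1e-2, lam=2e-1))  # swept tau itself    -> ~0.005 (peels apart)
```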
          
          
                
(2/4) Earlier work by @ShikaiQiu @Locchiu @andrewgwils, J. Pennington & A. Agarwala showed collapse at small scale and called for testing it in full LLM-scale ladders. ✅ We did it.
          
                
             (1/4) @CerebrasSystems Hot off the presses 🔥📄  https://t.co/ahPvKCFN9g  If you're spending $1B to train an LLM, you need to know it’s on track—every step of the way. With optimal AdamW τ + fixed TPP, loss curves collapse to a universal path → an early-warning signal for training. 
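A heavily hedged sketch of what "early-warning signal" could mean operationally: compare a partial run's (suitably normalized) loss curve, indexed by fraction of training, against a reference "collapsed" curve from small-scale runs, and flag large deviations. Every name, the normalization, and the threshold below are placeholders of mine, not the paper's tooling.

```python
# Hypothetical early-warning check (placeholder names and threshold).
import numpy as np

def off_track(ref_frac, ref_loss, run_frac, run_loss, rel_tol=0.02):
    """True if the partial run has peeled away from the reference curve."""
    mask = ref_frac <= run_frac[-1]          # only the portion completed so far
    if not np.any(mask):
        return False
    observed = np.interp(ref_frac[mask], run_frac, run_loss)
    rel_gap = np.abs(observed - ref_loss[mask]) / np.abs(ref_loss[mask])
    return bool(np.max(rel_gap) > rel_tol)
```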
          
                
             Power Lines paper now out:  https://t.co/AwAgxyM735  TL;DR - we identify how AdamW's weight decay should scale with batch size, dataset size, and model size in LLM pre-training. We also investigate the scaling of both "optimal" and "critical" batch size. 
          
            
            arxiv.org
              Efficient LLM pre-training requires well-tuned hyperparameters (HPs), including learning rate η and weight decay λ. We study scaling laws for HPs: formulas for how to scale HPs as we...
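One hedged way to read the batch-size/dataset-size part of that claim, using the timescale definition sketched earlier and the thread's statement that the optimal τ* is set by TPP (so it is fixed once TPP is fixed); this is a reconstruction, not a quote of the paper's fitted law.

```latex
% Holding \tau^{*} = \frac{B}{\eta \lambda D} fixed (at fixed TPP) forces
\lambda^{*} \;\propto\; \frac{B}{\eta\,D},
% i.e. optimal weight decay grows with batch size and shrinks with dataset
% size; model size enters indirectly, through how \eta and TPP scale.
```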
            
                
             (1/7) @CerebrasSystems Paper drop:  https://t.co/dCATF7nMCp  TLDR: We introduce CompleteP, which offers depth-wise hyperparameter (HP) transfer (Left), FLOP savings when training deep models (Middle), and a larger range of compute-efficient width/depth ratios (Right).  🧵 👇 
          
                
             It’s #ICLR2025 week, and we’re proud to share that Team Cerebras will be presenting their paper: "Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs" at @iclr_conf! Big congrats to the authors, your work is powering the future of AI compute. 
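For anyone curious what the schedule in that title looks like in code, a minimal sketch of a linear decay to exactly zero; the warmup is my own illustrative addition, not something the title specifies.

```python
# Minimal sketch of a linear-to-zero LR schedule (warmup length is my own
# illustrative choice; the paper's title only commits to the linear decay).
import torch

def linear_to_zero_schedule(optimizer, total_steps, warmup_steps=100):
    def lr_scale(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps                        # linear warmup
        remaining = total_steps - step
        return max(remaining / (total_steps - warmup_steps), 0.0)   # decay to 0
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)
```

Usage follows the usual pattern: build it once, then call scheduler.step() after each optimizer step so the rate hits exactly zero on the final step.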
          
                
             Cerebras has set a new record for AI inference speed, serving Llama 3.1 8B at 1,850 output tokens/s and 70B at 446 output tokens/s. @CerebrasSystems has just launched their API inference offering, powered by their custom wafer-scale AI accelerator chips. Cerebras Inference is 
          
                
My son, after reading half the books: "J.R.R. Tolkien is a man? I had no idea." Thank you, @jk_rowling
          
          
                
             In an effort to foster a more cooperative spirit between different parts of my code, I no longer pass *arguments* to a function. Instead when one function calls another, it passes along some *gentle feedback*. 
          
                
             The whole group? Wow, this migration of academics to industry is getting out of control. 
          
                
             Wikipedia (one of the supreme achievements of humanity) doesn't get enough love, so just let me say, "thank you, Wikipedia." 
          
                