
Katie Everett
@_katieeverett
Machine learning researcher @GoogleDeepMind + PhD student @MIT. Opinions are my own.
Joined August 2013
@damien_ferbach @cypaquette @poseypaquet @gauthier_gidel If you enjoyed this thread, see Part 2 here:
There were so many great replies to this thread, let's do a Part 2! For scaling laws between loss and compute, where loss = a * flops^b + c, which factors primarily change the constant (a), and which factors can actually change the exponent (b)?
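To make the question concrete, here is a minimal sketch (hypothetical numbers, not from any specific paper, assuming scipy is available) of fitting that functional form: the constant a scales the reducible part of the loss, while the exponent b sets the slope on a log-log plot.

```python
# Minimal sketch with made-up numbers: fit loss = a * flops**b + c to synthetic data.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(flops, a, b, c):
    # b < 0: loss falls with compute toward the irreducible floor c.
    return a * flops ** b + c

# Synthetic "measurements" generated from a known law, plus a little noise.
rng = np.random.default_rng(0)
flops = np.logspace(18, 24, 40)
loss = scaling_law(flops, a=2.6, b=-0.05, c=1.7) * (1 + 0.002 * rng.standard_normal(flops.size))

popt, _ = curve_fit(scaling_law, flops, loss, p0=(2.0, -0.06, 1.5), maxfev=20000)
print("fitted: a=%.2f  b=%.3f  c=%.2f" % tuple(popt))
```

Halving a halves the reducible loss at every compute budget; making b more negative steepens the curve, so the gains compound as compute grows.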
RT @damien_ferbach: It's very difficult to improve the *exponent* in scaling laws for loss vs compute, especially by changing the optimizer…
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer @damien_ferbach @cypaquette @poseypaquet @gauthier_gidel More arXiv refs:
Pagnoni et al 2024: 2412.09871
Snell et al 2024: 2408.03314
Brown & Juravsky et al 2024: 2407.21787
Schaeffer et al 2025: 2502.17578
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer @damien_ferbach @cypaquette @poseypaquet @gauthier_gidel arXiv refs:
Sharma & Kaplan 2020: 2004.10802
Henighan et al 2020: 2010.14701
Bansal et al 2022: 2202.01994
Wang et al 2024: 2410.05661
Shukor et al 2025: 2504.07951
Krajewski et al 2024: 2402.07871
Bordelon & Atanasov et al 2024: 2409.17858
Liu et al 2025: 2502.16982
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer Thanks again for all the great responses to the first thread, and I hope you are following @damien_ferbach! 😁 cc @cypaquette @poseypaquet @gauthier_gidel
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer In summary:
* Data affects the exponent!
* Feature learning can improve the exponent over kernel regimes in some cases
* So far we see similar exponents for Muon vs Adam and MoE vs dense models
* Scaling inference-time compute opens new scaling dimensions that seem promising!
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer Schaeffer et al. 2025 scales the number of attempts per task. They show the distribution over single-attempt success rates predicts the power law exponent for success vs number of attempts. Task difficulty affects the scaling exponent again!
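A rough way to see that mechanism (my own toy numbers, not the paper's setup): draw per-task single-attempt success rates from a Beta distribution and watch how the unsolved fraction decays with the number of attempts. The shape of the distribution near zero sets the apparent power-law exponent.

```python
# Toy simulation: per-task success rates p_i ~ Beta(alpha, 5). The fraction of tasks
# still unsolved after k independent attempts is mean_i (1 - p_i)^k, which decays
# roughly like k^(-alpha), so the left tail of the success-rate distribution
# (how many near-impossible tasks there are) sets the exponent.
import numpy as np

rng = np.random.default_rng(0)
ks = np.logspace(1, 4, 12)  # 10 to 10,000 attempts

for alpha in (0.2, 0.5, 1.0):
    p = rng.beta(alpha, 5.0, size=50_000)                     # single-attempt success rates
    unsolved = np.array([np.mean((1 - p) ** k) for k in ks])  # fraction not yet solved
    slope, _ = np.polyfit(np.log(ks), np.log(unsolved), 1)
    print(f"alpha={alpha:.1f}  fitted exponent ~ {slope:.2f}")  # roughly -alpha
```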
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer Snell et al 2024 asks how we should trade off pretraining vs test-time compute: the answer depends on the *difficulty of the task* as well as the ratio of training to test-time compute load. Easier tasks favor more test-time compute, but harder tasks favor more pretraining.
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer Brown & Juravsky et al 2024 shows inference-time scaling laws using repeated sampling, where log(coverage) = a * (num samples)^b. Coverage is the fraction of problems solved by any generated sample. The exponent b on the number of samples depends on both the task and the model.
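For concreteness, fitting that form reduces to a linear regression after two logs. A sketch with made-up coverage numbers (not the paper's data), assuming the form quoted above:

```python
# Fit log(coverage) = a * k^b by regressing log(-log(coverage)) on log(k).
# Coverage numbers below are made up for illustration.
import numpy as np

k = np.array([1, 4, 16, 64, 256, 1024])                    # samples per problem
coverage = np.array([0.15, 0.33, 0.52, 0.68, 0.80, 0.88])  # fraction of problems solved

b, log_neg_a = np.polyfit(np.log(k), np.log(-np.log(coverage)), 1)
a = -np.exp(log_neg_a)
print(f"fitted a={a:.2f}, b={b:.2f}")  # b is the exponent that varies by task and model
```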
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 But if we ask "what's the best bits-per-byte given X FLOPs to train *and* Y FLOPs per byte at inference?", then the Byte Latent Transformer has an advantage: it can scale the patch size to deploy larger models for the same inference cost. h/t @ted_engineer
@_katieeverett Meta claims their raw-bytes architecture BLT has better scaling laws than the classical BPE + transformer recipe
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 Pagnoni et al 2024: Byte Latent Transformer groups bytes dynamically into patches instead of tokenizing w/ fixed vocab. For hard predictions, use smaller patches = more inference compute. Exponents look similar when asking "what's the best bits-per-byte given X FLOPs to train?"
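A toy version of the dynamic patching idea described above (my own sketch, not the BLT implementation): given a per-byte difficulty score such as the next-byte entropy of a small byte-level model, start a new patch at every hard byte, so difficult spans end up as many small patches and easy spans as a few large ones.

```python
# Toy entropy-threshold patching (illustrative only, not the BLT code).
def patch_boundaries(entropies, threshold=2.0):
    """Group byte positions into patches; a new patch starts at each high-entropy byte."""
    patches, current = [], [0]
    for i, h in enumerate(entropies[1:], start=1):
        if h > threshold:        # hard-to-predict byte: start a new patch here
            patches.append(current)
            current = []
        current.append(i)
    patches.append(current)
    return patches

# Positions 3-5 have high entropy, so each of them starts a new (small) patch.
print(patch_boundaries([0.1, 0.3, 0.2, 3.1, 2.8, 2.5, 0.4, 0.2]))
# -> [[0, 1, 2], [3], [4], [5, 6, 7]]
```

Smaller patches mean the large latent model runs more often per byte, which is the "hard predictions get more inference compute" tradeoff described in the tweet above.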
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 Finally, there is a lot of interesting recent work using variable amounts of inference-time compute! It's more nuanced to define what "improving the scaling exponent" means here, but let's look at some promising ideas:
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov Liu et al 2025 scales up Muon and finds similar scaling exponents between Muon (-0.052) and AdamW (-0.054). h/t @SeunghyunSEO7.
even muon shows similar exponent: 2.506 * C^-0.052 (Muon) vs 2.608 * C^-0.054 (AdamW) (source: Moonlight)
@chopwatercarry @EIFY @MustafaShukor1 Now onto optimizers: Bordelon & Atanasov et al 2024 solve a theoretical model showing feature learning can improve the scaling exponent, specifically for tasks where the target function lies outside the RKHS of the initial kernel. h/t @ABAtanasov
@_katieeverett This is a beautiful and thorough review. I will say that although it’s hard to make the power law better, you can definitely make it *worse* by putting the network into the lazy/NTK regime. At least in that setting, though, we can see how different datasets give different power laws.
@chopwatercarry @EIFY @MustafaShukor1 Krajewski et al 2024 proposes compute-optimal scaling of tokens, params, and granularity for MoEs, where granularity is a hyperparameter controlling the size of the experts. It looks like granularity might also affect the constant but not the exponent?
@chopwatercarry @EIFY Shukor et al 2025 compares MoE and Dense Transformers on multimodal models. They see similar exponents between MoE and Dense models. They again find that changing the data mixture affects the exponent. h/t @MustafaShukor1.
@_katieeverett Interesting! We had similar findings for multimodal models. The architecture has more effect on shifting the scaling law curve (MoEs vs Dense) than on the exponent. On the other hand, changing the data mixture (same architecture) can affect both.
@chopwatercarry For MoEs vs Dense Transformers: Wang et al 2024 compares MoE and Dense Transformer language models, and shows a very similar exponent for both architectures. h/t @EIFY
@chopwatercarry Bansal et al 2022 compares the data scaling exponents on translation tasks and finds the same exponent across different architectures, data filtering techniques, and synthetic i.i.d. noise. Adding non-i.i.d. noise via data augmentation (back-translation) does change the exponent.
@chopwatercarry Henighan et al 2020 shows empirically that different modalities (language, image, video, math) have different exponents. Same for different image resolutions (8x8 vs 16x16 etc). (They also find that the exponent for optimal model size vs compute is universal across modalities.)
On data: Sharma & Kaplan 2020 proposes a theoretical model where the data distribution and task induce the data manifold dimensionality, which in turn induces the scaling exponent. This can explain why different modalities have different exponents. h/t @chopwatercarry
@_katieeverett For model size, Sharma and Kaplan proposed that the data distribution and task induce a manifold whose intrinsic dimensionality determines the exponent for the scaling of test loss with model size.