Katie Everett Profile
Katie Everett

@_katieeverett

Followers: 3K · Following: 97 · Media: 38 · Statuses: 86

Machine learning researcher @GoogleDeepMind + PhD student @MIT. Opinions are my own.

Joined August 2013
@_katieeverett
Katie Everett
2 months
@damien_ferbach @cypaquette @poseypaquet @gauthier_gidel If you enjoyed this thread, see Part 2 here:
@_katieeverett
Katie Everett
2 months
There were so many great replies to this thread, let's do a Part 2! For scaling laws between loss and compute, where loss = a * flops^b + c, which factors primarily change the constant (a), and which factors can actually change the exponent (b)?
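A minimal sketch of what that distinction looks like in practice (my own illustration with made-up numbers, not from the thread): fit loss = a * flops^b + c to synthetic loss curves and see which fitted parameter an intervention moved.

```python
# Hypothetical fit of loss = a * flops^b + c to synthetic curves (all numbers
# made up): a "better constant" shifts the curve, a "better exponent" steepens it.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(flops, a, b, c):
    return a * flops**b + c

flops = np.logspace(18, 24, 20)
curves = {
    "baseline":        scaling_law(flops, 2.0, -0.050, 1.7),
    "better constant": scaling_law(flops, 1.5, -0.050, 1.7),  # curve shifts down
    "better exponent": scaling_law(flops, 2.0, -0.060, 1.7),  # curve steepens
}
for name, loss in curves.items():
    (a, b, c), _ = curve_fit(scaling_law, flops, loss, p0=(1.0, -0.05, 1.0))
    print(f"{name:>15}: a={a:.2f}  b={b:.3f}  c={c:.2f}")
```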
@_katieeverett
Katie Everett
2 months
RT @damien_ferbach: It's very difficult to improve the *exponent* in scaling laws for loss vs compute, especially by changing the optimizer…
@_katieeverett
Katie Everett
2 months
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer @damien_ferbach @cypaquette @poseypaquet @gauthier_gidel More arXiv refs:
Pagnoni et al. 2024: arXiv:2412.09871
Snell et al. 2024: arXiv:2408.03314
Brown & Juravsky et al. 2024: arXiv:2407.21787
Schaeffer et al. 2025: arXiv:2502.17578
@_katieeverett
Katie Everett
2 months
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer @damien_ferbach @cypaquette @poseypaquet @gauthier_gidel arXiv refs:
Sharma & Kaplan 2020: arXiv:2004.10802
Henighan et al. 2020: arXiv:2010.14701
Bansal et al. 2022: arXiv:2202.01994
Wang et al. 2024: arXiv:2410.05661
Shukor et al. 2025: arXiv:2504.07951
Krajewski et al. 2024: arXiv:2402.07871
Bordelon & Atanasov et al. 2024: arXiv:2409.17858
Liu et al. 2025: arXiv:2502.16982
@_katieeverett
Katie Everett
2 months
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer Thanks again for all the great responses to the first thread, and I hope you are following @damien_ferbach! 😁 cc @cypaquette @poseypaquet @gauthier_gidel
@_katieeverett
Katie Everett
2 months
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer In summary:
* Data affects the exponent!
* Feature learning can improve the exponent over kernel regimes in some cases.
* So far we see similar exponents for Muon vs Adam and for MoE vs dense models.
* Scaling inference-time compute opens new scaling dimensions that seem promising!
@_katieeverett
Katie Everett
2 months
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer Schaeffer et al. 2025 scales the number of attempts per task. They show the distribution over single-attempt success rates predicts the power law exponent for success vs number of attempts. Task difficulty affects the scaling exponent again!
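A toy check of that claim (my own sketch; the Beta distribution and all numbers are assumptions, not the paper's data): if per-task single-attempt success rates p follow Beta(alpha, beta), the aggregate failure rate E[(1-p)^k] decays roughly like k^(-alpha), so the left tail of the difficulty distribution sets the exponent.

```python
# Monte Carlo check: for p ~ Beta(alpha, beta), aggregate failure after k
# attempts, E[(1 - p)^k], decays like k^(-alpha) for large k.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.3                                  # tail shape near p = 0
p = rng.beta(alpha, 5.0, size=200_000)       # single-attempt success rates

ks = np.array([1, 4, 16, 64, 256, 1024])
failure = np.array([((1 - p) ** k).mean() for k in ks])

# Successive log-log slopes; they approach -alpha ~= -0.3 as k grows.
print(np.round(np.diff(np.log(failure)) / np.diff(np.log(ks)), 3))
```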
@_katieeverett
Katie Everett
2 months
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer Snell et al. 2024 asks how we should trade off pretraining vs test-time compute: the answer depends on the *difficulty of the task* as well as the ratio of training to test-time compute. Easier tasks favor more test-time compute, but harder tasks favor more pretraining.
@_katieeverett
Katie Everett
2 months
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer Brown & Juravsky et al. 2024 shows inference-time scaling laws for repeated sampling, where log(coverage) = a * num_samples^b. Coverage is the fraction of problems solved by any generated sample. The exponent b on the number of samples depends on both the task and the model.
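A sketch of how one might measure and fit this law (my own illustration, not the paper's code; the Beta-distributed success rates reuse the toy difficulty distribution from the sketch above): compute coverage at several budgets k with the standard unbiased pass@k estimator, then fit log(coverage) = a * k^b.

```python
# Estimate coverage from n samples per problem and fit log(coverage) = a * k^b.
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import comb

def pass_at_k(n, c, k):
    """Unbiased P(at least one of k samples correct), given c of n correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k, exact=False) / comb(n, k, exact=False)

rng = np.random.default_rng(1)
n = 512                                    # samples drawn per problem
p_true = rng.beta(0.3, 5.0, size=2000)     # hypothetical per-problem success rates
c_correct = rng.binomial(n, p_true)        # correct samples observed per problem

ks = np.array([1, 2, 4, 8, 16, 32, 64, 128])
coverage = np.array([np.mean([pass_at_k(n, c, k) for c in c_correct]) for k in ks])

(a, b), _ = curve_fit(lambda k, a, b: a * k**b, ks, np.log(coverage), p0=(-1.0, -0.3))
print(f"fit: log(coverage) ~= {a:.2f} * k^{b:.2f}")
```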
@_katieeverett
Katie Everett
2 months
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 But if we ask "what's the best bits-per-byte given X FLOPs to train *and* Y FLOPs per byte at inference?", then the Byte Latent Transformer has an advantage: it can scale the patch size to deploy larger models for the same inference cost. h/t @ted_engineer
@ted_engineer
Ted - 🥖/acc
2 months
@_katieeverett Meta claims their raw-bytes architecture BLT has better scaling laws than the classical BPE + transformer recipe
@_katieeverett
Katie Everett
2 months
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 Pagnoni et al 2024: Byte Latent Transformer groups bytes dynamically into patches instead of tokenizing w/ fixed vocab. For hard predictions, use smaller patches = more inference compute. Exponents look similar when asking "what's the best bits-per-byte given X FLOPs to train?"
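A toy sketch of that dynamic patching idea (my own simplification; `next_byte_entropy` is a hypothetical stand-in for BLT's small byte-level LM): start a new patch wherever next-byte entropy exceeds a threshold, so hard-to-predict regions get shorter patches and thus more compute.

```python
# Entropy-threshold patching sketch: cut a patch boundary where the
# (hypothetical) next-byte entropy model spikes above `threshold`.
from typing import Callable, List

def entropy_patches(data: bytes,
                    next_byte_entropy: Callable[[bytes, int], float],
                    threshold: float = 2.0) -> List[bytes]:
    patches, start = [], 0
    for i in range(1, len(data)):
        if next_byte_entropy(data, i) > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

# Dummy entropy model: pretend uncertainty spikes right after each space.
patches = entropy_patches(b"the quick brown fox",
                          lambda data, i: 3.0 if data[i - 1] == 0x20 else 1.0)
print(patches)  # [b'the ', b'quick ', b'brown ', b'fox']
```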
@_katieeverett
Katie Everett
2 months
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 Finally, there is lots of interesting recent work using variable amounts of inference-time compute! It's more nuanced to define what "improving the scaling exponent" means here, but let's look at some promising ideas:
@_katieeverett
Katie Everett
2 months
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov Liu et al 2025 scales up Muon and finds similar scaling exponents between Muon (-0.052) and AdamW (-0.054). h/t @SeunghyunSEO7.
@SeunghyunSEO7
Seunghyun Seo
2 months
even Muon shows a similar exponent: 2.506 * C^-0.052 (Muon) vs 2.608 * C^-0.054 (AdamW) (source: Moonlight)
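Plugging the two quoted fits into a few compute budgets (my own arithmetic; the budgets are illustrative) shows what a similar exponent means here: the Muon/AdamW difference is mostly an offset, so the ratio between the two curves drifts only slowly with scale.

```python
# Evaluate the two fitted laws from the quoted tweet at several FLOPs budgets.
import numpy as np

for C in np.logspace(19, 23, 5):           # training compute in FLOPs (illustrative)
    muon = 2.506 * C ** -0.052
    adamw = 2.608 * C ** -0.054
    print(f"C={C:.0e}  muon={muon:.3f}  adamw={adamw:.3f}  ratio={adamw / muon:.3f}")
```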
@_katieeverett
Katie Everett
2 months
@chopwatercarry @EIFY @MustafaShukor1 Now onto optimizers: Bordelon & Atanasov et al. 2024 solve a theoretical model showing feature learning can improve the scaling exponent, specifically for tasks where the target function lies outside the RKHS of the initial kernel. h/t @ABAtanasov
@ABAtanasov
Alex Atanasov
2 months
@_katieeverett This is a beautiful and thorough review. I will say that although it’s hard to make the power law better, you can definitely make it *worse* by putting the network into the lazy/NTK regime. At least in that setting, though, we can see how different datasets give different power laws.
@_katieeverett
Katie Everett
2 months
@chopwatercarry @EIFY @MustafaShukor1 Krajewski et al 2024 proposes compute-optimal scaling of tokens, params, and granularity for MoEs, where granularity is a hyperparameter controlling the size of the experts. It looks like granularity might also affect the constant but not the exponent?
@_katieeverett
Katie Everett
2 months
@chopwatercarry @EIFY Shukor et al. 2025 compares MoE and dense transformers for multimodal models. They see similar exponents between MoE and dense models, and again find that changing the data mixture affects the exponent. h/t @MustafaShukor1
@MustafaShukor1
Mustafa Shukor
2 months
@_katieeverett Interesting! We found similar results for multimodal models. The architecture has more effect on shifting the scaling-law curve (MoE vs dense) than on the exponent. On the other hand, changing the data mixture (same architecture) can affect both.
@_katieeverett
Katie Everett
2 months
@chopwatercarry For MoEs vs dense transformers: Wang et al. 2024 compares MoE and dense transformer language models, and shows a very similar exponent for both architectures. h/t @EIFY
@EIFY
EIFY
2 months
@_katieeverett Apparently MoE vs. dense transformers also seem to differ only in the offset.
@_katieeverett
Katie Everett
2 months
@chopwatercarry Bansal et al. 2022 compares data-scaling exponents on translation tasks and finds the same exponent across different architectures, data-filtering techniques, and synthetic i.i.d. noise. Adding non-i.i.d. noise via data augmentation (back-translation) does change the exponent.
@_katieeverett
Katie Everett
2 months
@chopwatercarry Henighan et al 2020 shows empirically that different modalities (language, image, video, math) have different exponents. Same for different image resolutions (8x8 vs 16x16 etc). (They also find that the exponent for optimal model size vs compute is universal across modalities.)
@_katieeverett
Katie Everett
2 months
On data: Sharma & Kaplan 2020 proposes a theoretical model where the data distribution and task induce the data-manifold dimensionality, which in turn induces the scaling exponent. This can explain why different modalities have different exponents. h/t @chopwatercarry
@chopwatercarry
chopwatercarry
2 months
@_katieeverett For model size, Sharma and Kaplan proposed that the data distribution and task induce a manifold whose intrinsic dimensionality determines the exponent for the scaling of test loss with model size.
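For reference, the relation I believe the paper proposes (for model-size scaling; treat this as my paraphrase) is exponent ≈ 4/d, where d is the intrinsic dimension of the data manifold. A quick table of what that predicts, with illustrative values of d:

```python
# Sharma & Kaplan's proposed relation: model-size scaling exponent ~ 4 / d.
for d in (4, 10, 40, 100):                 # illustrative intrinsic dimensions
    print(f"intrinsic dim d={d:>3}  ->  predicted exponent ~ {4 / d:.2f}")
```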