
Katie Everett
@_katieeverett
Machine learning researcher @GoogleDeepMind + PhD student @MIT. Opinions are my own.
Joined August 2013
@damien_ferbach @cypaquette @poseypaquet @gauthier_gidel If you enjoyed this thread, see Part 2 here:
There were so many great replies to this thread, let's do a Part 2! For scaling laws between loss and compute, where loss = a * flops^b + c, which factors primarily change the constant (a), and which factors can actually change the exponent (b)?
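To make the question concrete, here is a minimal sketch (hypothetical numbers, not from any specific paper, assuming scipy is available) of fitting that functional form: the constant a scales the reducible part of the loss, while the exponent b sets the slope on a log-log plot.

```python
# Minimal sketch with made-up numbers: fit loss = a * flops**b + c to synthetic data.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(flops, a, b, c):
    # b < 0: loss falls with compute toward the irreducible floor c.
    return a * flops ** b + c

# Synthetic "measurements" generated from a known law, plus a little noise.
rng = np.random.default_rng(0)
flops = np.logspace(18, 24, 40)
loss = scaling_law(flops, a=2.6, b=-0.05, c=1.7) * (1 + 0.002 * rng.standard_normal(flops.size))

popt, _ = curve_fit(scaling_law, flops, loss, p0=(2.0, -0.06, 1.5), maxfev=20000)
print("fitted: a=%.2f  b=%.3f  c=%.2f" % tuple(popt))
```

Halving a halves the reducible loss at every compute budget; making b more negative steepens the curve, so the gains compound as compute grows.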
RT @damien_ferbach: It's very difficult to improve the *exponent* in scaling laws for loss vs compute, especially by changing the optimizer…
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer @damien_ferbach @cypaquette @poseypaquet @gauthier_gidel More arXiv refs:
Pagnoni et al 2024: 2412.09871
Snell et al 2024: 2408.03314
Brown & Juravsky et al 2024: 2407.21787
Schaeffer et al 2025: 2502.17578
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer @damien_ferbach @cypaquette @poseypaquet @gauthier_gidel arXiv refs:
Sharma & Kaplan 2020: 2004.10802
Henighan et al 2020: 2010.14701
Bansal et al 2022: 2202.01994
Wang et al 2024: 2410.05661
Shukor et al 2025: 2504.07951
Krajewski et al 2024: 2402.07871
Bordelon & Atanasov et al 2024: 2409.17858
Liu et al 2025: 2502.16982
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer Thanks again for all the great responses to the first thread, and I hope you are following @damien_ferbach! 😁 cc @cypaquette @poseypaquet @gauthier_gidel
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer In summary:
* Data affects the exponent!
* Feature learning can improve the exponent over kernel regimes in some cases
* So far we see similar exponents for Muon vs Adam and MoE vs dense models
* Scaling inference-time compute opens new scaling dimensions that seem promising!
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer Schaeffer et al. 2025 scales the number of attempts per task. They show the distribution over single-attempt success rates predicts the power law exponent for success vs number of attempts. Task difficulty affects the scaling exponent again!
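A rough way to see that mechanism (my own toy numbers, not the paper's setup): draw per-task single-attempt success rates from a Beta distribution and watch how the unsolved fraction decays with the number of attempts. The shape of the distribution near zero sets the apparent power-law exponent.

```python
# Toy simulation: per-task success rates p_i ~ Beta(alpha, 5). The fraction of tasks
# still unsolved after k independent attempts is mean_i (1 - p_i)^k, which decays
# roughly like k^(-alpha), so the left tail of the success-rate distribution
# (how many near-impossible tasks there are) sets the exponent.
import numpy as np

rng = np.random.default_rng(0)
ks = np.logspace(1, 4, 12)  # 10 to 10,000 attempts

for alpha in (0.2, 0.5, 1.0):
    p = rng.beta(alpha, 5.0, size=50_000)                     # single-attempt success rates
    unsolved = np.array([np.mean((1 - p) ** k) for k in ks])  # fraction not yet solved
    slope, _ = np.polyfit(np.log(ks), np.log(unsolved), 1)
    print(f"alpha={alpha:.1f}  fitted exponent ~ {slope:.2f}")  # roughly -alpha
```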
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer Snell et al 2024 asks how we should trade off pretraining vs test-time compute: the answer depends on the *difficulty of the task* as well as the ratio of training to test-time compute load. Easier tasks favor more test-time compute, but harder tasks favor more pretraining.
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 @ted_engineer Brown & Juravsky et al 2024 shows inference-time scaling laws using repeated sampling, where log(coverage) = a * (num samples)^b. Coverage is the fraction of problems solved by any generated sample. The exponent b on the number of samples depends on both the task and the model.
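For concreteness, fitting that form reduces to a linear regression after two logs. A sketch with made-up coverage numbers (not the paper's data), assuming the form quoted above:

```python
# Fit log(coverage) = a * k^b by regressing log(-log(coverage)) on log(k).
# Coverage numbers below are made up for illustration.
import numpy as np

k = np.array([1, 4, 16, 64, 256, 1024])                    # samples per problem
coverage = np.array([0.15, 0.33, 0.52, 0.68, 0.80, 0.88])  # fraction of problems solved

b, log_neg_a = np.polyfit(np.log(k), np.log(-np.log(coverage)), 1)
a = -np.exp(log_neg_a)
print(f"fitted a={a:.2f}, b={b:.2f}")  # b is the exponent that varies by task and model
```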
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 But if we ask "what's the best bits-per-byte given X FLOPs to train *and* Y FLOPs per byte at inference?", then the Byte Latent Transformer has an advantage: it can scale the patch size to deploy larger models for the same inference cost. h/t @ted_engineer
@_katieeverett Meta claims their raw-bytes architecture BLT has better scaling laws than the classical BPE + transformer recipe
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 Pagnoni et al 2024: Byte Latent Transformer groups bytes dynamically into patches instead of tokenizing w/ fixed vocab. For hard predictions, use smaller patches = more inference compute. Exponents look similar when asking "what's the best bits-per-byte given X FLOPs to train?"
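A toy version of the dynamic patching idea described above (my own sketch, not the BLT implementation): given a per-byte difficulty score such as the next-byte entropy of a small byte-level model, start a new patch at every hard byte, so difficult spans end up as many small patches and easy spans as a few large ones.

```python
# Toy entropy-threshold patching (illustrative only, not the BLT code).
def patch_boundaries(entropies, threshold=2.0):
    """Group byte positions into patches; a new patch starts at each high-entropy byte."""
    patches, current = [], [0]
    for i, h in enumerate(entropies[1:], start=1):
        if h > threshold:        # hard-to-predict byte: start a new patch here
            patches.append(current)
            current = []
        current.append(i)
    patches.append(current)
    return patches

# Positions 3-5 have high entropy, so each of them starts a new (small) patch.
print(patch_boundaries([0.1, 0.3, 0.2, 3.1, 2.8, 2.5, 0.4, 0.2]))
# -> [[0, 1, 2], [3], [4], [5, 6, 7]]
```

Smaller patches mean the large latent model runs more often per byte, which is the "hard predictions get more inference compute" tradeoff described in the tweet above.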
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov @SeunghyunSEO7 Finally, there is a lot of interesting recent work using variable amounts of inference-time compute! It's more nuanced to define what "improving the scaling exponent" means here, but let's look at some promising ideas:
@chopwatercarry @EIFY @MustafaShukor1 @ABAtanasov Liu et al 2025 scales up Muon and finds similar scaling exponents between Muon (-0.052) and AdamW (-0.054). h/t @SeunghyunSEO7.
even muon shows similar exponent: 2.506 * C^-0.052 (Muon) vs 2.608 * C^-0.054 (AdamW) (source: Moonlight)
@chopwatercarry @EIFY @MustafaShukor1 Now onto optimizers: Bordelon & Atanasov et al 2024 solve a theoretical model showing feature learning can improve the scaling exponent, specifically for tasks where the target function lies outside the RKHS of the initial kernel. h/t @ABAtanasov
@_katieeverett This is a beautiful and thorough review. I will say that although it’s hard to make the power law better, you can definitely make it *worse* by putting the network into the lazy/NTK regime. At least in that setting, though, we can see how different datasets give different power laws.
@chopwatercarry @EIFY @MustafaShukor1 Krajewski et al 2024 proposes compute-optimal scaling of tokens, params, and granularity for MoEs, where granularity is a hyperparameter controlling the size of the experts. It looks like granularity might also affect the constant but not the exponent?
@chopwatercarry @EIFY Shukor et al 2025 compares MoE and Dense Transformers on multimodal models. They see similar exponents between MoE and Dense models. They again find that changing the data mixture affects the exponent. h/t @MustafaShukor1.
@_katieeverett Interesting! We had similar findings for multimodal models. The architecture has more effect on shifting the scaling law curve (MoEs vs Dense) than on the exponent. On the other hand, changing the data mixture (same architecture) can affect both.
@chopwatercarry For MoEs vs Dense Transformers: Wang et al 2024 compares MoE and Dense Transformer language models, and shows a very similar exponent for both architectures. h/t @EIFY
@chopwatercarry Bansal et al 2022 compares the data scaling exponents on translation tasks and finds the same exponent across different architectures, data filtering techniques, and synthetic i.i.d. noise. Adding non-i.i.d. noise via data augmentation (back-translation) does change the exponent.
@chopwatercarry Henighan et al 2020 shows empirically that different modalities (language, image, video, math) have different exponents. Same for different image resolutions (8x8 vs 16x16 etc). (They also find that the exponent for optimal model size vs compute is universal across modalities.)
On data: Sharma & Kaplan 2020 proposes a theoretical model where the data distribution and task induce the data manifold dimensionality, which in turn induces the scaling exponent. This can explain why different modalities have different exponents. h/t @chopwatercarry
@_katieeverett For model size, Sharma and Kaplan proposed that the data distribution and task induce a manifold whose intrinsic dimensionality determines the exponent for the scaling of test loss with model size.