Jeremy Cohen

@deepcohen

Followers
6K
Following
1K
Media
106
Statuses
1K

Research fellow at Flatiron Institute, working on understanding optimization in deep learning. Previously: PhD in machine learning at Carnegie Mellon.

New York, NY
Joined September 2011
@deepcohen
Jeremy Cohen
1 month
Part 1: How does gradient descent work? https://t.co/avsScLLuDF Part 2: A simple adaptive optimizer https://t.co/KehSb1Wu20 Part 3: How does RMSProp work? https://t.co/t2Cqe67f1M
centralflows.github.io
1
8
106
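For readers skimming the timeline, here is a minimal sketch of the two update rules the tutorial series covers, plain gradient descent and RMSProp. The function names and hyperparameter defaults are illustrative, not taken from the linked posts.

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Plain gradient descent: step against the gradient with a fixed learning rate.
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def rmsprop(grad, x0, lr=0.01, beta=0.99, eps=1e-8, steps=100):
    # RMSProp: scale each coordinate's step by a running RMS of its past gradients.
    x = np.asarray(x0, dtype=float)
    nu = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x)
        nu = beta * nu + (1 - beta) * g ** 2
        x = x - lr * g / (np.sqrt(nu) + eps)
    return x
```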
@blake__bordelon
Blake Bordelon ☕️🧪👨‍💻
13 days
Applying to do a postdoc or PhD in theoretical ML or neuroscience this year? Consider joining my group (starting next Fall) at UT Austin! POD Postdoc: https://t.co/CmaL3L0B6J CSEM PhD: https://t.co/TdwuZFBgEY
6
66
266
@AtliKosson
Atli Kosson
13 days
The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work? We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵
11
48
332
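A rough sketch of the distinction at play, assuming "independent weight decay" means decay applied with its own coefficient rather than scaled by the learning rate. That reading is my assumption, not a quote from the thread.

```python
def sgd_step(w, grad, lr, wd, independent_wd=True):
    # Assumed reading of "independent weight decay": the decay term keeps its
    # own strength even when the learning rate is rescaled (e.g. under muP transfer).
    if independent_wd:
        return w - lr * grad - wd * w   # decay strength unaffected by LR changes
    return w - lr * (grad + wd * w)     # coupled: decay strength shrinks with the LR
```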
@ZKadkhodaie
Zahra Kadkhodaie
20 days
Diffusion models learn probability densities by estimating the score with a neural network trained to denoise. What kind of representation arises within these networks, and how does this relate to the learned density? @EeroSimoncelli @StephaneMallat and I explored this question.
14
92
524
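For context, the standard identity behind "estimating the score with a network trained to denoise" (the Tweedie/Miyasawa relation, not a result specific to this paper): if data is corrupted with Gaussian noise of level sigma, the optimal denoiser determines the score of the noisy density.

```latex
\nabla_{\tilde{x}} \log p_\sigma(\tilde{x})
  = \frac{D_\sigma(\tilde{x}) - \tilde{x}}{\sigma^{2}},
\qquad
D_\sigma(\tilde{x}) = \mathbb{E}\!\left[x \mid \tilde{x} = x + \sigma\varepsilon\right].
```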
@MoleiTaoMath
Molei Tao
22 days
I'm hiring 2 PhD students & 1 postdoc @GeorgiaTech for Fall '26. Motivated students, plz consider us, especially those in * ML+Quantum * DeepLearning+Optimization. PhD: see https://t.co/h4anjm6b8j. Postdoc: see https://t.co/548XVaahx3 & https://t.co/4ahNE7OOwV. Retweet appreciated
9
120
468
@deepcohen
Jeremy Cohen
24 days
So, why would higher curvature lead to high quantization error? Well, there's a very old idea that low-curvature solutions should be specifiable using less precision. This paper's observation seems to align squarely with that theory. https://t.co/hoh4QahPFI
0
1
8
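The "flat minima need less precision" intuition can be made concrete with a second-order expansion. If quantization perturbs the weights by delta at an approximate minimum (gradient near zero), the resulting loss increase is controlled by the curvature; this is a standard argument, not a result from the linked paper.

```latex
\Delta L \;\approx\; \tfrac{1}{2}\,\delta^{\top} H\,\delta
\;\le\; \tfrac{1}{2}\,\lambda_{\max}(H)\,\lVert \delta \rVert^{2}
```

So for a fixed quantization step size, flatter solutions (smaller top eigenvalue or trace of H) incur a smaller loss penalty from the same rounding error.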
@deepcohen
Jeremy Cohen
24 days
If that sounds vague to you, we wrote the central flows paper precisely to make this picture quantitatively precise for deterministic training: https://t.co/hlTxjcictN. Similar effects happen during stochastic training, but we don't yet have the theory to make it precise.
arxiv.org
Traditional theories of optimization cannot describe the dynamics of optimization in deep learning, even in the simple setting of deterministic training. The challenge is that optimizers typically...
1
1
3
@deepcohen
Jeremy Cohen
24 days
For example, for deterministic gradient descent, the oscillatory dynamics implicitly keep the top Hessian eigenvalues regulated at the value 2/LR. If you cut LR, this constraint is lifted, and the top eigenvalues rise, as shown in Figure 6 of our '21 paper: https://t.co/ohSBjvhFIc
1
1
2
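The 2/LR threshold is easy to see on a one-dimensional quadratic. A toy illustration (not from the paper's code):

```python
def gd_on_quadratic(curvature, lr, steps=50, x0=1.0):
    # Gradient descent on L(x) = 0.5 * curvature * x^2 gives
    # x_{t+1} = (1 - lr * curvature) * x_t, which diverges iff curvature > 2 / lr.
    x = x0
    for _ in range(steps):
        x = (1.0 - lr * curvature) * x
    return x

lr = 0.1                                        # stability threshold: 2 / lr = 20
print(gd_on_quadratic(curvature=19.0, lr=lr))   # |1 - 1.9| < 1: converges toward 0
print(gd_on_quadratic(curvature=21.0, lr=lr))   # |1 - 2.1| > 1: oscillates and blows up
```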
@deepcohen
Jeremy Cohen
24 days
That LR decay leads to curvature growth is well-known to everyone who has experimentally studied this topic. In deep learning, the curvature always "wants" to increase, but the large LRs used in practice induce noisy/oscillatory dynamics that keep it held down.
1
1
3
@deepcohen
Jeremy Cohen
24 days
The authors didn't measure the curvature, but if they had (say, Hessian top eigenvalue or trace), I'd predict this value would rise exactly when the learning rate is decayed. Further, I'd predict the causal mechanism is: LR decay -> curvature growth -> quantization error
2
1
8
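One way to run the measurement being proposed is power iteration on Hessian-vector products, which needs only autograd. A minimal PyTorch sketch; the function name and iteration count are illustrative, not a reference implementation.

```python
import torch

def top_hessian_eigenvalue(loss_fn, params, iters=20):
    # Estimate the largest Hessian eigenvalue of loss_fn w.r.t. params
    # by power iteration on Hessian-vector products.
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(iters):
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / norm for vi in v]
        loss = loss_fn()
        grads = torch.autograd.grad(loss, params, create_graph=True)
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        hv = torch.autograd.grad(gv, params)            # Hessian-vector product
        eig = sum((h * vi).sum() for h, vi in zip(hv, v)).item()  # Rayleigh quotient
        v = [h.detach() for h in hv]
    return eig
```

Tracking this quantity (or a Hutchinson trace estimate) across the LR schedule would directly test the predicted rise at decay time.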
@deepcohen
Jeremy Cohen
24 days
This nice, thorough paper on LLM pretraining shows that quantization error rises sharply when the learning rate is decayed. But, why would that be? The answer is likely related to curvature dynamics.
@actatjer
Albert Catalán Tatjer
28 days
🚨 Quantization robustness isn’t just post-hoc engineering; it’s a training-time property. Our new paper studies the role of training dynamics in quantization robustness. More in 🧵👇
4
10
110
@deepcohen
Jeremy Cohen
25 days
Watch @alex_damian_ give a talk about this paper here: https://t.co/AXk0tzCeix
@deepcohen
Jeremy Cohen
1 month
Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.* With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.
0
19
185
@deepcohen
Jeremy Cohen
1 month
This is why hybrid math/experiment studies like this one play a necessary role in the broader research ecosystem -- someone needs to go out far ahead of rigorous theory and find analytical tricks that do work even though they have no right to (to quote Jamie Simon).
1
1
12
@deepcohen
Jeremy Cohen
1 month
We don't know how to make this idea rigorous, to the usual rigor standards of the field of optimization. But our experiments imply that this is absolutely the right way to think about these dynamics (i.e. the dynamics of gradient descent and RMSProp in deep learning).
1
0
2
@deepcohen
Jeremy Cohen
1 month
Thanks Mufan! Our approach is definitely "weird": we are studying a completely deterministic, yet chaotic, process, and we treat it as if it were almost a random variable.
@mufan_li
Mufan Li
1 month
I'm a huge fan of this work. I think it's a really brilliant idea to model the chaotic dynamics of edge of stability as a random variable, and observing that you have enough information to just solve for the covariance. Deserves the best paper award imo.
2
0
21
@mufan_li
Mufan Li
1 month
@thegautamkamath I’ve seen Alex Damian give a talk on this paper, and it was imo the best paper I’ve seen all year.
0
1
10
@deepcohen
Jeremy Cohen
1 month
@jasondeanlee @SebastienBubeck @tomgoldsteincs @zicokolter @atalwalkar This is the third, last, and best paper from my PhD. By some metrics, an ML PhD student who writes just three conference papers is "unproductive." But I wouldn't have had it any other way 😉 !
11
20
537
@deepcohen
Jeremy Cohen
1 month
@jasondeanlee @SebastienBubeck @tomgoldsteincs Finally, thanks to my wonderful advisors @zicokolter and @atalwalkar for providing the single most important resource for doing research: time.
2
0
81
@deepcohen
Jeremy Cohen
1 month
Thanks to @jasondeanlee for some crucial SDP wizardry! Jason also served on my thesis committee, along with @SebastienBubeck and @tomgoldsteincs. Thank you all for that!
1
0
37
@deepcohen
Jeremy Cohen
1 month
I'd like to thank Alex for being a perfect, complementary collaborator for me. If this work is good, it's because of the synergy between his skill-set and mine. He's joining MIT next year as an assistant professor in the Math and EECS departments (!) Apply to work with him!
1
2
67