Jeremy Cohen

@deepcohen

Followers
6K
Following
1K
Media
106
Statuses
1K

Research fellow at Flatiron Institute, working on understanding optimization in deep learning. Previously: PhD in machine learning at Carnegie Mellon.

New York, NY
Joined September 2011
@deepcohen
Jeremy Cohen
1 month
Part 1: How does gradient descent work? https://t.co/avsScLLuDF Part 2: A simple adaptive optimizer https://t.co/KehSb1Wu20 Part 3: How does RMSProp work? https://t.co/t2Cqe67f1M
centralflows.github.io
1
8
106
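For readers skimming the timeline, here is a minimal sketch of the two update rules the tutorial series covers, plain gradient descent and RMSProp. The function names and hyperparameter defaults are illustrative, not taken from the linked posts.

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Plain gradient descent: step against the gradient with a fixed learning rate.
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def rmsprop(grad, x0, lr=0.01, beta=0.99, eps=1e-8, steps=100):
    # RMSProp: scale each coordinate's step by a running RMS of its past gradients.
    x = np.asarray(x0, dtype=float)
    nu = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x)
        nu = beta * nu + (1 - beta) * g ** 2
        x = x - lr * g / (np.sqrt(nu) + eps)
    return x
```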
@blake__bordelon
Blake Bordelon ☕️🧪👨‍💻
13 days
Applying to do a postdoc or PhD in theoretical ML or neuroscience this year? Consider joining my group (starting next Fall) at UT Austin! POD Postdoc: https://t.co/CmaL3L0B6J CSEM PhD: https://t.co/TdwuZFBgEY
6
66
266
@AtliKosson
Atli Kosson
13 days
The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work? We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵
11
48
332
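A rough sketch of the distinction at play, assuming "independent weight decay" means decay applied with its own coefficient rather than scaled by the learning rate. That reading is my assumption, not a quote from the thread.

```python
def sgd_step(w, grad, lr, wd, independent_wd=True):
    # Assumed reading of "independent weight decay": the decay term keeps its
    # own strength even when the learning rate is rescaled (e.g. under muP transfer).
    if independent_wd:
        return w - lr * grad - wd * w   # decay strength unaffected by LR changes
    return w - lr * (grad + wd * w)     # coupled: decay strength shrinks with the LR
```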
@ZKadkhodaie
Zahra Kadkhodaie
20 days
Diffusion models learn probability densities by estimating the score with a neural network trained to denoise. What kind of representation arises within these networks, and how does this relate to the learned density? @EeroSimoncelli @StephaneMallat and I explored this question.
14
92
524
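For context, the standard identity behind "estimating the score with a network trained to denoise" (the Tweedie/Miyasawa relation, not a result specific to this paper): if data is corrupted with Gaussian noise of level sigma, the optimal denoiser determines the score of the noisy density.

```latex
\nabla_{\tilde{x}} \log p_\sigma(\tilde{x})
  = \frac{D_\sigma(\tilde{x}) - \tilde{x}}{\sigma^{2}},
\qquad
D_\sigma(\tilde{x}) = \mathbb{E}\!\left[x \mid \tilde{x} = x + \sigma\varepsilon\right].
```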
@MoleiTaoMath
Molei Tao
22 days
I'm hiring 2 PhD students & 1 postdoc @GeorgiaTech for Fall '26. Motivated students, plz consider us, especially those in * ML+Quantum * DeepLearning+Optimization. PhD: see https://t.co/h4anjm6b8j. Postdoc: see https://t.co/548XVaahx3 & https://t.co/4ahNE7OOwV. Retweet appreciated
9
120
468
@deepcohen
Jeremy Cohen
24 days
So, why would higher curvature lead to high quantization error? Well, there's a very old idea that low-curvature solutions should be specifiable using less precision. This paper's observation seems to align squarely with that theory. https://t.co/hoh4QahPFI
0
1
8
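The "flat minima need less precision" intuition can be made concrete with a second-order expansion. If quantization perturbs the weights by delta at an approximate minimum (gradient near zero), the resulting loss increase is controlled by the curvature; this is a standard argument, not a result from the linked paper.

```latex
\Delta L \;\approx\; \tfrac{1}{2}\,\delta^{\top} H\,\delta
\;\le\; \tfrac{1}{2}\,\lambda_{\max}(H)\,\lVert \delta \rVert^{2}
```

So for a fixed quantization step size, flatter solutions (smaller top eigenvalue or trace of H) incur a smaller loss penalty from the same rounding error.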
@deepcohen
Jeremy Cohen
24 days
If that sounds vague to you, we wrote the central flows paper precisely to make this picture quantitatively precise for deterministic training: https://t.co/hlTxjcictN. Similar effects happen during stochastic training, but we don't yet have the theory to make it precise.
arxiv.org
Traditional theories of optimization cannot describe the dynamics of optimization in deep learning, even in the simple setting of deterministic training. The challenge is that optimizers typically...
1
1
3
@deepcohen
Jeremy Cohen
24 days
For example, for deterministic gradient descent, the oscillatory dynamics implicitly keep the top Hessian eigenvalues regulated at the value 2/LR. If you cut LR, this constraint is lifted, and the top eigenvalues rise, as shown in Figure 6 of our '21 paper: https://t.co/ohSBjvhFIc
1
1
2
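The 2/LR threshold is easy to see on a one-dimensional quadratic. A toy illustration (not from the paper's code):

```python
def gd_on_quadratic(curvature, lr, steps=50, x0=1.0):
    # Gradient descent on L(x) = 0.5 * curvature * x^2 gives
    # x_{t+1} = (1 - lr * curvature) * x_t, which diverges iff curvature > 2 / lr.
    x = x0
    for _ in range(steps):
        x = (1.0 - lr * curvature) * x
    return x

lr = 0.1                                        # stability threshold: 2 / lr = 20
print(gd_on_quadratic(curvature=19.0, lr=lr))   # |1 - 1.9| < 1: converges toward 0
print(gd_on_quadratic(curvature=21.0, lr=lr))   # |1 - 2.1| > 1: oscillates and blows up
```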
@deepcohen
Jeremy Cohen
24 days
That LR decay leads to curvature growth is well-known to everyone who has experimentally studied this topic. In deep learning, the curvature always "wants" to increase, but the large LRs used in practice induce noisy/oscillatory dynamics that keep it held down.
1
1
3
@deepcohen
Jeremy Cohen
24 days
The authors didn't measure the curvature, but if they had (say, Hessian top eigenvalue or trace), I'd predict this value would rise exactly when the learning rate is decayed. Further, I'd predict the causal mechanism is: LR decay -> curvature growth -> quantization error
2
1
8
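One way to run the measurement being proposed is power iteration on Hessian-vector products, which needs only autograd. A minimal PyTorch sketch; the function name and iteration count are illustrative, not a reference implementation.

```python
import torch

def top_hessian_eigenvalue(loss_fn, params, iters=20):
    # Estimate the largest Hessian eigenvalue of loss_fn w.r.t. params
    # by power iteration on Hessian-vector products.
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(iters):
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / norm for vi in v]
        loss = loss_fn()
        grads = torch.autograd.grad(loss, params, create_graph=True)
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        hv = torch.autograd.grad(gv, params)            # Hessian-vector product
        eig = sum((h * vi).sum() for h, vi in zip(hv, v)).item()  # Rayleigh quotient
        v = [h.detach() for h in hv]
    return eig
```

Tracking this quantity (or a Hutchinson trace estimate) across the LR schedule would directly test the predicted rise at decay time.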
@deepcohen
Jeremy Cohen
24 days
This nice, thorough paper on LLM pretraining shows that quantization error rises sharply when the learning rate is decayed. But, why would that be? The answer is likely related to curvature dynamics.
@actatjer
Albert Catalán Tatjer
28 days
🚨 Quantization robustness isn’t just post-hoc engineering; it’s a training-time property. Our new paper studies the role of training dynamics in quantization robustness. More in 🧵👇
4
10
110
@deepcohen
Jeremy Cohen
25 days
Watch @alex_damian_ give a talk about this paper here: https://t.co/AXk0tzCeix
@deepcohen
Jeremy Cohen
1 month
Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.* With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.
0
19
185
@deepcohen
Jeremy Cohen
1 month
This is why hybrid math/experiment studies like this one play a necessary role in the broader research ecosystem -- someone needs to go out far ahead of rigorous theory and find analytical tricks that do work even though they have no right to (to quote Jamie Simon).
1
1
12
@deepcohen
Jeremy Cohen
1 month
We don't know how to make this idea rigorous, to the usual rigor standards of the field of optimization. But our experiments imply that this is absolutely the right way to think about these dynamics (i.e. the dynamics of gradient descent and RMSProp in deep learning).
1
0
2
@deepcohen
Jeremy Cohen
1 month
Thanks Mufan! Our approach is definitely "weird": we are studying a completely deterministic, yet chaotic, process, and we treat it as if it were almost a random variable.
@mufan_li
Mufan Li
1 month
I'm a huge fan of this work. I think it's a really brilliant idea to model the chaotic dynamics of edge of stability as a random variable, and observing that you have enough information to just solve for the covariance. Deserves the best paper award imo.
2
0
21
@mufan_li
Mufan Li
1 month
@thegautamkamath I’ve seen Alex Damian give a talk on this paper, and it was imo the best paper I’ve seen all year.
0
1
10
@deepcohen
Jeremy Cohen
1 month
@jasondeanlee @SebastienBubeck @tomgoldsteincs @zicokolter @atalwalkar This is the third, last, and best paper from my PhD. By some metrics, an ML PhD student who writes just three conference papers is "unproductive." But I wouldn't have had it any other way 😉 !
11
20
537
@deepcohen
Jeremy Cohen
1 month
@jasondeanlee @SebastienBubeck @tomgoldsteincs Finally, thanks to my wonderful advisors @zicokolter and @atalwalkar for providing the single most important resource for doing research: time.
2
0
81
@deepcohen
Jeremy Cohen
1 month
Thanks to @jasondeanlee for some crucial SDP wizardry! Jason also served on my thesis committee, along with @SebastienBubeck and @tomgoldsteincs. Thank you all for that!
1
0
37
@deepcohen
Jeremy Cohen
1 month
I'd like to thank Alex for being a perfect, complementary collaborator for me. If this work is good, it's because of the synergy between his skill-set and mine. He's joining MIT next year as an assistant professor in the Math and EECS departments (!) Apply to work with him!
1
2
67