Jeremy Cohen
@deepcohen
Followers: 6K · Following: 1K · Media: 106 · Statuses: 1K
Research fellow at Flatiron Institute, working on understanding optimization in deep learning. Previously: PhD in machine learning at Carnegie Mellon.
New York, NY
Joined September 2011
Part 1: How does gradient descent work? https://t.co/avsScLLuDF
Part 2: A simple adaptive optimizer https://t.co/KehSb1Wu20
Part 3: How does RMSProp work? https://t.co/t2Cqe67f1M
centralflows.github.io
1
8
106
Applying to do a postdoc or PhD in theoretical ML or neuroscience this year? Consider joining my group (starting next Fall) at UT Austin! POD Postdoc: https://t.co/CmaL3L0B6J CSEM PhD: https://t.co/TdwuZFBgEY
6
66
266
The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work? We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵
11
48
332
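A minimal sketch of what "independent" decay means here, as I read the term (helper names are mine; check the paper for its exact definition): coupled decay shrinks weights by lr·wd per step, so its effect changes whenever the LR is warmed up, decayed, or rescaled by µP, while independent decay applies a fixed shrinkage per step.

```python
# Hedged sketch, not the paper's code: coupled vs. independent weight decay
# for a plain SGD update. Helper names are illustrative.
import torch

def sgd_step_coupled(param: torch.Tensor, grad: torch.Tensor, lr: float, wd: float) -> None:
    # Shrinkage per step is lr * wd: tied to the learning-rate schedule.
    with torch.no_grad():
        param.mul_(1.0 - lr * wd).sub_(lr * grad)

def sgd_step_independent(param: torch.Tensor, grad: torch.Tensor, lr: float, wd: float) -> None:
    # Shrinkage per step is wd alone: unchanged when the LR is rescaled.
    with torch.no_grad():
        param.mul_(1.0 - wd).sub_(lr * grad)
```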
Diffusion models learn probability densities by estimating the score with a neural network trained to denoise. What kind of representation arises within these networks, and how does this relate to the learned density? @EeroSimoncelli @StephaneMallat and I explored this question.
14
92
524
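For reference, the standard identity linking denoising to the score (Tweedie/Miyasawa; background, not a result of this paper): the optimal denoiser is the posterior mean, and

```
y = x + \sigma z,\quad z \sim \mathcal{N}(0, I)
\quad\Longrightarrow\quad
\nabla_y \log p_\sigma(y) \;=\; \frac{\mathbb{E}[x \mid y] - y}{\sigma^{2}},
```

so a network trained to denoise implicitly parameterizes the score of the noise-smoothed density p_σ.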
I'm hiring 2 PhD students & 1 postdoc @GeorgiaTech for Fall'26. Motivated students plz consider us, especially those in
* ML+Quantum
* DeepLearning+Optimization
- PhD: see https://t.co/h4anjm6b8j
- Postdoc: see https://t.co/548XVaahx3 & https://t.co/4ahNE7OOwV
Retweet appreciated
9
120
468
So, why would higher curvature lead to high quantization error? Well, there's a very old idea that low-curvature solutions should be specifiable using less precision. This paper's observation seems to align squarely with that theory. https://t.co/hoh4QahPFI
0
1
8
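A back-of-the-envelope version of that old idea, in my own notation (not from the thread): treat quantization as a small perturbation δ of the trained weights θ* and Taylor-expand the loss.

```
\Delta L \;=\; L(\theta^{*} + \delta) - L(\theta^{*})
\;\approx\; \nabla L(\theta^{*})^{\top}\delta + \tfrac{1}{2}\,\delta^{\top} H\,\delta
\;\approx\; \tfrac{1}{2}\,\delta^{\top} H\,\delta
\;\le\; \tfrac{1}{2}\,\lambda_{\max}(H)\,\lVert\delta\rVert^{2},
```

where H is the Hessian at θ* and the gradient term is negligible near a minimum. For a fixed rounding error ‖δ‖, the loss penalty scales with the curvature, so flatter solutions tolerate coarser precision.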
If that sounds vague to you, we wrote the central flows paper specifically to make this picture quantitatively precise for deterministic training: https://t.co/hlTxjcictN. Similar effects happen during stochastic training, but we don't yet have the theory to make it precise.
arxiv.org
Traditional theories of optimization cannot describe the dynamics of optimization in deep learning, even in the simple setting of deterministic training. The challenge is that optimizers typically...
1
1
3
For example, for deterministic gradient descent, the oscillatory dynamics implicitly keep the top Hessian eigenvalues regulated at the value 2/LR. If you cut the LR, this constraint is lifted, and the top eigenvalues rise, as shown in Figure 6 of our '21 paper: https://t.co/ohSBjvhFIc
1
1
2
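A toy-scale sketch of how to watch this yourself (my own setup, not the paper's code; whether this tiny problem actually reaches the 2/LR ceiling depends on the architecture and LR, but the measurement carries over): run full-batch GD, track the top Hessian eigenvalue by power iteration on Hessian-vector products, and cut the LR mid-training.

```python
# Toy sketch: full-batch GD on a tiny MLP, tracking the top Hessian eigenvalue
# via power iteration on Hessian-vector products, with an LR cut halfway through.
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(256, 10), torch.randn(256, 1)
model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()

def params():
    return [p for p in model.parameters() if p.requires_grad]

def top_hessian_eigenvalue(n_iter: int = 50) -> float:
    """Estimate the largest Hessian eigenvalue of the full-batch loss by power iteration."""
    loss = loss_fn(model(X), y)
    grads = torch.autograd.grad(loss, params(), create_graph=True)
    v = [torch.randn_like(p) for p in params()]
    for _ in range(n_iter):
        hv = torch.autograd.grad(grads, params(), grad_outputs=v, retain_graph=True)
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / (norm + 1e-12) for h in hv]
    hv = torch.autograd.grad(grads, params(), grad_outputs=v)
    return sum((h * u).sum() for h, u in zip(hv, v)).item()  # Rayleigh quotient v^T H v

def gd_step(lr: float) -> float:
    loss = loss_fn(model(X), y)
    grads = torch.autograd.grad(loss, params())
    with torch.no_grad():
        for p, g in zip(params(), grads):
            p -= lr * g
    return loss.item()

lr = 0.05
for step in range(1, 3001):
    if step == 1500:
        lr = 0.01  # cut the learning rate; the 2/LR ceiling jumps from 40 to 200
    loss = gd_step(lr)
    if step % 300 == 0:
        lam = top_hessian_eigenvalue()
        print(f"step {step:5d}  lr {lr:.3f}  loss {loss:.4f}  "
              f"top eig {lam:6.2f}  2/lr {2 / lr:6.1f}")
```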
That LR decay leads to curvature growth is well-known to everyone who has experimentally studied this topic. In deep learning, the curvature always "wants" to increase, but the large LRs used in practice induce noisy/oscillatory dynamics that keep it held down.
1
1
3
The authors didn't measure the curvature, but if they had (say, Hessian top eigenvalue or trace), I'd predict this value would rise exactly when the learning rate is decayed. Further, I'd predict the causal mechanism is: LR decay -> curvature growth -> quantization error
2
1
8
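A hedged sketch of the proposed diagnostic (helper names are mine, not from either paper): at each checkpoint, log the loss gap caused by round-to-nearest weight quantization alongside a Hutchinson estimate of the Hessian trace, and check whether both jump at the LR decay.

```python
# Hedged sketch of the diagnostic: quantization loss gap + Hutchinson trace
# estimate, intended to be logged at checkpoints during training.
import copy
import torch

def quantize_round_to_nearest(t: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor round-to-nearest quantization (illustrative scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = t.abs().max() / qmax + 1e-12
    return torch.round(t / scale).clamp(-qmax - 1, qmax) * scale

@torch.no_grad()
def quantization_loss_gap(model, loss_fn, X, y, bits: int = 8) -> float:
    """Loss(quantized weights) minus loss(full-precision weights)."""
    base = loss_fn(model(X), y).item()
    qmodel = copy.deepcopy(model)
    for p in qmodel.parameters():
        p.copy_(quantize_round_to_nearest(p, bits))
    return loss_fn(qmodel(X), y).item() - base

def hessian_trace_hutchinson(model, loss_fn, X, y, n_samples: int = 10) -> float:
    """tr(H) ~ E[v^T H v] with Rademacher probes v (a standard curvature proxy)."""
    ps = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(X), y)
    grads = torch.autograd.grad(loss, ps, create_graph=True)
    est = 0.0
    for _ in range(n_samples):
        v = [torch.randint_like(p, 0, 2) * 2.0 - 1.0 for p in ps]
        hv = torch.autograd.grad(grads, ps, grad_outputs=v, retain_graph=True)
        est += sum((h * u).sum() for h, u in zip(hv, v)).item()
    return est / n_samples

# Inside a training loop (model, loss_fn, X, y as in your own setup):
#   gap = quantization_loss_gap(model, loss_fn, X, y, bits=8)
#   tr  = hessian_trace_hutchinson(model, loss_fn, X, y)
#   print(f"quantization loss gap {gap:.4e}   Hessian trace {tr:.2f}")
```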
This nice, thorough paper on LLM pretraining shows that quantization error rises sharply when the learning rate is decayed. But, why would that be? The answer is likely related to curvature dynamics.
🚨 Quantization robustness isn’t just post-hoc engineering; it’s a training-time property. Our new paper studies the role of training dynamics in quantization robustness. More in 🧵👇
4
10
110
Watch @alex_damian_ give a talk about this paper here: https://t.co/AXk0tzCeix
Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.* With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.
0
19
185
This is why hybrid math/experiment studies like this one play a necessary role in the broader research ecosystem -- someone needs to go out far ahead of rigorous theory and find analytical tricks that do work even though they have no right to (to quote Jamie Simon).
1
1
12
We don't know how to make this idea rigorous by the usual standards of rigor in the field of optimization. But our experiments imply that this is absolutely the right way to think about these dynamics (i.e. the dynamics of gradient descent and RMSProp in deep learning).
1
0
2
Thanks Mufan! Our approach is definitely "weird": we are studying a completely deterministic, yet chaotic, process, and we treat it as if it's almost a random variable.
I'm a huge fan of this work. I think it's a really brilliant idea to model the chaotic dynamics of edge of stability as a random variable, and observing that you have enough information to just solve for the covariance. Deserves the best paper award imo.
2
0
21
@thegautamkamath I’ve seen Alex Damian give a talk on this paper, and it was imo the best paper I’ve seen all year.
0
1
10
@jasondeanlee @SebastienBubeck @tomgoldsteincs @zicokolter @atalwalkar This is the third, last, and best paper from my PhD. By some metrics, an ML PhD student who writes just three conference papers is "unproductive." But I wouldn't have had it any other way 😉 !
11
20
537
@jasondeanlee @SebastienBubeck @tomgoldsteincs Finally, thanks to my wonderful advisors @zicokolter and @atalwalkar for providing the single most important resource for doing research: time.
2
0
81
Thanks to @jasondeanlee for some crucial SDP wizardry! Jason also served on my thesis committee, along with @SebastienBubeck and @tomgoldsteincs. Thank you all for that!
1
0
37
I'd like to thank Alex for being a perfect, complementary collaborator for me. If this work is good, it's because of the synergy between his skill-set and mine. He's joining MIT next year as an assistant professor in the Math and EECS departments (!) Apply to work with him!
1
2
67