Alex Damian
@alex_damian_
590 Followers · 106 Following · 2 Media · 15 Statuses
Research Fellow @KempnerInst | Incoming Asst Prof @MIT Math+EECS (Fall 2026) | Deep Learning Theory
Cambridge, MA
Joined March 2021
Why do deep learning optimizers make progress even in the edge-of-stability regime? 🤔 @alex_damian_ will present theory that can describe the dynamics of optimization in this regime! 🗓️ Nov 17, 3pm ET @scaleml
0 replies · 10 reposts · 72 likes
Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.* With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.
20 replies · 213 reposts · 1K likes
Causal self-attention encodes causal structure between tokens (e.g. induction heads, learning function classes in-context, n-grams). But how do transformers learn this causal structure via gradient descent? New paper with @alex_damian_ @jasondeanlee! https://t.co/FmT8Eo5SH3 (1/10)
6 replies · 98 reposts · 430 likes
[LG] How Transformers Learn Causal Structure with Gradient Descent. E. Nichani, A. Damian, J. D. Lee [Princeton University] (2024) https://t.co/JgjcSi4AAY - The paper studies how transformers learn causal structure through gradient descent when trained on a novel in-context learning
2 replies · 46 reposts · 169 likes
Hope everyone had a great time at M3L today! Many thanks to the speakers, authors, reviewers, participants and volunteers for all your contributions that made this workshop fun and successful, we hope to see you again next year! 😃✨
0 replies · 5 reposts · 33 likes
🚨💡We are organizing a workshop on Mathematics of Modern Machine Learning (M3L) at #NeurIPS2023! 🚀Join us if you are interested in exploring theories for understanding and advancing modern ML practice. https://t.co/iepaZUPlct Submission ddl: October 2, 2023 @M3LWorkshop
1 reply · 24 reposts · 148 likes
New paper with @alex_damian_ and @jasondeanlee! We identify a new implicit bias of GD: Self-Stabilization. When the loss is too sharp and iterates begin to diverge, self-stabilization decreases sharpness until GD is stable. This explains the “Edge of Stability” phenomenon! (1/3)
1 reply · 5 reposts · 24 likes
We would also like to thank @deepcohen, @vfleaking, and @leichen1994 for helpful discussions throughout the course of this project. (7/7)
0 replies · 0 reposts · 1 like
We precisely characterize this self-stabilization mechanism and show it accurately predicts the EoS dynamics of GD in standard deep learning tasks: (6/7)
1 reply · 0 reposts · 2 likes
A consequence of self-stabilization is that GD implicitly follows projected gradient descent on the constrained problem: minimize loss such that sharpness ≤ 2/lr. (5/7)
1 reply · 0 reposts · 2 likes
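The constrained dynamics in the tweet above can be written out as a formula (a sketch in the tweet's notation, with learning rate η, loss L, and sharpness S = largest Hessian eigenvalue):

```latex
\theta_{t+1} \;=\; \mathrm{Proj}_{\{\theta \,:\, S(\theta) \le 2/\eta\}}
\bigl(\theta_t - \eta \nabla L(\theta_t)\bigr)
```

That is, GD at the edge of stability implicitly tracks projected gradient descent for the problem "minimize L(θ) subject to S(θ) ≤ 2/η".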
As the iterates diverge, the cubic term in the Taylor expansion of the gradient is precisely the gradient of the sharpness, which has the effect of reducing the sharpness until the dynamics are again stable. (4/7)
1 reply · 0 reposts · 3 likes
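The mechanism in the tweet above can be sketched in formulas. Expanding the gradient around θ along the top Hessian eigenvector v (eigenvalue S(θ)), a displacement δ = x·v gives:

```latex
\nabla L(\theta + x v) \;\approx\; \nabla L(\theta) + x\, S(\theta)\, v
+ \frac{x^2}{2}\, \nabla^3 L(\theta)[v, v],
\qquad
\nabla^3 L(\theta)[v, v] = \nabla S(\theta),
```

where the second identity is eigenvalue perturbation: the derivative of the top eigenvalue is the third-derivative tensor contracted with its eigenvector. So while the oscillation x grows, the GD step picks up a drift of −(η x²/2)∇S(θ), pushing the iterates toward lower sharpness until S(θ) < 2/η restores stability.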
Our paper explains the EoS dynamics through a new implicit bias of GD called self-stabilization, which prevents divergence by aggressively reducing the sharpness once the loss spikes. Self-stabilization results from the cubic term in the Taylor expansion of the gradient. (3/7)
1 reply · 0 reposts · 2 likes
When the loss landscape becomes too sharp for the learning rate, the iterates diverge and the loss spikes. However, GD miraculously recovers from these spikes and continues to decrease the loss. https://t.co/PffO96peXT dubbed this regime the “Edge of Stability” (EoS). (2/7)
1 reply · 0 reposts · 3 likes
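The 2/η instability threshold behind this tweet is visible already on a one-dimensional quadratic. A minimal toy sketch of mine (not the paper's experiment): GD on f(x) = s·x²/2 updates x ← (1 − η·s)·x, so it contracts exactly when the sharpness s (here, f'') is below 2/η and diverges above it.

```python
def gd_final(s, lr, x0=1.0, steps=100):
    """Run GD on f(x) = s*x^2/2 and return the final iterate."""
    x = x0
    for _ in range(steps):
        x -= lr * s * x  # gradient of s*x^2/2 is s*x
    return x

lr = 0.1                 # stability threshold: 2/lr = 20
stable = gd_final(s=19.0, lr=lr)    # s < 2/lr: |1 - lr*s| = 0.9, shrinks
unstable = gd_final(s=21.0, lr=lr)  # s > 2/lr: |1 - lr*s| = 1.1, blows up
```

On a quadratic there is no cubic term, so nothing rescues the divergent case; self-stabilization is what makes real (non-quadratic) losses recover instead.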
How does GD recover from spikes in the loss? A 🧵 on our new paper “Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability” with @EshaanNichani and @jasondeanlee
https://t.co/qcnu01a0Au (1/7)
1 reply · 14 reposts · 107 likes
ArXiv https://t.co/iUhnNb4CmB: Analysis of SGD. Sharpness (largest eigenvalue of the Hessian) steadily increases during training until instability cutoff 2/η then it hovers around 2/η. Training loss still decreases. Reason: self-stabilization via cubic term in Taylor expansion.
2 replies · 47 reposts · 187 likes
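Sharpness as used in the tweet above (largest Hessian eigenvalue) is typically estimated by power iteration on Hessian-vector products rather than by forming the Hessian. A minimal sketch of mine, with a hypothetical `hvp` callback and a toy diagonal Hessian standing in for a real network's:

```python
import numpy as np

def sharpness(hvp, dim, iters=100, seed=0):
    """Estimate the top Hessian eigenvalue via power iteration.

    hvp: callable v -> H @ v (Hessian-vector product).
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(v)
        v = hv / np.linalg.norm(hv)  # renormalize toward top eigenvector
    return v @ hvp(v)                # Rayleigh quotient at convergence

H = np.diag([3.0, 1.0, 0.5])         # toy Hessian of a quadratic loss
est = sharpness(lambda v: H @ v, 3)  # converges to the top eigenvalue, 3.0
```

In real experiments the `hvp` callback comes from automatic differentiation (a gradient-of-gradient-dot-vector), and the estimate is tracked over training to see it rise to and then hover at 2/η.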