Alex Damian Profile
Alex Damian

@alex_damian_

Followers: 590 · Following: 106 · Media: 2 · Statuses: 15

Research Fellow @KempnerInst | Incoming Asst Prof @MIT Math+EECS (Fall 2026) | Deep Learning Theory

Cambridge, MA
Joined March 2021
@jyo_pari
Jyo Pari
1 month
Why do deep learning optimizers make progress even in the edge-of-stability regime? 🤔 @alex_damian_ will present theory that can describe the dynamics of optimization in this regime! 🗓️ Nov 17, 3pm ET @scaleml
0
10
72
@deepcohen
Jeremy Cohen
3 months
Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.* With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.
20
213
1K
@EshaanNichani
Eshaan Nichani
2 years
Causal self-attention encodes causal structure between tokens (eg. induction head, learning function class in-context, n-grams). But how do transformers learn this causal structure via gradient descent? New paper with @alex_damian_ @jasondeanlee! https://t.co/FmT8Eo5SH3 (1/10)
6
98
430
@fly51fly
fly51fly
2 years
[LG] How Transformers Learn Causal Structure with Gradient Descent E Nichani, A Damian, J D. Lee [Princeton University] (2024) https://t.co/JgjcSi4AAY - The paper studies how transformers learn causal structure through gradient descent when trained on a novel in-context learning …
2
46
169
@M3LWorkshop
M3L Workshop @ NeurIPS 2024
2 years
Hope everyone had a great time at M3L today! Many thanks to the speakers, authors, reviewers, participants, and volunteers for all your contributions that made this workshop fun and successful. We hope to see you again next year! 😃✨
0
5
33
@zhiyuanli_
Zhiyuan Li
2 years
🚨💡We are organizing a workshop on Mathematics of Modern Machine Learning (M3L) at #NeurIPS2023! 🚀Join us if you are interested in exploring theories for understanding and advancing modern ML practice. https://t.co/iepaZUPlct Submission ddl: October 2, 2023 @M3LWorkshop
1
24
148
@EshaanNichani
Eshaan Nichani
3 years
New paper with @alex_damian_ and @jasondeanlee! We identify a new implicit bias of GD: Self-Stabilization. When the loss is too sharp and iterates begin to diverge, self-stabilization decreases sharpness until GD is stable. This explains the “Edge of Stability” phenomenon! (1/3)
@alex_damian_
Alex Damian
3 years
How does GD recover from spikes in the loss? A 🧵 on our new paper “Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability” with @EshaanNichani and @jasondeanlee https://t.co/qcnu01a0Au (1/7)
1
5
24
@alex_damian_
Alex Damian
3 years
We would also like to thank @deepcohen, @vfleaking, and @leichen1994 for helpful discussions throughout the course of this project. (7/7)
0
0
1
@alex_damian_
Alex Damian
3 years
We precisely characterize this self-stabilization mechanism and show it accurately predicts the EoS dynamics of GD in standard deep learning tasks: (6/7)
1
0
2
@alex_damian_
Alex Damian
3 years
A consequence of self-stabilization is that GD implicitly follows projected gradient descent on the constrained problem: minimize loss such that sharpness ≤ 2/lr. (5/7)
1
0
2
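The constrained problem described in the tweet above can be written out explicitly; a sketch in standard notation (the symbols S and η are the usual ones for sharpness and learning rate, not taken verbatim from the tweet):

```latex
\min_{\theta} \; L(\theta)
\quad \text{subject to} \quad
S(\theta) \le \frac{2}{\eta},
\qquad
S(\theta) := \lambda_{\max}\!\left(\nabla^2 L(\theta)\right),
```

where η is the learning rate, so the claim is that GD implicitly tracks the projected-gradient trajectory of this problem.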
@alex_damian_
Alex Damian
3 years
As the iterates diverge, the cubic term in the Taylor expansion of the gradient is precisely the gradient of the sharpness, which has the effect of reducing the sharpness until the dynamics are again stable. (4/7)
1
0
3
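The Taylor-expansion claim in the tweet above can be sketched as follows, assuming the top Hessian eigenvalue is simple; u denotes its unit eigenvector and x the displacement along it (notation introduced here for illustration):

```latex
\nabla L(\theta + x u) \;\approx\; \nabla L(\theta) + x\,\nabla^2 L(\theta)\,u
  + \frac{x^2}{2}\,\nabla^3 L(\theta)[u, u],
\qquad
\nabla S(\theta) \;=\; \nabla^3 L(\theta)[u, u],
```

so the cubic term pushes θ along the negative sharpness gradient with strength proportional to x², which is exactly the self-stabilization effect described.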
@alex_damian_
Alex Damian
3 years
Our paper explains the EoS dynamics through a new implicit bias of GD called self-stabilization, which prevents divergence by aggressively reducing the sharpness once the loss spikes. Self-stabilization results from the cubic term in the Taylor expansion of the gradient. (3/7)
1
0
2
@alex_damian_
Alex Damian
3 years
When the loss landscape becomes too sharp for the learning rate, the iterates diverge and the loss spikes. However, GD miraculously recovers from these spikes and continues to decrease the loss. https://t.co/PffO96peXT dubbed this regime the “Edge of Stability” (EoS). (2/7)
arxiv.org
We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum...
1
0
3
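The 2/η stability threshold behind these spikes can be seen already on a one-dimensional quadratic; a minimal illustrative sketch (the toy function and names here are mine, not from the paper):

```python
def gd_on_quadratic(sharpness, lr, steps=50, x0=1.0):
    """Run gradient descent on L(x) = sharpness * x^2 / 2.

    Each step multiplies x by (1 - lr * sharpness), so the iterates
    converge when sharpness < 2 / lr and diverge when sharpness > 2 / lr.
    """
    x = x0
    for _ in range(steps):
        x -= lr * sharpness * x  # gradient of L is sharpness * x
    return x

lr = 0.1  # stability cutoff is 2 / lr = 20
stable = gd_on_quadratic(sharpness=19.0, lr=lr)    # below the cutoff
unstable = gd_on_quadratic(sharpness=21.0, lr=lr)  # above the cutoff
print(abs(stable), abs(unstable))
```

On a fixed quadratic there is no self-stabilization, which is why crossing the cutoff simply diverges; the tweets' point is that on neural-network losses the sharpness itself reacts and brings the dynamics back.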
@alex_damian_
Alex Damian
3 years
How does GD recover from spikes in the loss? A 🧵 on our new paper “Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability” with @EshaanNichani and @jasondeanlee https://t.co/qcnu01a0Au (1/7)
1
14
107
@HochreiterSepp
Sepp Hochreiter
3 years
ArXiv https://t.co/iUhnNb4CmB: Analysis of SGD. Sharpness (the largest eigenvalue of the Hessian) steadily increases during training until the instability cutoff 2/η, then hovers around 2/η while the training loss still decreases. Reason: self-stabilization via the cubic term in the Taylor expansion.
arxiv.org
Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness $S(θ)$, is bounded by $2/η$, training is "stable" and the...
2
47
187
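The sharpness S(θ) = λ_max(∇²L(θ)) tracked in these experiments can be estimated without ever forming the Hessian; a hedged sketch using power iteration on finite-difference Hessian-vector products (all names are illustrative, not from either paper):

```python
import numpy as np

def sharpness(grad_fn, theta, iters=100, eps=1e-4, seed=0):
    """Estimate the largest Hessian eigenvalue (the "sharpness") at theta.

    Power iteration on finite-difference Hessian-vector products:
    H v ≈ (grad(theta + eps*v) - grad(theta - eps*v)) / (2*eps).
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)
        v = hv / np.linalg.norm(hv)
    hv = (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)
    return float(v @ hv)  # Rayleigh quotient approximates the top eigenvalue

# Toy check: for L(theta) = 0.5 * theta^T A theta, the Hessian is A.
A = np.diag([3.0, 1.0, 0.5])
grad = lambda t: A @ t
S = sharpness(grad, np.zeros(3))
lr = 0.1
print(S, "unstable" if S > 2 / lr else "stable")
```

Monitoring this quantity against 2/η during training is how the edge-of-stability regime described in these tweets is typically detected.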