Alex Damian Profile
Alex Damian

@alex_damian_

Followers: 590 · Following: 106 · Media: 2 · Statuses: 15

Research Fellow @KempnerInst | Incoming Asst Prof @MIT Math+EECS (Fall 2026) | Deep Learning Theory

Cambridge, MA
Joined March 2021
@jyo_pari
Jyo Pari
1 month
Why do deep learning optimizers make progress even in the edge-of-stability regime? 🤔 @alex_damian_ will present theory that can describe the dynamics of optimization in this regime! 🗓️ Nov 17, 3pm ET @scaleml
0
10
72
@deepcohen
Jeremy Cohen
3 months
Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.* With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.
20
213
1K
@EshaanNichani
Eshaan Nichani
2 years
Causal self-attention encodes causal structure between tokens (eg. induction head, learning function class in-context, n-grams). But how do transformers learn this causal structure via gradient descent? New paper with @alex_damian_ @jasondeanlee! https://t.co/FmT8Eo5SH3 (1/10)
6
98
430
@fly51fly
fly51fly
2 years
[LG] How Transformers Learn Causal Structure with Gradient Descent E Nichani, A Damian, J D. Lee [Princeton University] (2024) https://t.co/JgjcSi4AAY - The paper studies how transformers learn causal structure through gradient descent when trained on a novel in-context learning …
2
46
169
@M3LWorkshop
M3L Workshop @ NeurIPS 2024
2 years
Hope everyone had a great time at M3L today! Many thanks to the speakers, authors, reviewers, participants, and volunteers for all your contributions that made this workshop fun and successful. We hope to see you again next year! 😃✨
0
5
33
@zhiyuanli_
Zhiyuan Li
2 years
🚨💡We are organizing a workshop on Mathematics of Modern Machine Learning (M3L) at #NeurIPS2023! 🚀Join us if you are interested in exploring theories for understanding and advancing modern ML practice. https://t.co/iepaZUPlct Submission ddl: October 2, 2023 @M3LWorkshop
1
24
148
@EshaanNichani
Eshaan Nichani
3 years
New paper with @alex_damian_ and @jasondeanlee! We identify a new implicit bias of GD: Self-Stabilization. When the loss is too sharp and iterates begin to diverge, self-stabilization decreases sharpness until GD is stable. This explains the “Edge of Stability” phenomenon! (1/3)
@alex_damian_
Alex Damian
3 years
How does GD recover from spikes in the loss? A 🧵 on our new paper “Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability” with @EshaanNichani and @jasondeanlee https://t.co/qcnu01a0Au (1/7)
1
5
24
@alex_damian_
Alex Damian
3 years
We would also like to thank @deepcohen, @vfleaking, and @leichen1994 for helpful discussions throughout the course of this project. (7/7)
0
0
1
@alex_damian_
Alex Damian
3 years
We precisely characterize this self-stabilization mechanism and show it accurately predicts the EoS dynamics of GD in standard deep learning tasks: (6/7)
1
0
2
@alex_damian_
Alex Damian
3 years
A consequence of self-stabilization is that GD implicitly follows projected gradient descent on the constrained problem: minimize loss such that sharpness ≤ 2/lr. (5/7)
1
0
2
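The constrained problem described in the tweet above can be written out explicitly; a sketch in standard notation (the symbols S and η are the usual ones for sharpness and learning rate, not taken verbatim from the tweet):

```latex
\min_{\theta} \; L(\theta)
\quad \text{subject to} \quad
S(\theta) \le \frac{2}{\eta},
\qquad
S(\theta) := \lambda_{\max}\!\left(\nabla^2 L(\theta)\right),
```

where η is the learning rate, so the claim is that GD implicitly tracks the projected-gradient trajectory of this problem.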
@alex_damian_
Alex Damian
3 years
As the iterates diverge, the cubic term in the Taylor expansion of the gradient is precisely the gradient of the sharpness, which has the effect of reducing the sharpness until the dynamics are again stable. (4/7)
1
0
3
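The Taylor-expansion claim in the tweet above can be sketched as follows, assuming the top Hessian eigenvalue is simple; u denotes its unit eigenvector and x the displacement along it (notation introduced here for illustration):

```latex
\nabla L(\theta + x u) \;\approx\; \nabla L(\theta) + x\,\nabla^2 L(\theta)\,u
  + \frac{x^2}{2}\,\nabla^3 L(\theta)[u, u],
\qquad
\nabla S(\theta) \;=\; \nabla^3 L(\theta)[u, u],
```

so the cubic term pushes θ along the negative sharpness gradient with strength proportional to x², which is exactly the self-stabilization effect described.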
@alex_damian_
Alex Damian
3 years
Our paper explains the EoS dynamics through a new implicit bias of GD called self-stabilization, which prevents divergence by aggressively reducing the sharpness once the loss spikes. Self-stabilization results from the cubic term in the Taylor expansion of the gradient. (3/7)
1
0
2
@alex_damian_
Alex Damian
3 years
When the loss landscape becomes too sharp for the learning rate, the iterates diverge and the loss spikes. However, GD miraculously recovers from these spikes and continues to decrease the loss. https://t.co/PffO96peXT dubbed this regime the “Edge of Stability” (EoS). (2/7)
arxiv.org
We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum...
1
0
3
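The 2/η stability threshold behind these spikes can be seen already on a one-dimensional quadratic; a minimal illustrative sketch (the toy function and names here are mine, not from the paper):

```python
def gd_on_quadratic(sharpness, lr, steps=50, x0=1.0):
    """Run gradient descent on L(x) = sharpness * x^2 / 2.

    Each step multiplies x by (1 - lr * sharpness), so the iterates
    converge when sharpness < 2 / lr and diverge when sharpness > 2 / lr.
    """
    x = x0
    for _ in range(steps):
        x -= lr * sharpness * x  # gradient of L is sharpness * x
    return x

lr = 0.1  # stability cutoff is 2 / lr = 20
stable = gd_on_quadratic(sharpness=19.0, lr=lr)    # below the cutoff
unstable = gd_on_quadratic(sharpness=21.0, lr=lr)  # above the cutoff
print(abs(stable), abs(unstable))
```

On a fixed quadratic there is no self-stabilization, which is why crossing the cutoff simply diverges; the tweets' point is that on neural-network losses the sharpness itself reacts and brings the dynamics back.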
@alex_damian_
Alex Damian
3 years
How does GD recover from spikes in the loss? A 🧵 on our new paper “Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability” with @EshaanNichani and @jasondeanlee https://t.co/qcnu01a0Au (1/7)
1
14
107
@HochreiterSepp
Sepp Hochreiter
3 years
ArXiv https://t.co/iUhnNb4CmB: Analysis of SGD. Sharpness (the largest eigenvalue of the Hessian) steadily increases during training until the instability cutoff 2/η, then hovers around 2/η while the training loss still decreases. Reason: self-stabilization via the cubic term in the Taylor expansion.
arxiv.org
Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness $S(θ)$, is bounded by $2/η$, training is "stable" and the...
2
47
187
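The sharpness S(θ) = λ_max(∇²L(θ)) tracked in these experiments can be estimated without ever forming the Hessian; a hedged sketch using power iteration on finite-difference Hessian-vector products (all names are illustrative, not from either paper):

```python
import numpy as np

def sharpness(grad_fn, theta, iters=100, eps=1e-4, seed=0):
    """Estimate the largest Hessian eigenvalue (the "sharpness") at theta.

    Power iteration on finite-difference Hessian-vector products:
    H v ≈ (grad(theta + eps*v) - grad(theta - eps*v)) / (2*eps).
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)
        v = hv / np.linalg.norm(hv)
    hv = (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)
    return float(v @ hv)  # Rayleigh quotient approximates the top eigenvalue

# Toy check: for L(theta) = 0.5 * theta^T A theta, the Hessian is A.
A = np.diag([3.0, 1.0, 0.5])
grad = lambda t: A @ t
S = sharpness(grad, np.zeros(3))
lr = 0.1
print(S, "unstable" if S > 2 / lr else "stable")
```

Monitoring this quantity against 2/η during training is how the edge-of-stability regime described in these tweets is typically detected.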