Konstantin Mishchenko

@konstmish

Followers
7K
Following
4K
Media
169
Statuses
691

Research Scientist @AIatMeta · Previously Researcher @ Samsung AI · Outstanding Paper Award @icmlconf 2023 · Action Editor @TmlrOrg · I tweet about ML papers and math

Paris, France
Joined June 2020
@konstmish
Konstantin Mishchenko
9 months
A student reached out asking for advice on research directions in optimization, so I wrote a long response with pointers to interesting papers. I thought it'd be worth sharing it here too: 1. Adaptive optimization. There has been a lot going on in the last year, below are some…
13
140
902
@konstmish
Konstantin Mishchenko
1 day
0
1
18
@konstmish
Konstantin Mishchenko
1 day
You may think of optimization and linear algebra as mature fields, but consider this: until 2008, the convergence rate of SGD applied to linear systems hadn't been established. I'm sure there are many other discoveries waiting to be made, things we'd later be surprised weren't already known.
Tweet media one
6
29
319
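For context (not from the tweet itself): SGD on a consistent linear system with step size 1/||a_i||^2 is the randomized Kaczmarz method, which is the setting usually associated with the 2008 rate. A minimal numpy sketch on synthetic data, sampling rows uniformly rather than in proportion to their norms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Consistent linear system A x* = b (synthetic toy data).
n, d = 200, 20
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star

# SGD on f(x) = (1/2n) * sum_i (a_i^T x - b_i)^2, one row at a time.
# With step size 1/||a_i||^2 each update projects the iterate onto the
# hyperplane {x : a_i^T x = b_i}, i.e., randomized Kaczmarz.
x = np.zeros(d)
for _ in range(5000):
    i = rng.integers(n)
    a_i, b_i = A[i], b[i]
    x -= (a_i @ x - b_i) / (a_i @ a_i) * a_i   # stochastic gradient step

print("distance to solution:", np.linalg.norm(x - x_star))
```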
@konstmish
Konstantin Mishchenko
2 days
It's not just my opinion.
Tweet media one
0
0
22
@konstmish
Konstantin Mishchenko
2 days
A gentle reminder that TMLR is a great journal that allows you to submit your papers when they are ready rather than rushing to meet conference deadlines. The review process is fast, there are no artificial acceptance rates, and you have more space to present your ideas in the…
13
22
269
@konstmish
Konstantin Mishchenko
2 days
RT @yenhuan_li: @konstmish @chungentsai It is perhaps helpful to recall Nesterov’s opinion:
Tweet media one
0
4
0
@konstmish
Konstantin Mishchenko
3 days
Modular duality:
Tweet media one
1
0
9
@konstmish
Konstantin Mishchenko
3 days
AdamC:
@aaron_defazio
Aaron Defazio
29 days
Why do gradients increase near the end of training? Read the paper to find out! We also propose a simple fix to AdamW that keeps gradient norms better behaved throughout training.
Tweet media one
1
0
11
@konstmish
Konstantin Mishchenko
3 days
One thing that feels strange about AdamW is that it treats all network parameters identically - norm layers, attention, and dense layers all get the same update rule. Classical optimization, in contrast, uses tricks such as mirror descent with a tailored mirror map to significantly…
5
11
161
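For readers who haven't seen the contrast being drawn here: a toy sketch of mirror descent with a mirror map tailored to the geometry, namely negative entropy on the probability simplex, which yields the exponentiated-gradient (multiplicative) update. The objective and step size are made up for illustration and have nothing to do with neural network layers:

```python
import numpy as np

# Minimize a linear loss c @ x over the probability simplex with mirror
# descent. The negative-entropy mirror map makes the update multiplicative
# and keeps iterates on the simplex automatically, no projection needed.
rng = np.random.default_rng(0)
d = 10
c = rng.standard_normal(d)        # toy linear objective
x = np.full(d, 1.0 / d)           # start at the uniform distribution
eta = 0.1

for _ in range(200):
    grad = c                      # gradient of c @ x
    x = x * np.exp(-eta * grad)   # mirror step in the dual (log) space
    x /= x.sum()                  # normalize back to the simplex

print("mass on the lowest-cost coordinate:", x[np.argmin(c)])
```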
@konstmish
Konstantin Mishchenko
4 days
0
2
15
@konstmish
Konstantin Mishchenko
4 days
Several research groups have released papers on the convergence of Muon, mostly looking at it either as a Frank-Wolfe method with momentum or as a trust-region procedure. I find this one particularly easy to read.
Tweet media one
5
21
253
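A rough sketch of the Frank-Wolfe reading mentioned above: the orthogonalized momentum used by Muon-style methods is exactly the linear minimization oracle over the spectral-norm ball. The SVD, the toy quadratic objective, and the hyperparameters below are illustrative stand-ins; practical implementations approximate the orthogonalization with Newton-Schulz iterations and add further scaling:

```python
import numpy as np

def lmo_spectral_ball(M, radius=1.0):
    """Linear minimization oracle over the spectral-norm ball:
    the minimizer of trace(D.T @ M) over ||D||_2 <= radius is
    -radius * U @ Vt, where M = U @ diag(s) @ Vt is the thin SVD of M."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return -radius * U @ Vt

# Muon-like step for a single weight matrix: keep a momentum buffer and
# move along its orthogonalized version, which is the LMO direction above.
rng = np.random.default_rng(0)
W_target = rng.standard_normal((64, 32)) * 0.1     # toy objective: match W_target
W = np.zeros((64, 32))
momentum = np.zeros_like(W)
beta, lr = 0.9, 0.02

for _ in range(200):
    grad = W - W_target                    # gradient of 0.5*||W - W_target||_F^2
    momentum = beta * momentum + grad
    W += lr * lmo_spectral_ball(momentum)  # descend along orthogonalized momentum

# The fixed step length means we settle near, not exactly at, the optimum.
print("final loss:", 0.5 * np.linalg.norm(W - W_target) ** 2)
```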
@konstmish
Konstantin Mishchenko
4 days
I believe successful neural network training represents cases of "near convexity": the optimization landscape, while technically non-convex, behaves enough like a convex problem that standard convex optimization is often applicable. At the same time, *in general* neural nets…
@Shalev_lif
Shalev Lifshitz
6 days
The neural network objective function is a very complicated objective function. It's very non-convex, and there are no mathematical guarantees whatsoever about its success. And so if you were to speak to somebody who studies optimization from a theoretical point of view, they…
Tweet media one
20
63
681
@konstmish
Konstantin Mishchenko
14 days
To solve variational inequalities and nonlinear equations, the implicit update is used because the analogue of gradient descent may diverge even under monotonicity (the analogue of convexity for operators). A good source on this is Malitsky & Tam: 5/5
Tweet media one
1
0
12
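A tiny numerical illustration of the divergence mentioned in the tweet, using the standard skew-symmetric (rotation) operator rather than anything from the referenced paper: the explicit step diverges for every step size, while the implicit (proximal-point) step contracts:

```python
import numpy as np

# Monotone operator F(x) = A x with skew-symmetric A (the saddle problem
# min_x max_y x*y written as an operator). Monotonicity holds, yet the
# forward/explicit step blows up while the implicit step converges.
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
gamma = 0.1
I = np.eye(2)

x_fwd = np.array([1.0, 0.0])
x_imp = np.array([1.0, 0.0])
for _ in range(200):
    x_fwd = x_fwd - gamma * A @ x_fwd              # explicit: x+ = x - g F(x)
    x_imp = np.linalg.solve(I + gamma * A, x_imp)  # implicit: x+ = x - g F(x+)

print("explicit iterate norm:", np.linalg.norm(x_fwd))  # grows like (1+g^2)^(k/2)
print("implicit iterate norm:", np.linalg.norm(x_imp))  # shrinks like (1+g^2)^(-k/2)
```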
@konstmish
Konstantin Mishchenko
14 days
The proximal operator is connected to implicit gradients. This is used in meta-learning; for instance, see the iMAML method: 4/
Tweet media one
1
1
11
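A quick numeric check of what "implicit gradient" means here: the prox output p satisfies p = v - g*grad f(p), with the gradient evaluated at the output rather than at the input. The quadratic below is just a convenient f with a closed-form prox, not the iMAML setup:

```python
import numpy as np

# For f(x) = 0.5 * x^T Q x, the prox is prox_{g f}(v) = (I + g Q)^{-1} v.
# Check that it solves the implicit equation p = v - g * grad f(p).
rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
Q = M @ M.T + np.eye(d)           # symmetric positive definite
v = rng.standard_normal(d)
g = 0.3

p = np.linalg.solve(np.eye(d) + g * Q, v)        # prox_{g f}(v)
grad_at_output = Q @ p
print(np.allclose(p, v - g * grad_at_output))    # True: a backward/implicit step
```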
@konstmish
Konstantin Mishchenko
14 days
Stochastic prox is also of interest in federated learning, where each worker minimizes a regularized objective. I think this paper by @akhaledv2 and @chijinML is the best source to learn more: 3/
Tweet media one
1
0
11
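A bare-bones sketch of the "each worker minimizes a regularized objective" step: every worker computes a prox of its local loss around the current server point and the server averages the results. The quadratic local losses are chosen only so the prox has a closed form; this is a generic illustration, not a specific method from the cited paper:

```python
import numpy as np

# Worker i solves  min_x  f_i(x) + (1/(2g)) * ||x - x_server||^2,
# i.e., computes prox_{g f_i}(x_server). With f_i(x) = 0.5*||x - c_i||^2
# the prox is (x_server + g*c_i) / (1 + g).
rng = np.random.default_rng(0)
d, n_workers = 10, 8
centers = rng.standard_normal((n_workers, d))     # each worker's local optimum c_i
x_server = np.zeros(d)
g = 1.0

for _ in range(50):
    local = (x_server + g * centers) / (1.0 + g)  # stochastic prox at every worker
    x_server = local.mean(axis=0)                 # server aggregates the prox points

print("distance to the minimizer of the average loss:",
      np.linalg.norm(x_server - centers.mean(axis=0)))
```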
@konstmish
Konstantin Mishchenko
14 days
Prox is fundamental for distributed optimization. It can be used to encode the constraint between multiple workers, which is known as the "product trick". For instance, we used it this way in our ProxSkip paper. 2/
Tweet media one
2
0
8
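A minimal sketch of the product trick itself: duplicate the variable across workers and express agreement through a consensus set, whose projection (the prox of its indicator function) is simply averaging the copies. The numbers are synthetic, and this is the generic construction rather than the ProxSkip algorithm:

```python
import numpy as np

# Product trick: work with x = (x_1, ..., x_n), one copy per worker, and
# encode agreement via the consensus set C = {x : x_1 = ... = x_n}.
# Projecting onto C (the prox of its indicator) replaces every copy by
# the average, which is what the server computes.
rng = np.random.default_rng(0)
n, d = 4, 3
X = rng.standard_normal((n, d))            # one row per worker's copy

X_proj = np.tile(X.mean(axis=0), (n, 1))   # projection onto the consensus set

# Sanity check: averaging is at least as close to X as copying any single
# worker's point to everyone.
print("averaging:", np.linalg.norm(X - X_proj))
for i in range(n):
    print(f"copy worker {i}:", np.linalg.norm(X - np.tile(X[i], (n, 1))))
```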
@konstmish
Konstantin Mishchenko
14 days
You'll often encounter the proximal operator in optimization papers. Prox is a fundamental tool that arises in distributed, federated, and stochastic optimization, nonlinear equations, and meta-learning. For a solid introduction, start with Parikh & Boyd: 1/
Tweet media one
7
30
299
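For anyone meeting the prox for the first time, its definition and the standard opening example, soft-thresholding (the prox of the l1 norm), which is covered early in Parikh & Boyd:

```python
import numpy as np

# prox_g(v) = argmin_x  g(x) + 0.5 * ||x - v||^2.
# For g(x) = lam * ||x||_1 this is soft-thresholding, the building block
# of ISTA/FISTA for the lasso.
def prox_l1(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([3.0, -0.2, 0.5, -1.5])
print(prox_l1(v, lam=0.4))   # [ 2.6 -0.   0.1 -1.1]
```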
@konstmish
Konstantin Mishchenko
15 days
I find it particularly nice that we can reproduce the phenomenon using a simple *linear* problem. This immediately rules out nonconvexity as a leading factor.
Tweet media one
2
1
17
@konstmish
Konstantin Mishchenko
15 days
0
3
12
@konstmish
Konstantin Mishchenko
15 days
There are several hypotheses for why Adam outperforms SGD on LLMs: heavy-tailed noise, blow-ups in curvature, near-constant update magnitude, etc. The one I find most compelling is label imbalance: Adam specifically improves performance on rare classes, of which there are many.
Tweet media one
12
36
304
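A toy probe of the label-imbalance hypothesis, not a replication of the cited result: an imbalanced multiclass logistic regression trained with hand-rolled SGD and Adam, reporting the loss on rare classes. The data generator, step sizes, and class split are arbitrary choices, and the outcome depends on them:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, C = 4000, 20, 50
class_probs = 1.0 / np.arange(1, C + 1)
class_probs /= class_probs.sum()                 # Zipf-like label frequencies
y = rng.choice(C, size=n, p=class_probs)
means = rng.standard_normal((C, d))
X = means[y] + 0.5 * rng.standard_normal((n, d))

def minibatch_grad(W, idx):
    logits = X[idx] @ W
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits); P /= P.sum(axis=1, keepdims=True)
    P[np.arange(len(idx)), y[idx]] -= 1.0        # softmax cross-entropy gradient
    return X[idx].T @ P / len(idx)

def train(use_adam, lr, steps=2000, batch=64):
    W = np.zeros((d, C))
    m, v = np.zeros_like(W), np.zeros_like(W)
    for t in range(1, steps + 1):
        g = minibatch_grad(W, rng.integers(n, size=batch))
        if use_adam:
            m = 0.9 * m + 0.1 * g
            v = 0.999 * v + 0.001 * g * g
            m_hat, v_hat = m / (1 - 0.9 ** t), v / (1 - 0.999 ** t)
            W -= lr * m_hat / (np.sqrt(v_hat) + 1e-8)
        else:
            W -= lr * g
    return W

def rare_class_loss(W, rare=np.arange(25, 50)):
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    mask = np.isin(y, rare)
    return -logp[mask, y[mask]].mean()

print("SGD  rare-class loss:", rare_class_loss(train(use_adam=False, lr=0.5)))
print("Adam rare-class loss:", rare_class_loss(train(use_adam=True, lr=0.01)))
```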
@konstmish
Konstantin Mishchenko
16 days
RT @wazizian: ❓ How long does SGD take to reach the global minimum on non-convex functions? With @FranckIutzeler, J. Malick, P. Mertikopou…
0
71
0