Konstantin Mishchenko

@konstmish

Followers
7K
Following
4K
Media
169
Statuses
691

Research Scientist @AIatMeta · Previously Researcher @ Samsung AI · Outstanding Paper Award @icmlconf 2023 · Action Editor @TmlrOrg · I tweet about ML papers and math

Paris, France
Joined June 2020
@konstmish
Konstantin Mishchenko
9 months
A student reached out asking for advice on research directions in optimization, so I wrote a long response with pointers to interesting papers. I thought it'd be worth sharing it here too: 1. Adaptive optimization. There has been a lot going on in the last year, below are some…
13
140
902
@konstmish
Konstantin Mishchenko
1 day
0
1
18
@konstmish
Konstantin Mishchenko
1 day
You may think of optimization and linear algebra as mature fields, but consider this: until 2008, the convergence rate of SGD applied to linear systems hadn't been established. I'm sure there are many other discoveries waiting to be made, things we'd later be surprised weren't already known.
Tweet media one
6
29
319
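For context (not from the tweet itself): SGD on a consistent linear system with step size 1/||a_i||^2 is the randomized Kaczmarz method, which is the setting usually associated with the 2008 rate. A minimal numpy sketch on synthetic data, sampling rows uniformly rather than in proportion to their norms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Consistent linear system A x* = b (synthetic toy data).
n, d = 200, 20
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star

# SGD on f(x) = (1/2n) * sum_i (a_i^T x - b_i)^2, one row at a time.
# With step size 1/||a_i||^2 each update projects the iterate onto the
# hyperplane {x : a_i^T x = b_i}, i.e., randomized Kaczmarz.
x = np.zeros(d)
for _ in range(5000):
    i = rng.integers(n)
    a_i, b_i = A[i], b[i]
    x -= (a_i @ x - b_i) / (a_i @ a_i) * a_i   # stochastic gradient step

print("distance to solution:", np.linalg.norm(x - x_star))
```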
@konstmish
Konstantin Mishchenko
2 days
It's not just my opinion.
Tweet media one
0
0
22
@konstmish
Konstantin Mishchenko
2 days
A gentle reminder that TMLR is a great journal that allows you to submit your papers when they are ready rather than rushing to meet conference deadlines. The review process is fast, there are no artificial acceptance rates, and you have more space to present your ideas in the…
13
22
269
@konstmish
Konstantin Mishchenko
2 days
RT @yenhuan_li: @konstmish @chungentsai It is perhaps helpful to recall Nesterov’s opinion:
Tweet media one
0
4
0
@konstmish
Konstantin Mishchenko
3 days
Modular duality:
Tweet media one
1
0
9
@konstmish
Konstantin Mishchenko
3 days
AdamC:
@aaron_defazio
Aaron Defazio
29 days
Why do gradients increase near the end of training? Read the paper to find out! We also propose a simple fix to AdamW that keeps gradient norms better behaved throughout training.
Tweet media one
1
0
11
@konstmish
Konstantin Mishchenko
3 days
One thing that feels strange about AdamW is that it treats all network parameters identically - norm layers, attention, and dense layers all get the same update rule. Classical optimization, in contrast, uses tricks such as mirror descent with a tailored mirror map to significantly…
5
11
161
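For readers who haven't seen the contrast being drawn here: a toy sketch of mirror descent with a mirror map tailored to the geometry, namely negative entropy on the probability simplex, which yields the exponentiated-gradient (multiplicative) update. The objective and step size are made up for illustration and have nothing to do with neural network layers:

```python
import numpy as np

# Minimize a linear loss c @ x over the probability simplex with mirror
# descent. The negative-entropy mirror map makes the update multiplicative
# and keeps iterates on the simplex automatically, no projection needed.
rng = np.random.default_rng(0)
d = 10
c = rng.standard_normal(d)        # toy linear objective
x = np.full(d, 1.0 / d)           # start at the uniform distribution
eta = 0.1

for _ in range(200):
    grad = c                      # gradient of c @ x
    x = x * np.exp(-eta * grad)   # mirror step in the dual (log) space
    x /= x.sum()                  # normalize back to the simplex

print("mass on the lowest-cost coordinate:", x[np.argmin(c)])
```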
@konstmish
Konstantin Mishchenko
4 days
0
2
15
@konstmish
Konstantin Mishchenko
4 days
Several research groups have released papers on the convergence of Muon, mostly looking at it either as a Frank-Wolfe method with momentum or as a trust-region procedure. I find this one particularly easy to read.
Tweet media one
5
21
253
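A rough sketch of the Frank-Wolfe reading mentioned above: the orthogonalized momentum used by Muon-style methods is exactly the linear minimization oracle over the spectral-norm ball. The SVD, the toy quadratic objective, and the hyperparameters below are illustrative stand-ins; practical implementations approximate the orthogonalization with Newton-Schulz iterations and add further scaling:

```python
import numpy as np

def lmo_spectral_ball(M, radius=1.0):
    """Linear minimization oracle over the spectral-norm ball:
    the minimizer of trace(D.T @ M) over ||D||_2 <= radius is
    -radius * U @ Vt, where M = U @ diag(s) @ Vt is the thin SVD of M."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return -radius * U @ Vt

# Muon-like step for a single weight matrix: keep a momentum buffer and
# move along its orthogonalized version, which is the LMO direction above.
rng = np.random.default_rng(0)
W_target = rng.standard_normal((64, 32)) * 0.1     # toy objective: match W_target
W = np.zeros((64, 32))
momentum = np.zeros_like(W)
beta, lr = 0.9, 0.02

for _ in range(200):
    grad = W - W_target                    # gradient of 0.5*||W - W_target||_F^2
    momentum = beta * momentum + grad
    W += lr * lmo_spectral_ball(momentum)  # descend along orthogonalized momentum

# The fixed step length means we settle near, not exactly at, the optimum.
print("final loss:", 0.5 * np.linalg.norm(W - W_target) ** 2)
```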
@konstmish
Konstantin Mishchenko
4 days
I believe successful neural network training represents cases of "near convexity": the optimization landscape, while technically non-convex, behaves enough like a convex problem that standard convex optimization is often applicable. At the same time, *in general* neural nets…
@Shalev_lif
Shalev Lifshitz
6 days
The neural network objective function is a very complicated objective function. It's very non-convex, and there are no mathematical guarantees whatsoever about its success. And so if you were to speak to somebody who studies optimization from a theoretical point of view, they…
Tweet media one
20
63
681
@konstmish
Konstantin Mishchenko
14 days
To solve variational inequalities and nonlinear equations, the implicit update is used because the analogue of gradient descent may diverge even under monotonicity (the analogue of convexity for operators). A good source on this is Malitsky & Tam: 5/5
Tweet media one
1
0
12
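A tiny numerical illustration of the divergence mentioned in the tweet, using the standard skew-symmetric (rotation) operator rather than anything from the referenced paper: the explicit step diverges for every step size, while the implicit (proximal-point) step contracts:

```python
import numpy as np

# Monotone operator F(x) = A x with skew-symmetric A (the saddle problem
# min_x max_y x*y written as an operator). Monotonicity holds, yet the
# forward/explicit step blows up while the implicit step converges.
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
gamma = 0.1
I = np.eye(2)

x_fwd = np.array([1.0, 0.0])
x_imp = np.array([1.0, 0.0])
for _ in range(200):
    x_fwd = x_fwd - gamma * A @ x_fwd              # explicit: x+ = x - g F(x)
    x_imp = np.linalg.solve(I + gamma * A, x_imp)  # implicit: x+ = x - g F(x+)

print("explicit iterate norm:", np.linalg.norm(x_fwd))  # grows like (1+g^2)^(k/2)
print("implicit iterate norm:", np.linalg.norm(x_imp))  # shrinks like (1+g^2)^(-k/2)
```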
@konstmish
Konstantin Mishchenko
14 days
The proximal operator is connected to implicit gradients. This is used in meta-learning; for instance, see the iMAML method: 4/
Tweet media one
1
1
11
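A quick numeric check of what "implicit gradient" means here: the prox output p satisfies p = v - g*grad f(p), with the gradient evaluated at the output rather than at the input. The quadratic below is just a convenient f with a closed-form prox, not the iMAML setup:

```python
import numpy as np

# For f(x) = 0.5 * x^T Q x, the prox is prox_{g f}(v) = (I + g Q)^{-1} v.
# Check that it solves the implicit equation p = v - g * grad f(p).
rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
Q = M @ M.T + np.eye(d)           # symmetric positive definite
v = rng.standard_normal(d)
g = 0.3

p = np.linalg.solve(np.eye(d) + g * Q, v)        # prox_{g f}(v)
grad_at_output = Q @ p
print(np.allclose(p, v - g * grad_at_output))    # True: a backward/implicit step
```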
@konstmish
Konstantin Mishchenko
14 days
Stochastic prox is also of interest in federated learning, where each worker minimizes a regularized objective. I think this paper by @akhaledv2 and @chijinML is the best source to learn more: 3/
Tweet media one
1
0
11
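A bare-bones sketch of the "each worker minimizes a regularized objective" step: every worker computes a prox of its local loss around the current server point and the server averages the results. The quadratic local losses are chosen only so the prox has a closed form; this is a generic illustration, not a specific method from the cited paper:

```python
import numpy as np

# Worker i solves  min_x  f_i(x) + (1/(2g)) * ||x - x_server||^2,
# i.e., computes prox_{g f_i}(x_server). With f_i(x) = 0.5*||x - c_i||^2
# the prox is (x_server + g*c_i) / (1 + g).
rng = np.random.default_rng(0)
d, n_workers = 10, 8
centers = rng.standard_normal((n_workers, d))     # each worker's local optimum c_i
x_server = np.zeros(d)
g = 1.0

for _ in range(50):
    local = (x_server + g * centers) / (1.0 + g)  # stochastic prox at every worker
    x_server = local.mean(axis=0)                 # server aggregates the prox points

print("distance to the minimizer of the average loss:",
      np.linalg.norm(x_server - centers.mean(axis=0)))
```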
@konstmish
Konstantin Mishchenko
14 days
Prox is fundamental for distributed optimization. It can be used to encode the constraint between multiple workers, which is known as the "product trick". For instance, we used it this way in our ProxSkip paper. 2/
Tweet media one
2
0
8
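A minimal sketch of the product trick itself: duplicate the variable across workers and express agreement through a consensus set, whose projection (the prox of its indicator function) is simply averaging the copies. The numbers are synthetic, and this is the generic construction rather than the ProxSkip algorithm:

```python
import numpy as np

# Product trick: work with x = (x_1, ..., x_n), one copy per worker, and
# encode agreement via the consensus set C = {x : x_1 = ... = x_n}.
# Projecting onto C (the prox of its indicator) replaces every copy by
# the average, which is what the server computes.
rng = np.random.default_rng(0)
n, d = 4, 3
X = rng.standard_normal((n, d))            # one row per worker's copy

X_proj = np.tile(X.mean(axis=0), (n, 1))   # projection onto the consensus set

# Sanity check: averaging is at least as close to X as copying any single
# worker's point to everyone.
print("averaging:", np.linalg.norm(X - X_proj))
for i in range(n):
    print(f"copy worker {i}:", np.linalg.norm(X - np.tile(X[i], (n, 1))))
```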
@konstmish
Konstantin Mishchenko
14 days
You'll often encounter the proximal operator in optimization papers. Prox is a fundamental tool that arises in distributed, federated, and stochastic optimization, nonlinear equations, and meta-learning. For a solid introduction, start with Parikh & Boyd: 1/
Tweet media one
7
30
299
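For anyone meeting the prox for the first time, its definition and the standard opening example, soft-thresholding (the prox of the l1 norm), which is covered early in Parikh & Boyd:

```python
import numpy as np

# prox_g(v) = argmin_x  g(x) + 0.5 * ||x - v||^2.
# For g(x) = lam * ||x||_1 this is soft-thresholding, the building block
# of ISTA/FISTA for the lasso.
def prox_l1(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([3.0, -0.2, 0.5, -1.5])
print(prox_l1(v, lam=0.4))   # [ 2.6 -0.   0.1 -1.1]
```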
@konstmish
Konstantin Mishchenko
15 days
I find it particularly nice that we can reproduce the phenomenon using a simple *linear* problem. This immediately rules out nonconvexity as a leading factor.
Tweet media one
2
1
17
@konstmish
Konstantin Mishchenko
15 days
0
3
12
@konstmish
Konstantin Mishchenko
15 days
There are several hypotheses for why Adam outperforms SGD on LLMs: heavy-tailed noise, blow-ups in curvature, near-constant update magnitude, etc. The one I find most compelling is label imbalance: Adam specifically improves performance on rare classes, of which there are many.
Tweet media one
12
36
304
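A toy probe of the label-imbalance hypothesis, not a replication of the cited result: an imbalanced multiclass logistic regression trained with hand-rolled SGD and Adam, reporting the loss on rare classes. The data generator, step sizes, and class split are arbitrary choices, and the outcome depends on them:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, C = 4000, 20, 50
class_probs = 1.0 / np.arange(1, C + 1)
class_probs /= class_probs.sum()                 # Zipf-like label frequencies
y = rng.choice(C, size=n, p=class_probs)
means = rng.standard_normal((C, d))
X = means[y] + 0.5 * rng.standard_normal((n, d))

def minibatch_grad(W, idx):
    logits = X[idx] @ W
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits); P /= P.sum(axis=1, keepdims=True)
    P[np.arange(len(idx)), y[idx]] -= 1.0        # softmax cross-entropy gradient
    return X[idx].T @ P / len(idx)

def train(use_adam, lr, steps=2000, batch=64):
    W = np.zeros((d, C))
    m, v = np.zeros_like(W), np.zeros_like(W)
    for t in range(1, steps + 1):
        g = minibatch_grad(W, rng.integers(n, size=batch))
        if use_adam:
            m = 0.9 * m + 0.1 * g
            v = 0.999 * v + 0.001 * g * g
            m_hat, v_hat = m / (1 - 0.9 ** t), v / (1 - 0.999 ** t)
            W -= lr * m_hat / (np.sqrt(v_hat) + 1e-8)
        else:
            W -= lr * g
    return W

def rare_class_loss(W, rare=np.arange(25, 50)):
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    mask = np.isin(y, rare)
    return -logp[mask, y[mask]].mean()

print("SGD  rare-class loss:", rare_class_loss(train(use_adam=False, lr=0.5)))
print("Adam rare-class loss:", rare_class_loss(train(use_adam=True, lr=0.01)))
```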
@konstmish
Konstantin Mishchenko
16 days
RT @wazizian: ❓ How long does SGD take to reach the global minimum on non-convex functions? With @FranckIutzeler, J. Malick, P. Mertikopou…
0
71
0