Avrajit Ghosh
@GhoshAvrajit
Followers
307
Following
10K
Media
25
Statuses
773
Postdoc @SimonsInstitute @berkeley_ai. Generalization, optimization, inverse problems. PhD @MSU_EGR. (No prior) better than (wrong priors).
Berkeley, CA
Joined February 2020
Second-order methods and preconditioner-based methods are **NOT** the same. Please stop using them interchangeably!
6
11
129
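A minimal toy sketch of the distinction drawn above (an illustration, not code from the thread): a Newton step inverts the exact, current Hessian, while a preconditioned gradient step applies a separately chosen matrix (here a fixed diagonal) that need not match the Hessian and still needs a step size.

```python
# Toy quadratic f(x) = 0.5 x^T A x - b^T x, so the Hessian is exactly A.
import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])      # symmetric positive definite Hessian
b = np.array([1.0, -2.0])
grad = lambda x: A @ x - b

x = np.zeros(2)

# Second-order (Newton) step: solve against the exact local Hessian.
newton_step = x - np.linalg.solve(A, grad(x))

# Preconditioned gradient step: P is chosen separately (a fixed diagonal here,
# in the spirit of Jacobi or Adam-style per-coordinate scaling); it is not the
# Hessian, and a step size eta is still required.
P = np.diag(np.diag(A))
eta = 0.5
precond_step = x - eta * np.linalg.solve(P, grad(x))

print("Newton step:        ", newton_step)
print("Preconditioned step:", precond_step)   # generally different updates
```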
Stable Minima of ReLU Neural Networks Suffer from the Curse of Dimensionality: The Neural Shattering Phenomenon
arxiv.org
We study the implicit bias of flatness / low (loss) curvature and its effects on generalization in two-layer overparameterized ReLU networks with multivariate inputs -- a problem well motivated by...
0
2
14
If a problem seems intractable, it's almost always because your specification of it is vague or incomplete. The solution doesn't appear when you "think harder". It appears when you describe the problem in a sufficiently precise and explicit fashion -- until you see its true
59
157
1K
You rarely solve hard problems in a flash of insight. It's more typically a slow, careful process of exploring a branching tree of possibilities. You must pause, backtrack, and weigh every alternative. You can't fully do this in your head, because your working memory is too
93
306
3K
My role at Meta's SAM team (MSL, previously at FAIR Perception) has been impacted within 3 months of joining after PhD. If you work with multimodal LLMs for grounding or complex reasoning, or have a long-term vision of unified understanding and generation, let's talk. I am on
Meta has gone crazy on the squid game! Many new PhD NGs (new grads) were deactivated today (I am also impacted 🥲 happy to chat)
27
27
343
I'm hiring 2 PhD students & 1 postdoc @GeorgiaTech for Fall '26. Motivated students, please consider us, especially those in
* ML+Quantum
* Deep Learning+Optimization
- PhD: see https://t.co/h4anjm6b8j
- Postdoc: see https://t.co/548XVaahx3 & https://t.co/4ahNE7OOwV
Retweet appreciated
9
120
466
Almost a decade ago, I coauthored a paper asking us to rethink our theory of generalization in machine learning. Today, I’m fine putting the theory back on the shelf.
argmin.net
You don't need a theorem to argue more data is better than less data
7
24
192
In machine learning, do you need to know any optimization algorithm other than stochastic gradient descent? A reluctant but best-faith argument for no.
argmin.net
Justifying a laser focus on stochastic gradient methods.
2
5
26
Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.* With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.
19
213
1K
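A toy illustration of the regime the tweet refers to (my sketch of the classical stability threshold, not the central-flows analysis itself): full-batch GD on a quadratic is classically unstable once the sharpness, i.e. the top Hessian eigenvalue, exceeds 2/eta, yet real networks train while hovering near that edge.

```python
# GD on f(x) = 0.5 * sharpness * x^2 with step size eta; the classical stability
# limit is sharpness < 2/eta. Real NNs sit near this edge without blowing up,
# which is the gap a theory like central flows has to explain.
eta = 0.1                                   # 2/eta = 20
for sharpness in (15.0, 19.0, 21.0):
    x = 1.0
    for _ in range(50):
        x -= eta * sharpness * x            # gradient step
    print(f"sharpness={sharpness:5.1f}  final loss={0.5 * sharpness * x**2:.3e}")
# sharpness=15, 19: the loss decays (oscillating in sign for eta*sharpness > 1);
# sharpness=21 (> 2/eta): the iterate oscillates and grows.
```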
@jasondeanlee @SebastienBubeck @tomgoldsteincs @zicokolter @atalwalkar This is the third, last, and best paper from my PhD. By some metrics, an ML PhD student who writes just three conference papers is "unproductive." But I wouldn't have had it any other way 😉 !
11
21
536
🚨New work: Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking ( https://t.co/U7e0d3duYq) In this work we propose a mathematical framework, named Li2, that explains the dynamics of grokking (i.e., delayed generalization) in 2-layer nonlinear networks.
arxiv.org
While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open problem whether there is a mathematical framework that characterizes what kind of...
8
37
227
Proud of my junior collaborators Kijung Jeon, Yuchen @YuchenZhu_ZYC, Wei @WeiGuo01, Jaemoo @jaemoo51133, Avrajit @GhoshAvrajit, Lianghe Shi, Yinuo @Yinuo_Ren, Haoxuan @haoxuan_steve_c -- 6 joint #NeurIPS2025 main track papers! Lucky to have you. Wanna join us? Will post recruiting info soon.
1
7
76
Sharing a new paper with Peter Bartlett, @jasondeanlee, @ShamKakade6, Bin Yu. People talk about implicit regularization, but how good is it? We show it's surprisingly effective: GD dominates ridge for all linear regression problems, with more cool stuff on GD vs SGD https://t.co/oAVKiVgUUQ
10
32
187
Information Geometry of Variational Bayes
arxiv.org
We highlight a fundamental connection between information geometry and variational Bayes (VB) and discuss its consequences for machine learning. Under certain conditions, a VB solution always...
1
27
162
The most important skill for a researcher is not technical ability. It's taste. The ability to identify interesting and tractable problems, and recognize important ideas when they show up. This can't be taught directly. It's cultivated through curiosity and broad reading.
100
569
4K
Modified Loss of Momentum Gradient Descent: Fine-Grained Analysis
arxiv.org
We analyze gradient descent with Polyak heavy-ball momentum (HB) whose fixed momentum parameter $\beta \in (0, 1)$ provides exponential decay of memory. Building on Kovachki and Stuart (2021), we...
0
1
4
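For reference, the standard Polyak heavy-ball update the abstract refers to (the textbook form, not the paper's fine-grained analysis):

```latex
% Heavy-ball (HB) step: \eta is the step size, \beta \in (0,1) the fixed momentum.
x_{t+1} = x_t - \eta \nabla f(x_t) + \beta \,(x_t - x_{t-1})
% Equivalent buffer form, making the exponential decay of memory explicit:
% m_t = \beta m_{t-1} + \nabla f(x_t) = \sum_{k=0}^{t} \beta^{t-k} \nabla f(x_k),
% \qquad x_{t+1} = x_t - \eta m_t .
```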
Another fantastic benchmark of optimizers. Key takeaways:
1. Variance-reduced Adam variants (e.g., MARS) achieve significant speedups over the AdamW baseline.
2. Matrix-based optimizers (e.g., Muon, SOAP) consistently outperform their scalar-based counterparts (e.g., Lion).
Fantastic Pretraining Optimizers and Where to Find Them "we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1–8× the Chinchilla optimum)." "we find that all the fastest optimizers such as Muon
5
22
187
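A sketch of the scalar-vs-matrix distinction in the takeaways above (my illustration, not the benchmark's code): a scalar-based optimizer updates every entry of a weight matrix independently, while a matrix-based one transforms the gradient of the whole matrix at once, e.g. by orthogonalizing it in the spirit of Muon.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))           # gradient (or momentum) of a 4x3 weight matrix
lr = 0.02

# Scalar-based update (Lion-style sign step): each entry is treated independently.
scalar_update = -lr * np.sign(G)

# Matrix-based update (Muon-inspired): replace G by its orthogonal factor U V^T,
# so every singular direction gets an equal-magnitude step. Muon approximates
# this with Newton-Schulz iterations; an explicit SVD is used here for clarity.
U, _, Vt = np.linalg.svd(G, full_matrices=False)
matrix_update = -lr * (U @ Vt)

print("scalar update norm:", np.linalg.norm(scalar_update))
print("matrix update norm:", np.linalg.norm(matrix_update))
```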
Eigenvalue distribution of the Neural Tangent Kernel in the quadratic scaling
arxiv.org
We compute the asymptotic eigenvalue distribution of the neural tangent kernel of a two-layer neural network under a specific scaling of dimension. Namely, if $X\in\mathbb{R}^{n\times d}$ is an...
0
2
16
Fundamental Limits of Matrix Sensing: Exact Asymptotics, Universality, and Applications
arxiv.org
In the matrix sensing problem, one wishes to reconstruct a matrix from (possibly noisy) observations of its linear projections along given directions. We consider this model in the...
0
1
8
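For reference, the standard matrix sensing observation model the abstract describes (the generic setup, not the paper's specific asymptotic regime):

```latex
% Observe m noisy linear projections of an unknown matrix X^* \in \mathbb{R}^{d_1 \times d_2}
% along given sensing directions A_i:
y_i = \langle A_i, X^* \rangle + \varepsilon_i
    = \operatorname{Tr}\!\left(A_i^{\top} X^*\right) + \varepsilon_i ,
\qquad i = 1, \dots, m .
```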