Andrey Gromov Profile
Andrey Gromov

@Andr3yGR

Followers
249
Following
780
Media
11
Statuses
57

Meta FAIR Research Scientist & physics professor at University of Maryland, College Park

Bay Area
Joined June 2009
@Andr3yGR
Andrey Gromov
2 months
Excited to be a part of this!
@SimonsFdn
Simons Foundation
2 months
Our new Simons Collaboration on the Physics of Learning and Neural Computation will employ and develop powerful tools from #physics, #math, computer science and theoretical #neuroscience to understand how large neural networks learn, compute, scale, reason and imagine:
0
4
21
@Andr3yGR
Andrey Gromov
4 months
Thank you!
0
0
2
@Andr3yGR
Andrey Gromov
4 months
We experimented with breaking transformer blocks into separate Attention and MLP modules. Then we let DNA models decide how to stack them. We find that models generally prefer more attention early on and more MLP later on. 9/
1
0
3
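A minimal sketch of how one might surface the "attention early, MLP late" preference from routing logs. The `route_choices` array, its layout, and the fabricated data (biased to mimic the reported trend) are assumptions for illustration, not the paper's format.

```python
import numpy as np

# Hypothetical routing log: route_choices[depth, token] is the module type
# ("attn" or "mlp") a learned router picked for that token at that depth.
rng = np.random.default_rng(0)
num_depths, num_tokens = 12, 4096
# Fake data biased the way the tweet describes: attention early, MLP late.
p_attn = np.linspace(0.8, 0.2, num_depths)
route_choices = np.array([
    rng.choice(["attn", "mlp"], size=num_tokens, p=[p, 1 - p]) for p in p_attn
])

# Fraction of tokens routed to attention at each depth.
attn_frac = (route_choices == "attn").mean(axis=1)
for d, f in enumerate(attn_frac):
    print(f"depth {d:2d}: attention fraction = {f:.2f}")
```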
@Andr3yGR
Andrey Gromov
4 months
The paths specialize to structures that are sometimes simple and sometimes complex: versions of “to be”, sentence-level attention, commas, this and that. 8/
1
0
2
@Andr3yGR
Andrey Gromov
4 months
We find that in language, the paths followed by tokens are distributed according to a power law. This reflects the extreme diversity of linguistic structures. Language DNAs are sparse right away. 7/
1
0
2
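A rough illustration of checking for a power law over paths: count how often each path occurs and fit a line to the rank-frequency curve in log-log space. The path format and the Zipf-style fake data are assumptions, not the paper's.

```python
import numpy as np
from collections import Counter

# Hypothetical per-token paths: each path is the tuple of module indices a
# token visited, e.g. (3, 0, 7, 2). Fabricated Zipf-like data for illustration.
rng = np.random.default_rng(0)
paths = [tuple(rng.zipf(2.0, size=4) % 16) for _ in range(100_000)]

counts = np.array(sorted(Counter(paths).values(), reverse=True), dtype=float)
ranks = np.arange(1, len(counts) + 1, dtype=float)

# Fit log(count) ~ alpha * log(rank) + c; a roughly straight line (alpha < 0)
# in log-log space is the power-law signature described in the tweet.
alpha, c = np.polyfit(np.log(ranks), np.log(counts), 1)
print(f"estimated rank-frequency exponent: {alpha:.2f}")
```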
@Andr3yGR
Andrey Gromov
4 months
Attention modules show emergent, dynamical (i.e., input-dependent) sparsity. Different attention/transformer modules focus on objects, background, or boundaries. The model is trying to segment the image. 6/
1
0
4
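One way to quantify the input-dependent sparsity described above is to measure, per input, the fraction of attention weights that are effectively zero. A hedged sketch; the `attn_weights` layout, the threshold, and the synthetic maps are illustrative assumptions.

```python
import torch

# Fraction of attention weights below a small threshold for one input.
# Input-dependent ("dynamical") sparsity means this number varies with the input.
def attention_sparsity(attn_weights: torch.Tensor, thresh: float = 1e-3) -> float:
    # attn_weights: (heads, queries, keys), rows summing to 1 after softmax
    return (attn_weights < thresh).float().mean().item()

# Toy check with two synthetic attention maps: a peaked (sparse) one and a
# near-uniform (dense) one.
peaked = torch.softmax(torch.randn(4, 16, 16) * 10.0, dim=-1)
uniform = torch.softmax(torch.randn(4, 16, 16) * 0.1, dim=-1)
print(attention_sparsity(peaked), attention_sparsity(uniform))
```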
@Andr3yGR
Andrey Gromov
4 months
Furthermore, using deep-dream-like methods we can recover many features of the input image just from knowing the paths (essentially collections of integers) that the image takes through the DNA. This gives an idea of how informative the paths are. 5/
1
0
2
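A minimal sketch of the deep-dream-style idea: freeze the model, then optimize the input until its routing decisions match a recorded path. `ToyRouter`, `route_logits`, and all shapes here are stand-ins I made up for illustration, not the paper's interface.

```python
import torch
import torch.nn as nn

class ToyRouter(nn.Module):
    """Stand-in for a DNA-style model: maps an input to per-step routing logits."""
    def __init__(self, image_dim=64, steps=6, num_modules=8):
        super().__init__()
        self.net = nn.Linear(image_dim, steps * num_modules)
        self.steps, self.num_modules = steps, num_modules

    def route_logits(self, x):                       # x: (1, image_dim)
        return self.net(x).view(self.steps, self.num_modules)

def reconstruct_from_path(model, target_path, image_dim=64, iters=300, lr=0.05):
    # Deep-dream-style input optimization: keep the model fixed and optimize the
    # input until its routing choices reproduce the recorded path (integers).
    x = torch.zeros(1, image_dim, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    target = torch.as_tensor(target_path)
    for _ in range(iters):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(model.route_logits(x), target)
        loss.backward()
        opt.step()
    return x.detach()

model = ToyRouter()
recovered = reconstruct_from_path(model, target_path=[1, 4, 4, 0, 7, 2])
print(recovered.shape)
```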
@Andr3yGR
Andrey Gromov
4 months
We find that paths that tokens take through the DNA are interpretable. Patches/tokens with similar content or context take the same paths. (Something similar should hold true for classic MoE, but we have not checked yet.) 4/
1
0
4
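The grouping described above is simple to state in code: collect, for each patch/token, the tuple of module indices it visited, and group tokens that share a path. The toy records below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical records: (token_text, path) pairs, where path is the tuple of
# module indices the token visited in the DNA.
records = [
    ("the", (0, 3, 3, 1)), ("a", (0, 3, 3, 1)), (",", (2, 2, 5, 5)),
    (",", (2, 2, 5, 5)), ("running", (1, 4, 3, 0)), ("jumping", (1, 4, 3, 0)),
]

groups = defaultdict(list)
for token, path in records:
    groups[path].append(token)

# Tokens sharing a path tend to share content or context, per the tweet.
for path, tokens in groups.items():
    print(path, "->", tokens)
```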
@Andr3yGR
Andrey Gromov
4 months
DNAs show emergent connectivity and computation that are very different from their dense counterparts, while showing competitive performance at ~25% fewer FLOPs. Vision models are dense in their first half and sparse in the second. 3/
1
0
3
@Andr3yGR
Andrey Gromov
4 months
We taught DNAs to allocate compute based on the content and context of each token/patch. The model's choices are human-interpretable and tell us that the vision model is essentially segmenting the image. Images that are hard to segment cost more compute. 2/
1
0
3
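One way to make "harder images cost more compute" concrete: if the module pool contains a cheap skip/identity option, per-image cost can be read off as the number of non-identity module calls. The trace layout and the assumption that index 0 is the cheap module are mine, for illustration only.

```python
import numpy as np

# Hypothetical routing trace for one image: choices[step, patch] is the index of
# the module each patch was sent to; index 0 is assumed to be a cheap identity /
# skip module, so "compute cost" = number of non-identity calls.
def compute_cost(choices: np.ndarray) -> int:
    return int((choices != 0).sum())

rng = np.random.default_rng(0)
easy_image = rng.choice([0, 1, 2], size=(8, 196), p=[0.6, 0.2, 0.2])
hard_image = rng.choice([0, 1, 2], size=(8, 196), p=[0.2, 0.4, 0.4])
print("easy:", compute_cost(easy_image), "hard:", compute_cost(hard_image))
```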
@Andr3yGR
Andrey Gromov
4 months
Do neural networks have to be feed-forward? We built a collection of Distributed Neural Architectures (DNAs) in vision and language domains where all modules can talk to each other at the same time and non-feedforward connectivity emerges from end-to-end training. 1/
1
0
6
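To make the idea concrete, here is an illustrative toy of the routing principle: one shared pool of modules, and at every step each token softly chooses which module to use next, so the effective connectivity is learned end-to-end rather than fixed as a feed-forward stack. This is a sketch under my own assumptions (module pool, soft routing, three choices), not the paper's implementation.

```python
import torch
import torch.nn as nn

class TinyDNA(nn.Module):
    def __init__(self, dim=64, heads=4, steps=6):
        super().__init__()
        self.steps = steps
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.router = nn.Linear(dim, 3)   # choices: attention, MLP, skip

    def forward(self, x):                          # x: (batch, tokens, dim)
        path = []
        for _ in range(self.steps):
            probs = self.router(x).softmax(dim=-1)                     # (B, T, 3)
            attn_out, _ = self.attn(x, x, x)
            options = torch.stack([attn_out, self.mlp(x), x], dim=-1)  # (B, T, D, 3)
            x = (options * probs.unsqueeze(-2)).sum(dim=-1)            # soft routing
            path.append(probs.argmax(dim=-1))                          # hard choice, for logging
        return x, torch.stack(path, dim=-1)                            # path: (B, T, steps)

x = torch.randn(2, 16, 64)
out, path = TinyDNA()(x)
print(out.shape, path.shape)   # (2, 16, 64) (2, 16, 6)
```

The logged `path` tensor is what the later tweets in the thread analyze: which modules each token visited, step by step.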
@Andr3yGR
Andrey Gromov
4 months
New paper! Collaboration with @TianyuHe_ and Aditya Cowsik. Thread.🧵
2
32
174
@BorisHanin
Boris Hanin
6 months
What an incredible lineup of panelists and researchers! Super excited to attend this.
@mithrilcompute
Mithril (formerly Foundry)
6 months
🚨 🧬 AI for Science Symposium 🔭🚨 We're gathering AI 4 Science leaders from industry (@vkhosla, @OpenAI), academia (@MoAlQuraishi, @iaifi_news), gov (@patrickshafto, @BerkeleyLab), and non-profits (@JoanneZPeng, @oziadias). Join us May 16 in SF. Registration link and more info ⬇️
3
3
18
@Andr3yGR
Andrey Gromov
8 months
Fun collaboration!
@tydsh
Yuandong Tian
8 months
Our new work Spectral Journey https://t.co/1C4Hrxb2Ig shows a surprising finding: when a 2-layer Transformer is trained to predict the shortest path in a given graph, 1️⃣ it first implicitly computes the spectral embedding for each edge, i.e. eigenvectors of the Normalized Graph Laplacian…
0
0
7
arxiv.org
Decoder-only transformers lead to a step-change in capability of large language models. However, opinions are mixed as to whether they are really planning or reasoning. A path to making progress...
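For reference, the spectral embedding mentioned in the quoted tweet can be written down directly: it comes from eigenvectors of the normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}. A small numpy sketch on an arbitrary example graph (the claim in the paper is that the trained 2-layer Transformer implicitly computes something like this, not that it runs this code):

```python
import numpy as np

# Spectral embedding of a small graph: low-frequency eigenvectors of the
# normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)

d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
embedding = eigvecs[:, 1:3]            # first non-trivial, low-frequency directions
print(np.round(embedding, 3))
```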
@darshilhdoshi1
Darshil Doshi @ICML2025
10 months
Interested in mechanistic interpretability of how Transformers learn in-context via skill composition? Come to our #NeurIPS2024 Oral presentation! 📅 Wed, Dec 11 ⏰ 10:00 AM (oral), 11:00 AM - 2 PM (poster) 📍East Ballroom A-B (oral), East Exhibit Hall A-C #3200 (poster)
1
1
4
@MBarkeshli
Maissam Barkeshli
1 year
John Hopfield has a nice article in the annual reviews of condensed matter physics. It starts off with a discussion of what physics is, which I think is totally on point.
16
168
868
@MBarkeshli
Maissam Barkeshli
1 year
The Nobel Committee recognizes profound contributions from Physics to ML / AI. There's a lot more where that came from. We are in an era where an increasing number of physicists are making important contributions to ML / AI, and even more are needed going forward.
@NobelPrize
The Nobel Prize
1 year
BREAKING NEWS The Royal Swedish Academy of Sciences has decided to award the 2024 #NobelPrize in Physics to John J. Hopfield and Geoffrey E. Hinton “for foundational discoveries and inventions that enable machine learning with artificial neural networks.”
2
3
25