Andrey Gromov Profile
Andrey Gromov

@Andr3yGR

Followers: 280
Following: 819
Media: 11
Statuses: 59

Meta FAIR Research Scientist & physics professor at University of Maryland, College Park

Bay Area
Joined June 2009
@InnaVishik
Inna Vishik
2 months
Phil Anderson was one of the greats of 20th century physics. A biography and summary of his greatest hits, written by physicists who knew him well, is on arXiv https://t.co/hTu6VNHslK
0
2
18
@Andr3yGR
Andrey Gromov
4 months
Excited to be a part of this!
@SimonsFdn
Simons Foundation
4 months
Our new Simons Collaboration on the Physics of Learning and Neural Computation will employ and develop powerful tools from #physics, #math, computer science and theoretical #neuroscience to understand how large neural networks learn, compute, scale, reason and imagine:
0
4
21
@Andr3yGR
Andrey Gromov
6 months
Thank you!
0
0
2
@Andr3yGR
Andrey Gromov
6 months
We experimented with breaking transformer blocks into separate Attention and MLP modules, then let DNA models decide how to stack them. We find that models generally prefer more attention early on and more MLP later on. 9/
1
0
3
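A minimal sketch of what splitting a transformer block into independently stackable Attention and MLP modules might look like; the class names, sizes, and the idea of a learned router choosing the stacking order are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: a transformer block split into independently
# routable Attention and MLP modules (names and sizes are assumptions).
import torch
import torch.nn as nn


class AttentionModule(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (batch, tokens, dim)
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out  # residual connection


class MLPModule(nn.Module):
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, expansion * dim), nn.GELU(), nn.Linear(expansion * dim, dim)
        )

    def forward(self, x):
        return x + self.mlp(self.norm(x))


# A pool of both module types; a learned router (not shown here) would decide,
# per depth step, how to stack them instead of using a fixed Attn->MLP order.
dim = 64
pool = nn.ModuleList([AttentionModule(dim) for _ in range(4)] +
                     [MLPModule(dim) for _ in range(4)])
x = torch.randn(2, 16, dim)
print(pool[0](x).shape)  # torch.Size([2, 16, 64])
```

The tweet's observation (more attention early, more MLP later) would then show up in the router's depth-wise selection statistics rather than being hard-coded.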
@Andr3yGR
Andrey Gromov
6 months
The paths specialize to structures that are sometimes simple and sometimes complex: versions of "to be", sentence-level attention, commas, "this" and "that". 8/
1
0
2
@Andr3yGR
Andrey Gromov
6 months
We find that in language the paths followed by the tokens are distributed according to a power law. This reflects the extreme diversity of language structures. Language DNAs are sparse right away. 7/
1
0
2
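A small sketch of how such a claim could be checked, assuming paths have already been extracted as tuples of module indices per token; the data below is synthetic and stands in for real routing paths.

```python
# Sketch: rank-frequency check for a power law over token paths.
# Paths are assumed to be tuples of module indices; data here is synthetic.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
# Synthetic "paths": Zipf-distributed ids stand in for real routing paths.
fake_paths = [tuple(rng.zipf(2.0, size=4) % 8) for _ in range(10_000)]

counts = np.array(sorted(Counter(fake_paths).values(), reverse=True), dtype=float)
ranks = np.arange(1, len(counts) + 1, dtype=float)

# A power law appears as a straight line in log-log rank-frequency space;
# its slope can be estimated with a least-squares fit on the logs.
slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
print(f"estimated log-log slope: {slope:.2f}")
```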
@Andr3yGR
Andrey Gromov
6 months
Attention modules show emergent, dynamical (i.e., input-dependent) sparsity. Different attention/transformer modules focus on objects, background, or boundaries. The model is trying to segment the image. 6/
1
0
4
@Andr3yGR
Andrey Gromov
6 months
Furthermore, using deep-dream-like methods we can recover many features of the input image just from knowing the path (essentially a collection of integers) that the image takes through the DNA. This gives an idea of how informative the paths are. 5/
1
0
2
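A hedged sketch of the general deep-dream-style idea: optimize a random input by gradient ascent so that its routing decisions match a recorded target path. The `routing_logits` interface and the toy router below are hypothetical stand-ins, not the paper's API.

```python
# Sketch of a deep-dream-style inversion: nudge a random image so that its
# routing decisions match a recorded target path. `routing_logits` is a
# hypothetical interface standing in for whatever the real model exposes.
import torch
import torch.nn.functional as F


def recover_from_path(routing_logits, target_path, shape=(1, 3, 32, 32),
                      steps=200, lr=0.05):
    """routing_logits(x) -> (depth, num_modules) logits; target_path: list[int]."""
    x = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    target = torch.tensor(target_path)
    for _ in range(steps):
        opt.zero_grad()
        logits = routing_logits(x)              # (depth, num_modules)
        loss = F.cross_entropy(logits, target)  # push the input toward the target modules
        loss.backward()
        opt.step()
    return x.detach()


# Toy stand-in for a trained model's router, so the sketch runs end to end.
proj = torch.nn.Linear(3 * 32 * 32, 3 * 8)
toy_router = lambda x: proj(x.flatten(1)).reshape(3, 8)  # depth=3, 8 modules
img = recover_from_path(toy_router, target_path=[1, 4, 2])
print(img.shape)  # torch.Size([1, 3, 32, 32])
```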
@Andr3yGR
Andrey Gromov
6 months
We find that the paths tokens take through the DNA are interpretable. Patches/tokens with similar content or context take the same paths. (Something similar should hold true for classic MoE, but we have not checked yet.) 4/
1
0
4
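A sketch of what "a path" could mean concretely: the argmax module choice at each depth, with tokens then grouped by their shared path. The routing-weight tensor here is random stand-in data, not output from a trained DNA model.

```python
# Sketch: a token's "path" as the argmax module choice at every depth,
# then grouping tokens that share the same path.
import torch
from collections import defaultdict

num_tokens, depth, num_modules = 12, 4, 6
routing = torch.softmax(torch.randn(num_tokens, depth, num_modules), dim=-1)

paths = routing.argmax(dim=-1)            # (num_tokens, depth) module indices
groups = defaultdict(list)
for tok, path in enumerate(paths.tolist()):
    groups[tuple(path)].append(tok)       # tokens with identical paths cluster together

for path, toks in groups.items():
    print(path, toks)
```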
@Andr3yGR
Andrey Gromov
6 months
DNAs show emergent connectivity and computation that are very different from their dense counterparts, while showing competitive performance at ~25% fewer FLOPs. Vision models are dense in their first half and sparse in the second. 3/
1
0
3
@Andr3yGR
Andrey Gromov
6 months
We taught DNAs to allocate compute based on the content and context of each token/patch. The model's choices are human-interpretable and tell us that the vision model is essentially segmenting the image. Images that are hard to segment cost more compute. 2/
1
0
3
@Andr3yGR
Andrey Gromov
6 months
Do neural networks have to be feed-forward? We built a collection of Distributed Neural Architectures (DNAs) in vision and language domains where all modules can talk to each other at the same time and non-feedforward connectivity emerges from end-to-end training. 1/
1
0
6
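A minimal, hypothetical sketch of the routing idea as described in the thread: a shared pool of modules applied for a fixed number of steps, with a learned per-token router choosing which module each token visits next, so any module can follow any other (including itself). Names, sizes, and the soft-mixing scheme are assumptions for illustration, not the paper's code.

```python
# Hypothetical sketch of routed, non-feedforward computation: a shared pool
# of modules is reused for several steps, and a learned router picks which
# module each token visits next (so any module can follow any other).
import torch
import torch.nn as nn


class TinyDNA(nn.Module):
    def __init__(self, dim=64, num_modules=6, steps=4):
        super().__init__()
        self.modules_pool = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
            for _ in range(num_modules)
        )
        self.router = nn.Linear(dim, num_modules)  # per-token routing scores
        self.steps = steps

    def forward(self, x):  # x: (batch, tokens, dim)
        path = []
        for _ in range(self.steps):
            weights = torch.softmax(self.router(x), dim=-1)        # (B, T, M)
            path.append(weights.argmax(dim=-1))                    # chosen module per token
            outs = torch.stack([m(x) for m in self.modules_pool], dim=-1)  # (B, T, D, M)
            x = x + torch.einsum("btdm,btm->btd", outs, weights)   # soft mixture + residual
        return x, torch.stack(path, dim=-1)                        # paths: (B, T, steps)


model = TinyDNA()
x = torch.randn(2, 10, 64)
y, paths = model(x)
print(y.shape, paths.shape)  # torch.Size([2, 10, 64]) torch.Size([2, 10, 4])
```

In this toy version the mixing is soft; making the routing sparse (e.g., top-k) is what would give the input-dependent compute allocation described earlier in the thread.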
@Andr3yGR
Andrey Gromov
6 months
New paper! Collaboration with @TianyuHe_ and Aditya Cowsik. Thread.🧵
2
32
174
@BorisHanin
Boris Hanin
8 months
What an incredible lineup of panelists and researchers! Super excited to attend this.
@mithrilcompute
Mithril (formerly Foundry)
8 months
🚨 🧬 AI for Science Symposium 🔭🚨 We're gathering AI 4 Science leaders from industry (@vkhosla, @OpenAI), academia (@MoAlQuraishi, @iaifi_news), government (@patrickshafto, @BerkeleyLab), and non-profits (@JoanneZPeng, @oziadias). Join us May 16 in SF. Registration link and more info ⬇️
3
3
17
@Andr3yGR
Andrey Gromov
10 months
Fun collaboration!
@tydsh
Yuandong Tian
10 months
Our new work Spectral Journey https://t.co/1C4Hrxb2Ig shows a surprising finding: when a 2-layer Transformer is trained to predict the shortest path in a given graph, 1️⃣ it first implicitly computes the spectral embedding for each edge, i.e. eigenvectors of the Normalized Graph Laplacian …
0
0
7
@tydsh
Yuandong Tian
10 months
Our new work Spectral Journey https://t.co/1C4Hrxb2Ig shows a surprising finding: when a 2-layer Transformer is trained to predict the shortest path in a given graph, 1️⃣ it first implicitly computes the spectral embedding for each edge, i.e. eigenvectors of the Normalized Graph Laplacian …
arxiv.org
Decoder-only transformers lead to a step-change in capability of large language models. However, opinions are mixed as to whether they are really planning or reasoning. A path to making progress...
8
90
466
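A small numpy sketch of the object the tweet refers to: eigenvectors of the normalized graph Laplacian of a toy graph, with edge embeddings formed from the endpoints' eigenvector rows. This only illustrates the quantity; the graph, dimensions, and the concatenation-based edge embedding are assumptions, not the paper's probing setup.

```python
# Sketch: spectral embedding from the normalized graph Laplacian of a toy
# graph, and edge embeddings built from the endpoints' eigenvector rows.
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 4)]
n = 5
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)          # columns sorted by eigenvalue
node_embed = eigvecs[:, 1:3]                  # skip the trivial eigenvector

# One possible edge embedding: concatenate the two endpoint embeddings.
edge_embed = {e: np.concatenate([node_embed[e[0]], node_embed[e[1]]]) for e in edges}
print(edge_embed[(0, 1)])
```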
@darshilhdoshi1
Darshil Doshi
1 year
Interested in mechanistic interpretability of how Transformers learn in-context via skill composition? Come to our #NeurIPS2024 Oral presentation! 📅 Wed, Dec 11 ⏰ 10:00 AM (oral), 11:00 AM - 2 PM (poster) 📍East Ballroom A-B (oral), East Exhibit Hall A-C #3200 (poster)
1
1
4
@MBarkeshli
Maissam Barkeshli
1 year
John Hopfield has a nice article in the Annual Review of Condensed Matter Physics. It starts off with a discussion of what physics is, which I think is totally on point.
16
168
853