Andrey Gromov Profile
Andrey Gromov

@Andr3yGR

Followers: 232 · Following: 750 · Media: 11 · Statuses: 56

Meta FAIR Research Scientist & physics professor at University of Maryland, College Park

Bay Area
Joined June 2009
@Andr3yGR
Andrey Gromov
11 days
Excited to be a part of this!
@SimonsFdn
Simons Foundation
11 days
Our new Simons Collaboration on the Physics of Learning and Neural Computation will employ and develop powerful tools from #physics, #math, computer science and theoretical #neuroscience to understand how large neural networks learn, compute, scale, reason and imagine.
@Andr3yGR
Andrey Gromov
2 months
Thank you!
@Andr3yGR
Andrey Gromov
2 months
There are more experiments and visualizations in the paper. Routing and conditional computation should be taken more seriously. 10/
arxiv.org
We introduce and train distributed neural architectures (DNA) in vision and language domains. DNAs are initialized with a proto-architecture that consists of (transformer, MLP, attention, etc.)...
@Andr3yGR
Andrey Gromov
2 months
We experimented with breaking transformer blocks into Attention and MLP. Then we let DNA models decide how to stack them. We find that models generally prefer more attention early on and more MLP later on. 9/
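The "attention early, MLP later" observation can be quantified by tallying which module type the model chose at each depth position across token paths. A toy sketch with invented paths ("A" = attention, "M" = MLP; the data here is made up, not from the paper):

```python
# Hypothetical paths: each string is the sequence of module types
# one token routed through, from shallow to deep.
paths = ["AAMM", "AMAM", "AAAM", "AMMM", "AAMM"]

depth = len(paths[0])
# Fraction of tokens that picked an attention module at each depth.
attn_frac = [
    sum(p[d] == "A" for p in paths) / len(paths) for d in range(depth)
]
print(attn_frac)
```

On this toy data the fraction of attention modules falls monotonically with depth, which is the shape of the trend the tweet describes.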
@Andr3yGR
Andrey Gromov
2 months
The paths specialize to structures both simple and complex: versions of “to be”, sentence-level attention, commas, “this” and “that”. 8/
@Andr3yGR
Andrey Gromov
2 months
We find that in language, the paths followed by tokens are distributed according to a power law. This reflects the extreme diversity of language structures. Language DNAs are sparse right away. 7/
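A power-law (Zipf-like) rank–frequency curve is easy to check: sort path counts by rank and see whether log-frequency falls roughly linearly in log-rank. A sketch on synthetic Zipfian counts (real counts would come from a trained model; the data below is fabricated for illustration):

```python
import math

# Synthetic path counts following freq ~ 1/rank (Zipf, exponent 1).
counts = [round(1000 / r) for r in range(1, 51)]

# Least-squares slope of log(freq) vs log(rank).
xs = [math.log(r) for r in range(1, len(counts) + 1)]
ys = [math.log(c) for c in counts]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
print(f"estimated exponent: {slope:.2f}")
```

For genuinely Zipfian counts the estimated slope sits near -1; a markedly shallower slope would indicate paths are used much more uniformly.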
@Andr3yGR
Andrey Gromov
2 months
Attention modules show emergent, dynamical (i.e., input-dependent) sparsity. Different attention/transformer modules focus on objects, background, or boundaries. The model is effectively trying to segment the image. 6/
@Andr3yGR
Andrey Gromov
2 months
Furthermore, using deep-dream-like methods we can recover many features of the input image just from knowing the paths (essentially collections of integers) that an image takes through the DNA. This gives an idea of how informative the paths are. 5/
@Andr3yGR
Andrey Gromov
2 months
We find that paths that tokens take through the DNA are interpretable. Patches/tokens with similar content or context take the same paths. (Something similar should hold true for classic MoE, but we have not checked yet.) 4/
@Andr3yGR
Andrey Gromov
2 months
DNAs show emergent connectivity and computation that are very different from their dense counterparts, while showing competitive performance at ~25% fewer FLOPs. Vision models are dense in their first half and sparse in the second. 3/
@Andr3yGR
Andrey Gromov
2 months
We taught DNAs to allocate compute based on the content and context of each token/patch. The model's choices are human interpretable and tell us that the vision model is essentially segmenting the image. Images that are hard to segment cost more compute. 2/
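One way to read "hard images cost more compute" is a halting rule: keep applying modules until a per-token confidence clears a threshold, so the FLOPs spent depend on the input. A hypothetical sketch of that idea (names and the confidence model are invented for illustration, not the paper's mechanism):

```python
def adaptive_depth(difficulty, max_steps=8, threshold=0.9):
    """Spend more steps on harder inputs: confidence grows each step,
    more slowly for difficult inputs; stop once it clears the threshold."""
    confidence, steps = 0.0, 0
    while confidence < threshold and steps < max_steps:
        confidence += (1.0 - difficulty) * 0.4  # easy inputs gain fast
        steps += 1
    return steps

easy = adaptive_depth(0.1)  # easy token halts early
hard = adaptive_depth(0.8)  # hard token runs to the step budget
print(easy, hard)
```

The step count is the compute budget actually spent, so an image whose tokens are hard to resolve (e.g., hard to segment) ends up costing more.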
@Andr3yGR
Andrey Gromov
2 months
Do neural networks have to be feed-forward? We built a collection of Distributed Neural Architectures (DNAs) in vision and language domains where all modules can talk to each other at the same time and non-feedforward connectivity emerges from end-to-end training. 1/
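The thread doesn't include code, but the core idea can be sketched: each token carries a path (a list of module indices), and at every step a router scores all modules, including ones already visited, so non-feedforward connectivity can emerge. A toy sketch with invented modules and router weights (nothing here is from the paper):

```python
# Toy modules transforming a scalar "token state".
MODULES = {
    0: lambda h: h + 1.0,   # stand-in for an attention module
    1: lambda h: h * 0.5,   # stand-in for an MLP module
    2: lambda h: -h,        # another module
}

def route(h, weights):
    """Pick the next module by argmax over affine scores of the state.
    weights[m] = (bias, gain) is a stand-in for a learned router."""
    scores = {m: b + g * h for m, (b, g) in weights.items()}
    return max(scores, key=scores.get)

def forward(h, weights, steps=4):
    """Run a token through the module graph; any module may follow any
    other (including itself), so connectivity is not feed-forward."""
    path = []
    for _ in range(steps):
        m = route(h, weights)
        path.append(m)
        h = MODULES[m](m_state := h) if False else MODULES[m](h)
        h = h  # state carried to the next routing decision
    return h, path

weights = {0: (0.0, -1.0), 1: (0.0, 1.0), 2: (0.2, 0.0)}
h, path = forward(0.3, weights)
print(path)  # module 2 is revisited: the route is not feed-forward
```

The returned path is exactly the "collection of integers" the later tweets analyze for interpretability.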
@Andr3yGR
Andrey Gromov
2 months
New paper! Collaboration with @TianyuHe_ and Aditya Cowsik. Thread 🧵
@Andr3yGR
Andrey Gromov
3 months
RT @jxmnop:
@Andr3yGR
Andrey Gromov
4 months
RT @BorisHanin: What an incredible lineup of panelists and researchers! Super excited to attend this.
@Andr3yGR
Andrey Gromov
7 months
Fun collaboration!
@tydsh
Yuandong Tian
7 months
Our new work Spectral Journey shows a surprising finding: when a 2-layer Transformer is trained to predict the shortest path in a given graph, 1️⃣ it first implicitly computes the spectral embedding for each edge, i.e. eigenvectors of the Normalized Graph…
@Andr3yGR
Andrey Gromov
9 months
RT @darshilhdoshi1: Interested in mechanistic interpretability of how Transformers learn in-context via skill composition? Come to our #Neu…
@Andr3yGR
Andrey Gromov
11 months
RT @MBarkeshli: John Hopfield has a nice article in the annual reviews of condensed matter physics. It starts off with a discussion of what…
@Andr3yGR
Andrey Gromov
11 months
RT @MBarkeshli: The Nobel Committee recognizes profound contributions from Physics to ML / AI. There's a lot more where that came from. We…