Andrew Saxe

@SaxeLab

Followers: 5K · Following: 2K · Media: 82 · Statuses: 697

Prof at @GatsbyUCL and @SWC_Neuro, trying to figure out how we learn. Bluesky: @SaxeLab Mastodon: @[email protected]

London, UK
Joined November 2019
@SaxeLab
Andrew Saxe
1 month
How does in-context learning emerge in attention models during gradient descent training? Sharing our new Spotlight paper @icmlconf: Training Dynamics of In-Context Learning in Linear Attention. Led by Yedi Zhang with @Aaditya6284 and Peter Latham.
2
22
110
@SaxeLab
Andrew Saxe
20 days
RT @a_proca: How do task dynamics impact learning in networks with internal dynamics? Excited to share our ICML Oral paper on learning dyn…
0
18
0
@SaxeLab
Andrew Saxe
1 month
RT @Aaditya6284: Excited to share this work has been accepted as an Oral at #icml2025 -- looking forward to seeing everyone in Vancouver, a….
0
5
0
@SaxeLab
Andrew Saxe
1 month
RT @Aaditya6284: Transformers employ different strategies through training to minimize loss, but how do these trade off and why? Excited to…
0
23
0
@SaxeLab
Andrew Saxe
1 month
RT @Aaditya6284: Was super fun to be a part of this work! Felt very satisfying to bring the theory work on ICL with linear attention a bit….
0
5
0
@SaxeLab
Andrew Saxe
1 month
Overall, we provide a theoretical description of how ICL abilities evolve during gradient descent training of linear attention, revealing abrupt acquisition or progressive improvements depending on how the key and query are parametrized.
0
0
3
@SaxeLab
Andrew Saxe
1 month
PCR is an interesting inductive bias--it is considered effective when data has a certain kind of low-dimensional structure in which a few 'signal' directions with higher input variance are buried in a background of lower input variance 'noise' directions.
1
0
4
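To make this concrete, here is a small illustrative sketch (mine, not code from the paper): inputs are drawn with a few high-variance 'signal' directions buried among many low-variance 'noise' directions, and the targets depend only on the signal directions. With few samples, principal component regression (PCR) on the top components tends to generalize better than ordinary least squares over all directions. The dimensions, variances, and sample sizes below are made-up illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 50, 3              # input dimension, number of 'signal' directions (illustrative)
n_train, n_test = 40, 1000

# Spiked input covariance: a few high-variance signal directions in a low-variance background.
variances = np.concatenate([np.full(k, 5.0), np.full(d - k, 0.1)])
w_true = np.zeros(d)
w_true[:k] = rng.normal(size=k)   # the target depends only on the signal directions

def sample(n):
    X = rng.normal(size=(n, d)) * np.sqrt(variances)
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y

X, y = sample(n_train)
X_test, y_test = sample(n_test)

# Ordinary least squares over all d directions (min-norm solution since n_train < d).
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Principal component regression: project onto the top-m principal directions, then regress.
def pcr(X, y, m):
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:m].T                                    # (d, m) top-m principal directions
    coef = np.linalg.lstsq(X @ P, y, rcond=None)[0]
    return P @ coef                                 # back to a d-dimensional weight vector

w_pcr = pcr(X, y, m=k)

for name, w in [("OLS", w_ols), (f"PCR (m={k})", w_pcr)]:
    print(f"{name:10s} test MSE: {np.mean((X_test @ w - y_test) ** 2):.3f}")
```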
@SaxeLab
Andrew Saxe
1 month
During training, linear attention with separate key and query implements principal component regression (PCR) in context with the number of PCs increasing over time. If training stops during the (m+1)-th plateau, it approximately implements PCR in context with the first m PCs.
1
0
4
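As a rough operational sketch of "PCR in context" (my reconstruction from the tweet text, not the paper's code): given a single context of (x, y) pairs, project the context inputs onto their top-m principal components, solve the regression in that subspace, and read off the query prediction. Stopping during the (m+1)-th plateau would then correspond to using the first m components. The task sizes and variance profile below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def pcr_in_context(X_ctx, y_ctx, x_query, m):
    """Predict the query label via principal component regression on a single context.

    X_ctx: (N, d) context inputs, y_ctx: (N,) context labels, x_query: (d,).
    Only the first m principal components of the context inputs are used.
    """
    _, _, Vt = np.linalg.svd(X_ctx, full_matrices=False)
    P = Vt[:m].T                                        # (d, m) top-m principal directions
    coef = np.linalg.lstsq(X_ctx @ P, y_ctx, rcond=None)[0]
    return float(x_query @ P @ coef)

# One in-context regression task: a fresh weight vector and inputs with decaying variances.
d, N = 8, 32
w = rng.normal(size=d)
input_scales = np.linspace(2.0, 0.2, d)
X_ctx = rng.normal(size=(N, d)) * input_scales
y_ctx = X_ctx @ w
x_q = rng.normal(size=d) * input_scales

for m in (1, 2, 4, 8):
    print(f"m={m}: prediction {pcr_in_context(X_ctx, y_ctx, x_q, m):+.3f}, target {x_q @ w:+.3f}")
```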
@SaxeLab
Andrew Saxe
1 month
For linear attention with separate key and query, we show that the training dynamics has exponentially many fixed points and the loss exhibits saddle-to-saddle dynamics, which we reduce to scalar ordinary differential equations.
1
0
3
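The reduction to scalar ODEs is the paper's result; the equations below are not from it. They are a generic toy of saddle-to-saddle dynamics, assuming logistic gradient-flow growth of independent modes from a small initialization (in the spirit of earlier deep linear network analyses), which is enough to reproduce the qualitative loss profile of long plateaus punctuated by abrupt drops.

```python
import numpy as np

# Toy scalar gradient-flow ODEs (illustrative only, not the paper's equations):
# each mode u_i follows du_i/dt = u_i * (s_i - u_i), starting near zero, so modes with
# larger 'signal' s_i escape the saddle at u = 0 earlier. The loss sum_i (s_i - u_i)^2
# then shows the plateau-and-drop staircase characteristic of saddle-to-saddle dynamics.
s = np.array([4.0, 2.0, 1.0])        # per-mode signal strengths (made up)
u = np.full_like(s, 1e-6)            # small initialization near the saddle
dt, steps = 1e-3, 40_000

losses = []
for _ in range(steps):
    u += dt * u * (s - u)            # forward-Euler step of the scalar ODEs
    losses.append(np.sum((s - u) ** 2))

# Sample the loss at a few times: each mode produces its own plateau followed by a drop.
for t in range(0, steps, steps // 10):
    print(f"t = {t * dt:5.1f}   loss = {losses[t]:8.4f}")
```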
@SaxeLab
Andrew Saxe
1 month
For linear attention with merged key and query, we show that its training dynamics has two fixed points and the loss trajectory exhibits a single, abrupt drop. We derive an exact analytical time-course solution for a certain class of datasets and initializations.
1
0
6
@SaxeLab
Andrew Saxe
1 month
We study the gradient descent dynamics of multi-head linear self-attention trained for in-context linear regression. We examine 2 common parametrizations of linear attention: one with the key and query weights merged as a single matrix, and one with separate key and query weights.
1
0
4
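For readers who want to poke at the two parametrizations, here is a self-contained sketch written from this thread rather than from the paper's code: a single-head linear self-attention layer reads a context of (x, y) tokens plus a query token with its label slot zeroed, and is trained by gradient descent on in-context linear regression with either a merged key-query matrix or separate key and query matrices. The token format follows the standard in-context linear regression construction; the read-out vector, initialization scale, learning rate, and batch sizes are my own illustrative choices and may need tuning.

```python
import jax
import jax.numpy as jnp

# Illustrative sizes and hyperparameters (my choices, not the paper's).
d, N, batch = 5, 20, 256          # input dimension, context length, tasks per batch
lr, steps, scale = 0.05, 6000, 1e-2

def sample_batch(key):
    """In-context linear regression: every task in the batch has its own weight vector."""
    kw, kx, kq = jax.random.split(key, 3)
    w   = jax.random.normal(kw, (batch, d))
    X   = jax.random.normal(kx, (batch, N, d))
    y   = jnp.einsum('bnd,bd->bn', X, w)
    x_q = jax.random.normal(kq, (batch, d))
    y_q = jnp.einsum('bd,bd->b', x_q, w)
    return X, y, x_q, y_q

def predict(params, X, y, x_q):
    """Single-head linear self-attention; read out the label slot of the query token."""
    E   = jnp.concatenate([X, y[..., None]], axis=-1)                  # (batch, N, d+1) context tokens
    e_q = jnp.concatenate([x_q, jnp.zeros_like(x_q[:, :1])], axis=-1)  # query token, label zeroed
    if 'W_KQ' in params:   # merged key-query matrix
        attn = jnp.einsum('bni,ij,bj->bn', E, params['W_KQ'], e_q) / N
    else:                  # separate key and query matrices
        attn = jnp.einsum('bni,ki,kj,bj->bn', E, params['W_K'], params['W_Q'], e_q) / N
    values = jnp.einsum('bn,bni->bi', attn, E)   # attention-weighted sum of context tokens
    return values @ params['v']                  # value/projection read-out vector

def loss(params, data):
    X, y, x_q, y_q = data
    return jnp.mean((predict(params, X, y, x_q) - y_q) ** 2)

def train(params, seed=0):
    key = jax.random.PRNGKey(seed)
    grad_fn = jax.jit(jax.value_and_grad(loss))
    for step in range(steps):
        key, sub = jax.random.split(key)
        l, g = grad_fn(params, sample_batch(sub))
        params = jax.tree_util.tree_map(lambda p, gp: p - lr * gp, params, g)
        if step % 1000 == 0:
            print(f"  step {step:5d}  loss {float(l):.4f}")
    return params

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(42), 3)
merged   = {'W_KQ': scale * jax.random.normal(k1, (d + 1, d + 1)),
            'v':    scale * jax.random.normal(k3, (d + 1,))}
separate = {'W_K':  scale * jax.random.normal(k1, (d + 1, d + 1)),
            'W_Q':  scale * jax.random.normal(k2, (d + 1, d + 1)),
            'v':    scale * jax.random.normal(k3, (d + 1,))}

print("merged key and query:");   train(merged)
print("separate key and query:"); train(separate)
```

If the runs behave as the thread describes, the merged parametrization should show a single abrupt loss drop while the separate parametrization tends to linger on plateaus before each improvement.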
@SaxeLab
Andrew Saxe
2 months
RT @sebastiangoldt: Really happy to see this paper out, led by @nishpathead in collaboration with @stefsmlab and @SaxeLab: we apply the sta….
0
7
0
@SaxeLab
Andrew Saxe
2 months
RT @stefsmlab: Our paper just came out in PRX! Congrats to @nishpathead and the rest of the team. TL;DR: We analyse neural network lear…
0
3
0
@SaxeLab
Andrew Saxe
2 months
RT @GabyMohamady: How do cognitive maps fail? And how can this help us understand/treat psychosis? My lab at Experimental Psychology, Oxfor….
0
15
0
@SaxeLab
Andrew Saxe
2 months
RT @sebastiangoldt: If I had known about this master's when I was coming out of my Bachelor's, I would have applied in a heartbeat, so please h…
0
5
0
@SaxeLab
Andrew Saxe
3 months
RT @devonjarvi5: Our paper, “Make Haste Slowly: A Theory of Emergent Structured Mixed Selectivity in Feature Learning ReLU Networks” will b….
0
16
0
@SaxeLab
Andrew Saxe
3 months
RT @zdeborova: Happy to share the recording of my plenary talk at Cosyne 2025 two days ago. You will learn about the statistical physics ap….
0
21
0
@SaxeLab
Andrew Saxe
4 months
RT @ClementineDomi6: Our paper, “A Theory of Initialization’s Impact on Specialization,” has been accepted to ICLR 2025!..
0
20
0
@SaxeLab
Andrew Saxe
4 months
RT @BlavatnikAwards: 2025 @Blavatnikawards UK 🇬🇧 Finalist Andrew Saxe from UCL was featured on the @BBC Science Focus Instant Genius Podcas….
0
9
0
@SaxeLab
Andrew Saxe
7 months
New paper with @leonlufkin and @ermgrant! Why do we see localized receptive fields so often, even in models without sparsity regularization? We present a theory in the minimal setting from @ai_ngrosso and @sebastiangoldt.
@leonlufkin
Leon
7 months
We’re excited to share our paper analyzing how data drives the emergence of localized receptive fields in neural networks! w/ @SaxeLab @ermgrant. Come see our #NeurIPS2024 spotlight poster today at 4:30–7:30 in the East Hall! Paper:
0
14
87