Suraj Anand
@surajk610
Followers: 82 · Following: 248 · Media: 6 · Statuses: 18

Brown University

Joined October 2022
@surajk610
Suraj Anand
2 months
RT @Michael_Lepori: I'm very excited that this work was accepted for an oral presentation @naacl! Come by at 10:45 on Thursday to hear how…
@surajk610
Suraj Anand
2 months
Excited to be at #ICLR2025 in a few days to present this work with @Michael_Lepori! Interested in chatting about training dynamics, mechinterp, memory-efficient training, info theory or anything else! Please dm me.
@surajk610
Suraj Anand
4 months
RT @Aaditya6284: Transformers employ different strategies through training to minimize loss, but how do these trade off and why? Excited to…
@surajk610
Suraj Anand
1 year
See the preprint for many more details. Many thanks to @BrownCSDept and @CarneyInstitute for supporting this work! Link:
@surajk610
Suraj Anand
1 year
One strong reason for the success of LMs is their capacity for ICL and IWL strategies to coexist, a behavior that emerges organically under a moderately skewed Zipfian token distribution. We can now develop a dual process strategy for more heavily skewed distributions!
@surajk610
Suraj Anand
1 year
Our main finding is a simple training protocol that yields flexible models! We can now achieve strong ICL generalization on both rare and new tokens.
@surajk610
Suraj Anand
1 year
By choosing an optimal N, we can encode a dual process strategy for many token distributions while maintaining structural ICL performance on all distributions.
[attached image]
@surajk610
Suraj Anand
1 year
We find that by varying N, we can modulate the model’s dependence on in-weights information for frequently seen tokens while maintaining structural ICL performance on unseen tokens.
[attached image]
@surajk610
Suraj Anand
1 year
To retain useful in-weights information, we introduce temporary forgetting: we apply active forgetting (re-initializing the embedding matrix every k steps) only during the first N steps of training (N >> k), after which we allow the embedding matrix to train as usual.
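A minimal sketch of what this temporary-forgetting schedule could look like in PyTorch. The `model.embedding` attribute, the batch interface, and the loop details are illustrative assumptions, not the paper's actual code; only the schedule (resets every k steps during the first N steps) follows the tweet.

    import torch

    def train_with_temporary_forgetting(model, batches, optimizer, k, N):
        # Temporary forgetting: apply active forgetting (re-initialize the
        # embedding matrix every k steps) only for the first N steps, with
        # N >> k; afterwards the embeddings train normally and can store
        # useful in-weights information about frequent tokens.
        for step, batch in enumerate(batches, start=1):
            loss = model(batch)  # assumes the model returns a scalar loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if step <= N and step % k == 0:
                torch.nn.init.normal_(model.embedding.weight, std=0.02)  # fresh random init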
@surajk610
Suraj Anand
1 year
But resetting the embedding matrix destroys the model’s ability to encode semantic information. Ideally, a model encodes a dual process strategy: maintain an ICL solution for uncommon/unseen tokens while memorizing information in weights for frequent tokens.
@surajk610
Suraj Anand
1 year
To promote structural ICL, we utilize a training procedure recently introduced by @yihong_thu et al.: active forgetting. We re-initialize the embedding matrix every k steps during training so each token’s embedding encodes no information and the model must use structural ICL.
[attached image]
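For concreteness, a minimal sketch of this active-forgetting loop in PyTorch; as in the temporary-forgetting sketch above, `model.embedding` and the loop interface are assumed names rather than the authors' code.

    import torch

    def train_with_active_forgetting(model, batches, optimizer, k):
        # Active forgetting: wipe the embedding matrix every k steps so no
        # per-token information persists in the embeddings and the model is
        # forced to rely on in-context (structural) solutions.
        for step, batch in enumerate(batches, start=1):
            loss = model(batch)  # assumes the model returns a scalar loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if step % k == 0:
                torch.nn.init.normal_(model.embedding.weight, std=0.02)  # re-initialize embeddings

Temporary forgetting (sketched earlier in the thread) is the same loop with the reset additionally gated on step <= N.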
@surajk610
Suraj Anand
1 year
In both settings, we find that structural ICL is transient: the performance of in-context algorithms on unseen tokens emerges early in training, but quickly vanishes. In this paper, we explore how to maintain this ability without sacrificing model performance.
[attached image]
@surajk610
Suraj Anand
1 year
In both naturalistic and synthetic settings, we study ICL on rare and unseen tokens, which we term structural ICL. In structural ICL settings, models must generalize purely on the basis of structure (e.g., sentence or task structure), rather than semantic content encoded in token embeddings.
[attached image]
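One way to picture a structural ICL probe (an illustrative construction, not the paper's dataset): build an induction-style prompt from token ids that were held out of training, so the model can only succeed by exploiting the repeated structure.

    import torch

    # Illustrative structural-ICL probe (assumed setup, not the paper's data):
    # token ids >= HELD_OUT were never seen in training, so their embeddings
    # carry no learned content. Predicting `b` after the prompt below requires
    # using the repeated A-B pattern alone.
    HELD_OUT = 50_000
    a, b = HELD_OUT + 1, HELD_OUT + 2
    prompt = torch.tensor([[a, b, a, b, a]])  # structural continuation: b
    # logits = model(prompt)  # success: argmax at the last position is `b`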
@surajk610
Suraj Anand
1 year
How robust are in-context algorithms? In new work with @michael_lepori, @jack_merullo, and @brown_nlp, we explore why in-context learning disappears over training and fails on rare and unseen tokens. We also introduce a training intervention that fixes these failures.
[attached image]
@surajk610
Suraj Anand
1 year
RT @synth_labs: PINK ELEPHANTS! 🐘 Now, don’t think about it. Chatbots also find this supremely difficult. Ask one of the most popular open…
@surajk610
Suraj Anand
1 year
RT @jack_merullo_: Our #ICLR2024 paper was accepted as a spotlight: We look at whether language models reuse attention heads for functional…