Jessy Lin
@realJessyLin
5K Followers · 768 Following · 50 Media · 345 Statuses
PhD @Berkeley_AI, visiting researcher @AIatMeta. Interactive language agents 🤖💬
Joined March 2013
As part of our recent work on memory layer architectures, I wrote up some of my thoughts on the broader continual learning problem. Blog post: https://t.co/HNLqfNsQfN Some of the exposition goes beyond mem layers, so I thought it'd be useful to highlight it separately:
In our new post, we walk through great prior work from @agarwl_ & the @Alibaba_Qwen team exploring on-policy distillation, using an open-source recipe: you can run our experiments on Tinker today! https://t.co/7pVk87qTDH I'm especially excited by the use of on-policy…
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training models for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other…
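As a rough illustration of the core idea (not the exact recipe from the post), here's a minimal PyTorch sketch of one on-policy distillation step, assuming Hugging Face-style `student`/`teacher` causal LMs: the student samples its own rollouts, and the teacher supplies a dense per-token signal via reverse KL on exactly the states the student visits.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, prompts, tokenizer, optimizer,
                           max_new_tokens=256):
    """One on-policy distillation step (simplified sketch)."""
    with torch.no_grad():
        # On-policy data: sample completions from the *student* itself.
        inputs = tokenizer(prompts, return_tensors="pt", padding=True)
        rollouts = student.generate(**inputs, do_sample=True,
                                    max_new_tokens=max_new_tokens)
    # Score the sampled sequences under both models.
    student_logits = student(rollouts).logits[:, :-1]
    with torch.no_grad():
        teacher_logits = teacher(rollouts).logits[:, :-1]
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # Reverse KL(student || teacher) per token: dense, graded feedback on
    # the student's own mistakes. (A real recipe would mask prompt and
    # padding tokens and penalize only the sampled completion.)
    loss = (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```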
I talk about the design space in the post, and try to motivate why memory layers make sense from first principles. Overall, there's still so much to explore here, and I'm excited to keep working on these questions. Reach out if you're thinking about similar problems! :)
There's a huge spectrum of approaches to memory/continual learning - ranging from RAG to dreams of "infinite context" generalization to baking in new knowledge w/ gradient updates. I'm personally bullish on parametric updates that allow the model itself to get smarter over time
On integration: people usually talk about "catastrophic forgetting," but I think the broader problem is merging w/ knowledge that already exists in the model. We should design methods that make targeted updates while allowing the model to learn how to organize its…
On generalization: it's easy to memorize seqs verbatim, but we want models to learn the right "takeaway" from a piece of experience. E.g. when I give feedback to an agent like "You should call X function in the new API," what gradient update should it do? (not next token…)
I think of continual learning as two subproblems: Generalization: given a piece of data (user feedback, a piece of experience, etc.), what update should we do to learn the "important bits"? Forgetting / Integration: given a piece of data, how…
@AIatMeta I'm really grateful for my great collaborators! ❤️ @LukeZettlemoyer @gargighosh @scottyih @aramHmarkosyan @vinceberges @barlas_berkeley Paper: https://t.co/1eHp9OQfEm Blog post with some broader thoughts on the continual learning problem: https://t.co/HNLqfNsQfN [n/n]
@AIatMeta Looking forward: memory layers introduce a rich design space for architecture and memory. E.g.: we pretrain memory layers so the model learns to organize its memory on diverse data, but it'd be interesting to add in new slots at finetuning time, or to investigate connections to…
@AIatMeta We were curious whether memory slots actually store interpretable concepts. k=32 slots are accessed on each token, but if we visualize just the ones that are trainable with our method, they seem to align with entity boundaries. Beyond this prelim analysis, I think there's a lot
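For readers who want to poke at this kind of analysis themselves, a tiny hypothetical helper (not from the paper): log the top-k slot ids each token accesses in a memory layer, then invert the mapping to see which tokens share a slot.

```python
from collections import defaultdict

def tokens_per_slot(topk_slot_ids, tokens):
    """Group tokens by the memory slots they accessed, to eyeball whether
    a slot's accesses line up with entity or concept boundaries.

    topk_slot_ids: per-token lists of accessed slot ids, e.g. [[3, 91], ...]
    tokens:        the corresponding token strings
    """
    by_slot = defaultdict(list)
    for pos, (slots, tok) in enumerate(zip(topk_slot_ids, tokens)):
        for slot in slots:
            by_slot[slot].append((pos, tok))
    return by_slot
```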
@AIatMeta To characterize the tradeoff between learning and forgetting for each method, we can sweep across the hyperparameters that control learning (e.g. rank/alpha and learning rate for LoRA); sparse mem finetuning dominates. [6/n]
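A sketch of what such a sweep looks like, with `train_fn` / `eval_target` / `eval_heldout` as stand-ins for a finetuning run and the two evaluation suites (these names are illustrative, not from the paper): each config yields one (learned, retained) point, and plotting the points per method traces its frontier.

```python
from itertools import product

def tradeoff_curve(train_fn, eval_target, eval_heldout, configs):
    """Trace one method's learning/forgetting tradeoff across a sweep."""
    points = []
    for cfg in configs:
        model = train_fn(**cfg)                       # one finetuning run
        points.append({"cfg": cfg,
                       "learned": eval_target(model),     # new-fact accuracy
                       "retained": eval_heldout(model)})  # held-out accuracy
    return points

# Example: sweep the knobs that control how aggressively LoRA learns.
lora_configs = [{"rank": r, "alpha": a, "lr": lr}
                for r, a, lr in product([8, 64], [16, 128], [1e-5, 1e-4])]
```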
@AIatMeta We evaluate on two continual fact learning tasks: learning from a stream of TriviaQA facts and a stream of Wikipedia docs from SimpleQA. Sparse memory finetuning learns just as much as full finetuning and LoRA, but degrades much less on held-out tasks, w/ the benefit of selective…
@AIatMeta Leveraging this sparsity, we propose to update just the memory slots that are specific to a particular input (highly accessed on this input, but not frequently accessed on other data, e.g. pretraining), using TF-IDF to rank the slots. This implements memory selectivity. When we…
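A minimal sketch of the ranking step, assuming we've logged per-slot access counts for the current input and for a background corpus; the exact scoring details in the paper may differ.

```python
import torch

def select_slots(input_counts, background_counts, t=32, eps=1.0):
    """TF-IDF-style slot selection: prefer slots accessed often on *this*
    input (term frequency) but rarely across background/pretraining data
    (inverse document frequency); only the top-t get gradient updates.

    input_counts:      [num_slots] access counts on the current input
    background_counts: [num_slots] access counts over background data
    t:                 number of slots to update (illustrative default)
    """
    tf = input_counts.float() / input_counts.sum().clamp(min=1)
    idf = torch.log((background_counts.sum() + eps) / (background_counts + eps))
    return (tf * idf).topk(t).indices
```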
@AIatMeta In our work, we show how recently-proposed memory layer architectures provide a potential solution: replace an FFN with a sparse attention lookup into a huge pool of learned memory keys and values. This arch enables granular control over which params we update on each input.
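A stripped-down sketch of such a layer, in the naive form that scores every slot (real implementations use product-key tricks so the top-k lookup scales to millions of slots):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryLayer(nn.Module):
    """Memory layer sketch: instead of an FFN, each token queries a large
    pool of learned keys/values and reads back a sparse top-k mixture."""
    def __init__(self, d_model, num_slots, k=32):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.keys = nn.Parameter(torch.randn(num_slots, d_model) * d_model ** -0.5)
        self.values = nn.Parameter(torch.zeros(num_slots, d_model))
        self.k = k

    def forward(self, x):                        # x: [batch, seq, d_model]
        scores = self.q_proj(x) @ self.keys.T    # [batch, seq, num_slots]
        topv, topi = scores.topk(self.k, dim=-1)
        w = F.softmax(topv, dim=-1)              # weights over k active slots
        # Gather the k selected value vectors per token and mix them.
        return (w.unsqueeze(-1) * self.values[topi]).sum(dim=-2)
```

Because each token touches only k slots, gradient updates can be restricted to exactly those slots' rows, which is the granular control the thread describes.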
@AIatMeta To learn something new, we shouldn't need to finetune all the parameters of a large model. This motivates parameter-efficient methods for continual learning/memory, like LoRA and Cartridges, which add a small set of params to the model. However, LoRA is inherently low-capacity…
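For contrast, a minimal LoRA sketch, which makes the capacity point concrete: the frozen pretrained weight is bypassed by a rank-r additive update, so everything newly learned must fit inside two small matrices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA sketch: freeze the pretrained linear layer and learn a low-rank
    additive update scale * (B @ A); capacity is capped by the rank r."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # pretrained weights frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```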
🧠 How can we equip LLMs with memory that allows them to continually learn new things? In our new paper with @AIatMeta, we show how sparsely finetuning memory layers enables targeted updates for continual learning, w/ minimal interference with existing knowledge. While full…
What if scaling the context windows of frontier LLMs is much easier than it sounds? We're excited to share our work on Recursive Language Models (RLMs), a new inference strategy where LLMs can decompose and recursively interact with input prompts of seemingly unbounded length…
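The actual RLM mechanism is richer than this, but a toy divide-and-recurse loop conveys the flavor of decomposing a long prompt; `llm` is an assumed `prompt -> str` callable, and this is an illustration of the recursive idea, not the paper's algorithm.

```python
def rlm_answer(llm, question, context, max_chars=8000):
    """Toy recursive decomposition: answer directly if the context fits,
    otherwise recurse on halves and combine the partial answers."""
    if len(context) <= max_chars:
        return llm(f"Context:\n{context}\n\nQuestion: {question}")
    mid = len(context) // 2
    left = rlm_answer(llm, question, context[:mid], max_chars)
    right = rlm_answer(llm, question, context[mid:], max_chars)
    return llm("Partial answers from two halves of a long document:\n"
               f"1) {left}\n2) {right}\n\nCombine them to answer: {question}")
```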
Excited to introduce Dreamer 4, an agent that learns to solve complex control tasks entirely inside of its scalable world model! Dreamer 4 pushes the frontier of world model accuracy, speed, and learning complex tasks from offline datasets. Co-led with @wilson1yan
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely, more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.
@NickATomlin In case you missed part 1 - on the current challenges for LLM user simulators: https://t.co/VjZJUB9WWt
User simulators bridge RL with real-world interaction // https://t.co/bsrYxVHuVo How do we get the RL paradigm to work on tasks beyond math & code? Instead of designing datasets, RL requires designing environments. Given that most non-trivial real-world tasks involve…
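A bare-bones sketch of what a user simulator buys you: it plays the environment side of a dialogue episode, so RL rollouts don't need a human in the loop. Here `agent_llm` and `user_sim_llm` are assumed `prompt -> str` callables, and the `[DONE]` stop marker is purely illustrative.

```python
def rollout(agent_llm, user_sim_llm, task, max_turns=8):
    """Collect one dialogue episode against a simulated user."""
    history = [f"Task: {task}"]
    for _ in range(max_turns):
        agent_msg = agent_llm("\n".join(history) + "\nAssistant:")
        history.append(f"Assistant: {agent_msg}")
        user_msg = user_sim_llm("\n".join(history) + "\nUser:")
        history.append(f"User: {user_msg}")
        if "[DONE]" in user_msg:   # simulator signals task completion
            break
    return history                 # score with a reward model / verifier
```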