Jessy Lin

@realJessyLin

Followers 5K · Following 768 · Media 50 · Statuses 345

PhD @Berkeley_AI, visiting researcher @AIatMeta. Interactive language agents πŸ€– πŸ’¬

Joined March 2013
@realJessyLin
Jessy Lin
14 days
As part of our recent work on memory layer architectures, I wrote up some of my thoughts on the continual learning problem broadly: Blog post: https://t.co/HNLqfNsQfN Some of the exposition goes beyond mem layers, so I thought it'd be useful to highlight separately:
26
170
1K
@_kevinlu
Kevin Lu
8 days
in our new post, we walk through great prior work from @agarwl_ & the @Alibaba_Qwen team exploring on-policy distillation using an open source recipe: you can run our experiments on Tinker today! https://t.co/7pVk87qTDH i'm especially excited by the use of on-policy…
@thinkymachines
Thinking Machines
8 days
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training models for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other…
12
25
325
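A minimal numpy sketch of the idea in the tweet above, under stated assumptions: the student samples its own tokens (the on-policy part, as in RL) while the teacher provides a dense per-position signal (as in SFT), here a per-token reverse KL. The toy logits, shapes, and the reverse-KL choice are illustrative assumptions, not the recipe from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def on_policy_distill_step(student_logits, teacher_logits):
    # 1) Sample each token from the STUDENT's own distribution (on-policy).
    #    In a full loop, these samples would form the context that the
    #    next position is conditioned on.
    p_s = softmax(student_logits)                    # (T, vocab)
    p_t = softmax(teacher_logits)
    tokens = [int(rng.choice(len(p), p=p)) for p in p_s]
    # 2) Dense per-position signal: reverse KL(student || teacher).
    kl = (p_s * (np.log(p_s) - np.log(p_t))).sum(axis=-1)
    return tokens, float(kl.mean())

student = rng.normal(size=(4, 10))  # toy logits: 4 positions, vocab 10
teacher = rng.normal(size=(4, 10))
tokens, loss = on_policy_distill_step(student, teacher)
print(len(tokens), loss >= 0.0)  # 4 True
```

Because the loss is scored on the student's own samples rather than fixed reference text, the gradient corrects mistakes the student actually makes.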
@realJessyLin
Jessy Lin
14 days
I talk about the design space in the post, and try to motivate why memory layers make sense from first principles. Overall, there's still so much to explore here and I'm excited to keep working on these questions – reach out if you're thinking about similar problems! :)
0
2
32
@realJessyLin
Jessy Lin
14 days
There's a huge spectrum of approaches to memory/continual learning - ranging from RAG to dreams of "infinite context" generalization to baking in new knowledge w/ gradient updates. I'm personally bullish on parametric updates that allow the model itself to get smarter over time.
6
4
84
@realJessyLin
Jessy Lin
14 days
On integration: people usually talk about "catastrophic forgetting," but I think the broader problem is merging w/ knowledge that already exists in the model. We should design methods that make targeted updates while allowing the model to 𝘭𝘦𝘒𝘳𝘯 how to organize its…
1
1
32
@realJessyLin
Jessy Lin
14 days
On generalization: it's easy to memorize seqs verbatim, but we want models to learn the right "takeaway" from a piece of experience – e.g. when I give feedback to an agent like "You should call X function in the new API," what gradient update should it do? (not next token…)
1
0
34
@realJessyLin
Jessy Lin
14 days
I think of continual learning as two subproblems: π†πžπ§πžπ«πšπ₯𝐒𝐳𝐚𝐭𝐒𝐨𝐧: given a piece of data (user feedback, a piece of experience, etc.), what update should we do to learn the β€œimportant bits”? π…π¨π«π πžπ­π­π’π§π  / 𝐈𝐧𝐭𝐞𝐠𝐫𝐚𝐭𝐒𝐨𝐧: given a piece of data, how…
1
0
45
@realJessyLin
Jessy Lin
14 days
@AIatMeta I'm really grateful for my great collaborators!β™₯️ @LukeZettlemoyer @gargighosh @scottyih @aramHmarkosyan @vinceberges @barlas_berkeley πŸ“„Paper: https://t.co/1eHp9OQfEm πŸ“’ Blog post with some broader thoughts on the continual learning problem: https://t.co/HNLqfNsQfN [n/n]
2
10
101
@realJessyLin
Jessy Lin
14 days
@AIatMeta Looking forward: memory layers introduce a rich design space for architecture/memory design. E.g.: we pretrain memory layers so the model learns to organize its memory on diverse data, but it'd be interesting to add in new slots at finetuning time or investigate connections to…
3
1
54
@realJessyLin
Jessy Lin
14 days
@AIatMeta We were curious whether memory slots actually store interpretable concepts. k=32 slots are accessed on each token, but if we visualize just the ones that are trainable with our method, they seem to align with entity boundaries. Beyond this prelim analysis, I think there's a lot…
1
1
42
@realJessyLin
Jessy Lin
14 days
@AIatMeta To characterize the tradeoff between learning and forgetting for each method, we can sweep across the hyperparameters that control learning (e.g. rank/alpha and learning rate for LoRA) – sparse mem finetuning dominates. [6/n]
1
1
46
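The sweep described in the tweet above can be mocked up as follows. The two evaluation curves here are toy stand-ins with hypothetical shapes (not measured results), just to show how sweeping the hyperparameter that controls learning traces out a learning-vs-forgetting frontier for each method.

```python
import math

# Toy stand-ins (hypothetical, NOT real results): a more aggressive
# learning rate learns more new facts but also forgets more.
def new_fact_accuracy(lr):
    return 1.0 - math.exp(-50 * lr)    # saturating gains on the new facts

def heldout_accuracy(lr):
    return 0.9 * math.exp(-20 * lr)    # monotone degradation on held-out tasks

# Each (learned, held-out) pair is one point on a method's tradeoff
# frontier; a dominating method sits above and to the right of the others.
for lr in [1e-3, 3e-3, 1e-2, 3e-2]:
    print(f"lr={lr:g}  learned={new_fact_accuracy(lr):.2f}  "
          f"held-out={heldout_accuracy(lr):.2f}")
```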
@realJessyLin
Jessy Lin
14 days
@AIatMeta We evaluate on two continual fact learning tasks: learning from a stream of TriviaQA facts and a stream of Wikipedia docs from SimpleQA. Sparse memory finetuning learns just as much as full finetuning and LoRA, but degrades much less on held-out tasks w/ the benefit of selective…
1
0
42
@realJessyLin
Jessy Lin
14 days
@AIatMeta Leveraging this sparsity, we propose to update just the memory slots that are specific to a particular input – highly accessed on this input, but not frequently accessed on other data (e.g. pretraining), using TFIDF to rank the slots. This implements memory selectivity. When we…
1
0
54
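A small sketch of the TF-IDF slot ranking described above: slots that are heavily accessed on the current input (term frequency) but rarely accessed on background/pretraining data (inverse document frequency) rank highest, and only those would receive gradient updates. The counts, smoothing constant, and `top_t` cutoff are illustrative assumptions, not the paper's settings.

```python
import math
from collections import Counter

def rank_slots_tfidf(input_slot_counts, background_freq, top_t=8):
    # TF: how often a slot is accessed on THIS input.
    # IDF: penalize slots that are also common on background data.
    total = sum(input_slot_counts.values())
    scores = {}
    for slot, count in input_slot_counts.items():
        tf = count / total
        idf = math.log(1.0 / (background_freq.get(slot, 0.0) + 1e-6))
        scores[slot] = tf * idf
    # Only the top-ranked slots would be finetuned on this input.
    return sorted(scores, key=scores.get, reverse=True)[:top_t]

# Hypothetical access counts on one input, and background access rates:
input_counts = Counter({7: 10, 42: 9, 3: 8, 99: 1})
background = {7: 0.5, 42: 0.001, 3: 0.002, 99: 0.0001}
print(rank_slots_tfidf(input_counts, background, top_t=2))  # [42, 3]
```

Note how slot 7, despite being the most accessed on this input, is filtered out because it is also accessed on half the background data.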
@realJessyLin
Jessy Lin
14 days
@AIatMeta In our work, we show how recently-proposed memory layer architectures provide a potential solution: replace an FFN with a sparse attention lookup into a huge pool of learned memory keys and values. This arch enables granular control over what params we update on each input.
2
2
61
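The lookup described above can be sketched in numpy: score a large pool of learned keys, keep only the top-k slots per token, and mix the corresponding values with attention weights. Dimensions, pool size, and initialization here are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def memory_layer(x, keys, values, k=32):
    # x: (T, d). Score every memory slot, then keep only the top-k per token.
    scores = x @ keys.T                                      # (T, n_slots)
    topk_idx = np.argpartition(scores, -k, axis=-1)[:, -k:]  # (T, k)
    topk_scores = np.take_along_axis(scores, topk_idx, axis=-1)
    # Softmax over just the k selected slots (sparse attention).
    w = np.exp(topk_scores - topk_scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    # Mix the selected memory values with the attention weights.
    return np.einsum("tk,tkd->td", w, values[topk_idx])

d, n_slots = 16, 1024
keys = rng.normal(size=(n_slots, d)) * 0.02    # learned memory keys
values = rng.normal(size=(n_slots, d)) * 0.02  # learned memory values
x = rng.normal(size=(5, d))                    # 5 token activations
out = memory_layer(x, keys, values, k=32)
print(out.shape)  # (5, 16)
```

Because each token touches only k of the n_slots entries, a gradient step on one input can be confined to a handful of slots rather than the whole layer.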
@realJessyLin
Jessy Lin
14 days
@AIatMeta To learn something new, we shouldn’t need to finetune all the parameters of a large model. This motivates parameter-efficient methods for continual learning/memory, like LoRA and Cartridges, which add a small set of params to the model. However, LoRA is inherently low-capacity…
1
1
64
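For contrast, a minimal sketch of the LoRA parameterization mentioned above: the frozen weight W is augmented with a trainable low-rank product B·A, so the total update can never exceed rank r, which is the capacity limit the tweet refers to. Shapes and scaling follow the standard LoRA formulation; the specific numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 32, 64, 4, 8   # arbitrary illustrative sizes

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))               # trainable up-projection, zero init

def lora_forward(x):
    # y = Wx + (alpha/rank) * B(Ax); the update to W has rank <= `rank`.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B initialized to zero, the adapted model matches the base model.
assert np.allclose(lora_forward(x), W @ x)
print(lora_forward(x).shape)  # (32,)
```

Every input passes through the same rank-r bottleneck, whereas a memory layer can devote different slots to different inputs.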
@realJessyLin
Jessy Lin
14 days
🧠 How can we equip LLMs with memory that allows them to continually learn new things? In our new paper with @AIatMeta, we show how sparsely finetuning memory layers enables targeted updates for continual learning, w/ minimal interference with existing knowledge. While full…
52
299
2K
@a1zhang
Alex L Zhang
20 days
What if scaling the context windows of frontier LLMs is much easier than it sounds? We’re excited to share our work on Recursive Language Models (RLMs): a new inference strategy where LLMs can decompose and recursively interact with input prompts of seemingly unbounded length…
125
351
3K
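One way to picture the recursive strategy described above (a hedged sketch, not the paper's actual algorithm): if the prompt exceeds a context budget, split it, recurse on each half, then recurse once more over the combined partial answers. The `llm` callable and the character-based budget are hypothetical stand-ins.

```python
def answer(prompt, llm, max_chars=2000):
    # If the prompt fits the budget, call the model directly...
    if len(prompt) <= max_chars:
        return llm(prompt)
    # ...otherwise split, recurse on each half, then recurse over the
    # combined partial answers until the result fits in one call.
    mid = len(prompt) // 2
    left = answer(prompt[:mid], llm, max_chars)
    right = answer(prompt[mid:], llm, max_chars)
    return answer(f"Combine these partial answers:\n{left}\n{right}",
                  llm, max_chars)

# Toy "model" that truncates, just to exercise the control flow:
toy_llm = lambda p: p[:40]
result = answer("x" * 10000, toy_llm)
print(len(result))  # 40
```

No single model call ever sees more than `max_chars` of input, which is what makes the effective context seem unbounded.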
@danijarh
Danijar Hafner
1 month
Excited to introduce Dreamer 4, an agent that learns to solve complex control tasks entirely inside of its scalable world model! πŸŒŽπŸ€– Dreamer 4 pushes the frontier of world model accuracy, speed, and learning complex tasks from offline datasets. co-led with @wilson1yan
82
354
3K
@thinkymachines
Thinking Machines
1 month
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely, more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.
81
561
3K
@realJessyLin
Jessy Lin
1 month
@NickATomlin In case you missed part 1 - on the current challenges for LLM user simulators: https://t.co/VjZJUB9WWt
@realJessyLin
Jessy Lin
4 months
User simulators bridge RL with real-world interaction // https://t.co/bsrYxVHuVo How do we get the RL paradigm to work on tasks beyond math & code? Instead of designing datasets, RL requires designing environments. Given that most non-trivial real-world tasks involve…
0
1
1