Jyo Pari

@jyo_pari

Followers: 3K · Following: 810 · Media: 37 · Statuses: 149

Working on continual learning | PhD @MIT

Boston
Joined December 2021
@jyo_pari
Jyo Pari
6 months
What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
135
529
3K
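A minimal sketch of the loop the tweet describes; the helper callables are hypothetical placeholders, not names from the SEAL paper or its code:

```python
# Illustrative only: every helper below is a placeholder, not the SEAL API.
def seal_step(model, new_input, eval_set,
              generate_self_edit,  # (model, new_input) -> synthetic training data
              finetune,            # (model, data) -> weight-updated copy of the model
              evaluate,            # (model, eval_set) -> scalar downstream score
              reinforce):          # policy-gradient step on the self-edit tokens
    self_edit = generate_self_edit(model, new_input)  # model writes its own training data
    updated = finetune(model, self_edit)              # apply the self-edit to the weights
    reward = evaluate(updated, eval_set)              # downstream performance of the updated model
    reinforce(model, self_edit, reward)               # RL: reward the self-edit policy
    return updated, reward
```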
@jyo_pari
Jyo Pari
19 days
Next Tuesday, @shannonzshen will present hybrid chain-of-thought, a method that mixes latent and discrete tokens during decoding 🔥 🗓️ Nov 25, 3pm ET @scaleml
1
7
50
@jyo_pari
Jyo Pari
27 days
Why do deep learning optimizers make progress even in the edge-of-stability regime? 🤔 @alex_damian_ will present theory that can describe the dynamics of optimization in this regime! 🗓️ Nov 17, 3pm ET @scaleml
0
10
72
@IdanShenfeld
idan shenfeld
1 month
Everyone’s talking about Kimi K2 Thinking and its impressive performance. No full report yet, but judging from the Kimi K2/K1.5 reports, it likely uses Policy Mirror Descent - an RL trick that’s quietly becoming standard in frontier labs. Let’s break down what it is:
12
47
478
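For context, the textbook form of the Policy Mirror Descent update looks roughly like this (my paraphrase; the Kimi reports may instantiate it differently):

```latex
% Policy Mirror Descent, standard textbook form (paraphrase, not from the Kimi reports).
% Each step maximizes advantage while staying KL-close to the current policy \pi_k:
\begin{align}
  \pi_{k+1}
    &= \arg\max_{\pi}\;
       \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ A^{\pi_k}(s, a) \right]
       \;-\; \frac{1}{\eta}\,
       \mathrm{KL}\!\left( \pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s) \right), \\
  \pi_{k+1}(a \mid s)
    &\propto \pi_k(a \mid s)\,
       \exp\!\left( \eta\, A^{\pi_k}(s, a) \right)
       \quad \text{(closed-form solution).}
\end{align}
```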
@_kevinlu
Kevin Lu
1 month
in our new post, we walk through great prior work from @agarwl_ & the @Alibaba_Qwen team exploring on-policy distillation using an open source recipe: you can run our experiments on Tinker today! https://t.co/7pVk87qTDH i'm especially excited by the use of on-policy
@thinkymachines
Thinking Machines
1 month
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When we use it to train models for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other
13
24
323
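A rough sketch of how an on-policy distillation loss can be computed, assuming per-token logits from both models on a rollout the student generated itself; this is one reading of the idea, not necessarily the exact Tinker / Thinking Machines recipe:

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """Sketch of an on-policy distillation loss (illustrative, not the official recipe).

    Both tensors are [seq_len, vocab] logits scored on a rollout that the *student*
    generated itself (that is what makes it on-policy); the teacher only re-scores
    those same tokens.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # Per-position reverse KL(student || teacher): dense per-token supervision that
    # pushes the student away from continuations the teacher assigns low probability.
    per_token_kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    return per_token_kl.mean()
```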
@jyo_pari
Jyo Pari
2 months
Very interesting! We could use RLMs for complex reasoning problems where models solve sub-problems in parallel, unlocking a new dimension of scaling!
@a1zhang
Alex L Zhang
2 months
What if scaling the context windows of frontier LLMs is much easier than it sounds? We’re excited to share our work on Recursive Language Models (RLMs). A new inference strategy where LLMs can decompose and recursively interact with input prompts of seemingly unbounded length,
1
4
30
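A toy sketch of the recursive-decomposition idea (illustrative only; in the actual RLM work the model itself decides how to inspect and split the prompt rather than following a fixed chunker):

```python
def recursive_answer(llm_call, prompt: str, question: str,
                     max_chars: int = 8_000) -> str:
    """Toy recursion over an oversized prompt. llm_call(text) -> str is any LLM query."""
    if len(prompt) <= max_chars:
        # Base case: the context fits, answer directly.
        return llm_call(f"{prompt}\n\nQuestion: {question}")
    # Recursive case: split the prompt, answer each half, then synthesize.
    mid = len(prompt) // 2
    left = recursive_answer(llm_call, prompt[:mid], question, max_chars)
    right = recursive_answer(llm_call, prompt[mid:], question, max_chars)
    return llm_call(
        "Combine these partial answers into one final answer.\n"
        f"Answer A: {left}\nAnswer B: {right}\nQuestion: {question}"
    )
```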
@moritz_reuss
Moritz Reuss
2 months
VLAs have become the fastest-growing subfield in robot learning. So where are we now? After reviewing ICLR 2026 submissions and conversations at CoRL, I wrote an overview of the current state of VLA research with some personal takes: https://t.co/OMMdB1MHtS
mbreuss.github.io
Comprehensive analysis of Vision-Language-Action models at ICLR 2026 - discrete diffusion, reasoning VLAs, and benchmark insights.
11
102
534
@jyo_pari
Jyo Pari
2 months
After weeks of learning about systems at @scaleml, we’re shifting gears to video foundation models. Thrilled to have @cloneofsimo sharing how to train them from scratch next Tuesday — no better person to learn from 🔥
5
11
128
@jyo_pari
Jyo Pari
2 months
Next Tuesday, @scaleml hosts @kavnwang & Kristine Lu for a tutorial based on https://t.co/XKCnl7lUpy 🚀 They'll cover distributed training/inference of large models, plus the math & tradeoffs of latency, throughput, and model size in GPU comms!
2
14
128
@jyo_pari
Jyo Pari
2 months
A great read 👏
@jxbz
Jeremy Bernstein
3 months
I wrote this blog post that tries to go further toward design principles for neural nets and optimizers. The post presents a visual intro to optimization on normed manifolds and a Muon variant for the manifold of matrices with unit condition number: https://t.co/EhhKN2Jylx
0
0
4
@abeirami
Ahmad Beirami
3 months
This great work co-led by @IdanShenfeld and @jyo_pari shows that online RL leads to less forgetting because it inherently leads to a solution with a small reverse KL divergence! I'll try to discuss the significance of the result: 🧵
@jyo_pari
Jyo Pari
3 months
For agents to improve over time, they can’t afford to forget what they’ve already mastered. We found that supervised fine-tuning forgets more than RL when training on a new task! Want to find out why? 👇
2
3
33
@HanGuo97
Han Guo
3 months
Very interesting work! When I first learned about it earlier this summer, I became curious about the possible explanations for (catastrophic) forgetting during continual learning in SFT versus RL. It seems they've figured something out now :)
@jyo_pari
Jyo Pari
3 months
For agents to improve over time, they can’t afford to forget what they’ve already mastered. We found that supervised fine-tuning forgets more than RL when training on a new task! Want to find out why? 👇
0
5
29
@jyo_pari
Jyo Pari
3 months
Finally, two open questions still remain: 1️⃣ Why does less KL divergence evaluated on the new task correspond to less forgetting? 2️⃣ Can we design new fine-tuning methods that combine the simplicity and efficiency of SFT with the implicit bias of RL’s Razor?
5
2
35
@jyo_pari
Jyo Pari
3 months
In addition to the empirical findings, we were able to provide some theory that reveals why RL has an implicit bias towards minimal KL solutions. The animation shows how alternating projections between optimal policies (green) and the model class (blue) converge to min-KL solutions.
2
0
28
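One generic way to write such an alternating KL-projection scheme, to make the animation concrete; the notation and the direction of each KL are guesses, not taken from the paper:

```latex
% \Pi^* = set of reward-optimal policies for the new task, \Theta = representable model class.
\begin{align}
  q^{(t)}     &= \arg\min_{q \in \Pi^{*}} \mathrm{KL}\big( \pi^{(t)} \,\|\, q \big), \\
  \pi^{(t+1)} &= \arg\min_{\pi \in \Theta} \mathrm{KL}\big( q^{(t)} \,\|\, \pi \big).
\end{align}
% Iterating the two projections settles on an optimal policy inside the model class
% that stays KL-close to where the iteration started, matching the bias in the thread.
```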
@jyo_pari
Jyo Pari
3 months
Then we ask: why does RL result in a smaller KL and therefore less forgetting? Grounded in our empirical results, we introduce the principle we call RL’s Razor 🗡️: among the many high-reward solutions for a new task, on-policy methods such as RL are inherently biased toward the ones closest in KL to the original model.
2
3
46
@jyo_pari
Jyo Pari
3 months
To understand why, we looked for a variable that would explain forgetting across methods and hyperparameters. After searching, we find that the KL divergence from the base model, evaluated on the new task, is almost a perfect predictor of forgetting!
2
4
44
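Written out, the predictor variable is the following (the KL direction follows @abeirami's "reverse KL" framing earlier in the feed and is one reading, not a quote from the paper):

```latex
\begin{equation}
  D_{\mathrm{new}}(\theta)
    = \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{new}}}\!\left[
        \mathrm{KL}\big( \pi_{\theta}(\cdot \mid x) \,\big\|\, \pi_{0}(\cdot \mid x) \big)
      \right]
\end{equation}
% \pi_0 is the base model, \pi_\theta the fine-tuned model, and x ranges over new-task
% prompts only. The empirical claim: forgetting on old tasks tracks this single number
% almost perfectly, across methods and hyperparameters.
```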
@jyo_pari
Jyo Pari
3 months
We fully swept hyperparameters for both methods and plotted the Pareto frontier. Our finding holds for LLMs, robotics foundation models, and even a 3-layer MLP:
1
0
34
@jyo_pari
Jyo Pari
3 months
For agents to improve over time, they can’t afford to forget what they’ve already mastered. We found that supervised fine-tuning forgets more than RL when training on a new task! Want to find out why? 👇
20
147
911
@a1zhang
Alex L Zhang
3 months
All the recordings for the @GPU_MODE x @scaleml series are up as a playlist in case you missed it 😁 There's so much value in these ~8 hours of lectures, from proving quantization error bounds on a whiteboard to a deep-dive into GPU warp schedulers! Plz take advantage of it!
7
104
650