Jyo Pari @jyo_pari X Profile

Jyo Pari

@jyo_pari

Followers

2K

Following

755

Media

25

Statuses

121

Working on continual learning | PhD @MIT

Cambridge, MA

Joined December 2021

Don't wanna be here? Send us removal request.

Jyo Pari

@jyo_pari

2 months

What if an LLM could update its own weights?. Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.

131

529

3K

Jyo Pari

@jyo_pari

8 hours

RT @a1zhang: announcing the @GPU_MODE x @scaleml summer speaker series happening next week, a 5⃣-day series where top researchers will teac….

0

24

0

Grok

@grok

8 days

Generate videos in just a few seconds. Try Grok Imagine, free for a limited time.

416

687

3K

Jyo Pari

@jyo_pari

8 hours

Finally, this isn't possible without the amazing speakers!. @sewon__min, @Guangxuan_Xiao, @chrismdesa, @SonglinYang4, @simran_s_arora, @exists_forall. and the co-organizers!.@a1zhang, @HanGuo97 , @SonglinYang4, @pulkitology, @yoonrkim, @tri_dao, @lateinteraction.

0

1

Jyo Pari

@jyo_pari

8 hours

from aug 25 (mon) - aug 29 (fri), we dedicate each day to an invited speaker on a specific component of frontier models, e.g. PEs, MoEs, GPU programming, etc. for more details, see . the event will be live-streamed and recorded:

youtube.com

A GPU reading group and community https://discord.gg/gpumode Supplementary content here https://github.com/gpu-mode Created by Mark Saroufim and Andreas Köpf

1

0

1

Jyo Pari

@jyo_pari

8 hours

We have a fun collaboration of @GPU_MODE x @scaleml coming up!. We’re hosting a week-long online bootcamp that explores the core components of GPT-OSS while also diving into cutting-edge research that pushes beyond what’s currently in GPT-OSS!. For example, how can MoE's power

1

11

34

Jyo Pari

@jyo_pari

1 month

If you are interested in questioning how we should pretrain models and create new architectures for general reasoning . - then checkout E606 @ ICML, our position by @seungwookh and I on potential directions for the next generation reasoning models!

0

5

21

Jyo Pari

@jyo_pari

1 month

MoE Routers are trained a bit strangely but things seem to still work. @minyoung_huh and I got curious about combining specialized experts at test time through routing… and ended up deep in the weeds of MoE optimization. Here's a blog post! .

2

19

140

Jyo Pari

@jyo_pari

1 month

Current adaptive tokenizers still rely on humans to set the desired fidelity a priori. But what if the model could learn that itself? . The part I like a lot about this paper beyond the high level idea is the way @ShivamDuggal4 trained for this ability. Cudos 🎇!.

Shivam Duggal

@ShivamDuggal4

1 month

Compression is the heart of intelligence.From Occam to Kolmogorov—shorter programs=smarter representations. Meet KARL: Kolmogorov-Approximating Representation Learning. Given an image, token budget T & target quality 𝜖 —KARL finds the smallest t≤T to reconstruct it within 𝜖🧵

0

5

Jyo Pari

@jyo_pari

2 months

Thanks @willknight for covering SEAL! Really appreciate the thoughtful and insightful way you captured the work.

WIRED

@WIRED

2 months

Scientists at Massachusetts Institute of Technology have devised a way for large language models to keep learning on the fly—a step toward building AI that continually improves itself.

1

5

13

Jyo Pari

@jyo_pari

2 months

RT @AdamZweiger: An underrated and potentially more practical aspect of our Self-Adapting LMs paper is the potential for general pre/post-t….

0

5

0

Jyo Pari

@jyo_pari

2 months

RT @akyurekekin: There are three types of storage: activations (in-context), external memory, and model weights. If the models will spend….

0

16

0

Jyo Pari

@jyo_pari

2 months

RT @jxmnop: it really is incredible what kinds of things become possible when RL on LLMs works. clearly we’re just getting started.

0

54

0

Jyo Pari

@jyo_pari

2 months

@AdamZweiger and I had an amazing group to help us. Huge thanks to @HanGuo97 and @akyurekekin for the invaluable guidance throughout this project, and to @yoonrkim and @pulkitology for being incredible advisors. Paper: Website:

6

7

79

Jyo Pari

@jyo_pari

2 months

Limitations / Future Work: One of our original motivations was to work towards the ultimate goal of continual learning—think about agents continually self-adapting based on their interactions in an environment. While SEAL doesn't explicitly train for this, we still were curious

4

3

67

Jyo Pari

@jyo_pari

2 months

You may have noticed that generations kept increasing after each round of RL. This is expected since we get more diverse content containing relevant information. Could we just prompt the base model to generate longer sequences instead? We find that prompting for longer

2

0

50

Jyo Pari

@jyo_pari

2 months

Here is an example passage (Input Context) along with SEAL's self-edit generations (Rewrite) and subsequent responses to downstream questions after each round of RL.

1

0

56

Jyo Pari

@jyo_pari

2 months

While RL training is done in the single passage regime, where we can easily quantify the contribution of each self-edit generation, the SEAL model's self-edits are still useful in a continued pretraining setting, where we incorporate many passages in a single update.

1

0

54

Jyo Pari

@jyo_pari

2 months

For incorporating knowledge from a passage into weights, we find that after 2 rounds of RL training, each on a batch of 50 passages, self-editing even matches using synthetic data generated by GPT-4.1.

1

2

64

Jyo Pari

@jyo_pari

2 months

In the few-shot domain, we outperform both ICL and self-edits from the base model, though we still don't reach the optimal human-crafted test-time training (TTT) configuration. Note: these results are from a curated subset that is easier for small LMs.

2

68

Jyo Pari

@jyo_pari

2 months

We explore two settings: (1) incorporating knowledge from a passage, where self-edits are generated text in the form of "implications" of the passage, and (2) adapting to few-shot examples on ARC, where self-edits are tool-calls for data augmentation and optimization params, as

3

2

67