Jason Weston

@jaseweston

Followers: 13K · Following: 2K · Media: 168 · Statuses: 437

@MetaAI + NYU. NLP from scratch (Pretrain+FT LLM) 2008, MemNets (pre-Transformer) 2015, DrQA (pre-RAG) 2017, BlenderBot (dialog, pre-ChatGPT) 2018+, Self-Reward + more!

NYC
Joined April 2008
Jason Weston (@jaseweston) · 2 days
RT @danieljwkim: Can we improve Llama 3’s reasoning abilities through post-training only? Introducing ASTRO, our new framework that teaches…
Jason Weston (@jaseweston) · 4 days
🔁 We also find that distillation based on reasoning difficulty can improve the Pareto frontier of the student model’s inference efficiency.
- Training with a mix of full reasoning traces and the condensed answers enables efficient hybrid reasoning in the student model, by …
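A minimal sketch of how such a mixed distillation set might be assembled, assuming the student is fine-tuned on prompt/completion pairs. The field names and the 50/50 mixing ratio are illustrative assumptions, not the paper's recipe.

```python
# Hypothetical sketch: build a mixed SFT dataset so the student can do
# "hybrid reasoning" -- sometimes answering with a full trace, sometimes
# with just the condensed answer.
import random

def build_hybrid_sft_data(examples, condensed_ratio=0.5, seed=0):
    """examples: list of dicts with 'question', 'full_trace', 'short_answer'."""
    rng = random.Random(seed)
    data = []
    for ex in examples:
        if rng.random() < condensed_ratio:
            # Condensed target: skip the chain of thought, answer directly.
            target = ex["short_answer"]
        else:
            # Full target: reasoning trace followed by the final answer.
            target = ex["full_trace"] + "\n" + ex["short_answer"]
        data.append({"prompt": ex["question"], "completion": target})
    return data
```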
Jason Weston (@jaseweston) · 4 days
🍷 Inspired by Sober Reasoning, we conduct robust evaluation, e.g. using 24 seeds for GPQA-Diamond, 16 seeds for MATH-500, and verify the results on different model families (Llama and Qwen) and model sizes (8B to 70B).
- We find that NaturalThoughts outperforms state-of-the-art …
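A sketch of the multi-seed evaluation protocol mentioned above (e.g. 24 seeds for GPQA-Diamond, 16 for MATH-500): report the mean and spread over seeds rather than a single run. `generate_answers` is a hypothetical stub, not an API from the paper.

```python
# Evaluate a model over several random seeds and report mean accuracy
# plus a spread estimate, instead of trusting a single noisy run.
import statistics

def evaluate_with_seeds(model, dataset, num_seeds, generate_answers):
    """generate_answers(model, dataset, seed) -> list of booleans (correct or not)."""
    per_seed_acc = []
    for seed in range(num_seeds):
        correct = generate_answers(model, dataset, seed=seed)
        per_seed_acc.append(sum(correct) / len(correct))
    mean = statistics.mean(per_seed_acc)
    std = statistics.stdev(per_seed_acc) if num_seeds > 1 else 0.0
    return mean, std
```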
Jason Weston (@jaseweston) · 4 days
🌿 Introducing NaturalThoughts 🌿
🎯 Data curation for general reasoning capabilities is still relatively underexplored.
- We systematically compare different metrics for selecting high-quality and diverse reasoning traces in terms of data efficiency in …
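A hypothetical sketch of trace curation in the spirit of the tweet above: score candidate reasoning traces for quality, then greedily keep a diverse subset. The scoring and similarity functions here are stand-ins, not the metrics compared in the paper.

```python
# Select up to `budget` reasoning traces: rank by a quality score and
# drop traces that are too similar to something already selected.

def select_traces(traces, quality_fn, similarity_fn, budget, diversity_threshold=0.8):
    """traces: list of strings; quality_fn(t) -> float; similarity_fn(a, b) -> [0, 1]."""
    ranked = sorted(traces, key=quality_fn, reverse=True)
    selected = []
    for trace in ranked:
        if len(selected) >= budget:
            break
        # Keep only traces sufficiently different from everything kept so far.
        if all(similarity_fn(trace, s) < diversity_threshold for s in selected):
            selected.append(trace)
    return selected
```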
Jason Weston (@jaseweston) · 7 days
Finally, we train jointly on verifiable tasks with rule-based rewards & non-verifiable tasks with reward models.
- This gives improved average results across all tasks compared to optimizing only one objective.
- It also improves non-verifiable evaluations compared to only …
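A sketch of the mixed-reward setup described above: prompts with a known answer get a rule-based reward, prompts without one fall back to a learned reward model's score. The routing, the `reward_model.score` call, and the answer-extraction rule are illustrative assumptions.

```python
# Mixed reward: rule-based check when a reference answer exists (verifiable
# task), otherwise a reward-model score (non-verifiable task).

def extract_final_answer(response):
    # Toy extraction rule: take the last non-empty line as the answer.
    lines = [l.strip() for l in response.splitlines() if l.strip()]
    return lines[-1] if lines else ""

def compute_reward(prompt, response, reference=None, reward_model=None):
    if reference is not None:
        # Verifiable task: exact match against the known answer.
        return 1.0 if extract_final_answer(response) == reference else 0.0
    # Non-verifiable task: score from a learned reward model (hypothetical API).
    return reward_model.score(prompt, response)
```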
Jason Weston (@jaseweston) · 7 days
We find similar results on verifiable tasks as well. Semi-online DPO even performs a little bit better on some tasks. 🧵3/4
Jason Weston (@jaseweston) · 7 days
- Online DPO results in a 59.4% increase in AlpacaEval LC winrate & 56.2% in ArenaHard score compared to standard DPO. Standard DPO is poor due to its offline nature.
- Online DPO achieves comparable performance to online GRPO.
- But more surprisingly, so does semi-online DPO. 🧵2/4
Jason Weston (@jaseweston) · 7 days
🌉 Bridging Offline & Online RL for LLMs 🌉
📝: New paper shows on verifiable & non-verifiable tasks:
- Online DPO & GRPO give similar performance.
- Semi-online (iterative) DPO with sync every s steps (more efficient!) works very well also.
- Offline DPO …
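A minimal sketch of the semi-online (iterative) DPO idea from this thread: sample preference pairs from a snapshot of the policy, take standard DPO updates on them, and re-sync the snapshot every `s` steps. All helpers (`generate_pairs`, `dpo_update`) are hypothetical stubs; with s = 1 this regime collapses to fully online DPO, and never syncing corresponds to offline DPO.

```python
# Semi-online DPO training loop: the generation model is a periodically
# refreshed snapshot of the policy being trained.
import copy

def semi_online_dpo(policy, prompts, total_steps, s, generate_pairs, dpo_update):
    generator = copy.deepcopy(policy)          # frozen snapshot used for sampling
    for step in range(total_steps):
        if step % s == 0:
            generator = copy.deepcopy(policy)  # periodic sync: the "semi-online" part
        batch = generate_pairs(generator, prompts)  # (prompt, chosen, rejected) triples
        dpo_update(policy, batch)                   # standard DPO gradient step
    return policy
```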
Jason Weston (@jaseweston) · 11 days
RT @thao_nguyen26: Web data, the “fossil fuel of AI”, is being exhausted. What’s next? 🤔 We propose Recycling the Web to break the data wall…
Jason Weston (@jaseweston) · 14 days
Reasoning, Attention & Memory Workshop @ COLM. Submission Deadline: June 23, 2025 -- Today!
Jason Weston (@jaseweston) · 2 months
🚨 Announcing RAM 2 workshop @ COLM25 - call for papers 🚨
- 10 years on, we present the sequel to the classic RAM 🐏 (Reasoning, Attention, Memory) workshop that took place in 2015 at the cusp of major change in the area. Now in 2025 we reflect on what's happened and discuss the …
Jason Weston (@jaseweston) · 1 month
RT @chhaviyadav_: Upon graduation, I paused to reflect on what my PhD had truly taught me. Was it just how to write papers, respond to brut…
Jason Weston (@jaseweston) · 1 month
RT @YifeiZhou02: 📢 New Preprint: Self-Challenging Agent (SCA) 📢. It’s costly to scale agent tasks with reliable verifiers. In SCA, the key…
Jason Weston (@jaseweston) · 1 month
RT @tesatory: The idea of challenging yourself has a long history, e.g. our Asymmetric Self-Play paper
Jason Weston (@jaseweston) · 1 month
SCA also demonstrates strong potential for scaling: we found that scaling the number of self-synthesized tasks is more effective than scaling the number of trajectories for out-of-distribution generalization. Thanks for reading! Check the paper for more: 🧵4/4
Jason Weston (@jaseweston) · 1 month
During RL training, SCA generates solutions & assigns rewards using the self-synthesized evaluation function. Experiments show it doubles performance over the previous SoTA method on self-improvement. It also achieves significant improvements in the distillation setting. 🧵3/4
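A sketch of the RL step described in the tweet above: the agent samples a solution for a self-generated task, and that task's own self-synthesized verification function assigns the reward. The function and field names are illustrative assumptions, not the paper's code.

```python
# One RL step where the reward comes from the task's self-synthesized verifier.

def sca_rl_step(policy, task, sample_solution, rl_update):
    """task: dict with 'instruction' and 'verify' (a callable returning True/False)."""
    trajectory = sample_solution(policy, task["instruction"])   # agent attempts the task
    reward = 1.0 if task["verify"](trajectory) else 0.0         # self-generated verifier as reward
    rl_update(policy, trajectory, reward)                       # any policy-gradient style update
    return reward
```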
Jason Weston (@jaseweston) · 1 month
SCA creates new tasks in our “Code-as-Task” formalism. It gathers info by interacting with the environment & then generates a task instruction, verification function, example solution and failure cases. It then tests these by code execution to keep only high-quality tasks. 🧵2/4
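A hypothetical sketch of the "Code-as-Task" filtering step described above: a proposed task bundles an instruction, a verification function, an example solution, and failure cases, and it is kept only if the executed verifier accepts the example solution and rejects every failure case. The dict layout is an assumption for illustration.

```python
# Keep a self-proposed task only if its verifier behaves sensibly under execution.

def keep_task(task):
    """task: dict with 'verify' (callable), 'example_solution', 'failure_cases'."""
    try:
        if not task["verify"](task["example_solution"]):
            return False                       # verifier must accept the reference solution
        return all(not task["verify"](bad) for bad in task["failure_cases"])
    except Exception:
        return False                           # tasks whose code errors out are discarded

def filter_tasks(proposed_tasks):
    return [t for t in proposed_tasks if keep_task(t)]
```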
Jason Weston (@jaseweston) · 1 month
🚨 Self-Challenging Language Model Agents 🚨
📝: A new paradigm to train LLM agents to use different tools with challenging self-generated data ONLY: Self-challenging agents (SCA) both propose new tasks and solve them, using self-generated verifiers to …
Jason Weston (@jaseweston) · 1 month
Interesting work! Also provides additional evidence that our ScPO (Self-Consistency Preference Optimization) direction (Maj vote-based rewards, see fig below) works quite well -- without any labels. @ArchikiPrasad
Stella Li (@StellaLisy) · 1 month
🤯 We cracked RLVR with… Random Rewards?!
Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work ⁉️ Here's why: 🧵
Blogpost:
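A sketch of a majority-vote (self-consistency) reward in the spirit of the ScPO direction mentioned above: sample several answers per prompt, treat the most frequent final answer as the pseudo-label, and reward agreement with it, with no gold labels used anywhere. `sample_answers` is a hypothetical generation stub.

```python
# Label-free rewards from self-consistency: agree with the majority answer, get 1.
from collections import Counter

def majority_vote_rewards(model, prompt, sample_answers, n_samples=8):
    answers = sample_answers(model, prompt, n=n_samples)   # list of final answers (strings)
    majority, _ = Counter(answers).most_common(1)[0]        # most frequent answer = pseudo-label
    # Reward 1.0 for samples agreeing with the majority, 0.0 otherwise.
    return [(a, 1.0 if a == majority else 0.0) for a in answers]
```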
Jason Weston (@jaseweston) · 2 months
RT @kchonyc: oh… what a memory! thanks, @jaseweston et al., for organizing this sequel! looking back to my slide deck then… i was righ…