Jason Weston

@jaseweston

Followers: 13K · Following: 2K · Media: 168 · Statuses: 437

@MetaAI + NYU. NLP from scratch (Pretrain+FT LLM) 2008, MemNets (pre-Transformer) 2015, DrQA (pre-RAG) 2017, BlenderBot (dialog, pre-ChatGPT) 2018+, Self-Reward + more!

NYC
Joined April 2008
Jason Weston (@jaseweston) · 2 days
RT @danieljwkim: Can we improve Llama 3’s reasoning abilities through post-training only? Introducing ASTRO, our new framework that teaches…
Jason Weston (@jaseweston) · 4 days
🔁 We also find that distillation based on reasoning difficulty can improve the Pareto frontier of the student model’s inference efficiency.
- Training with a mix of full reasoning traces and the condensed answers enables efficient hybrid reasoning in the student model, by …
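A minimal sketch of how such a mixed distillation set might be assembled, assuming the student is fine-tuned on prompt/completion pairs. The field names and the 50/50 mixing ratio are illustrative assumptions, not the paper's recipe.

```python
# Hypothetical sketch: build a mixed SFT dataset so the student can do
# "hybrid reasoning" -- sometimes answering with a full trace, sometimes
# with just the condensed answer.
import random

def build_hybrid_sft_data(examples, condensed_ratio=0.5, seed=0):
    """examples: list of dicts with 'question', 'full_trace', 'short_answer'."""
    rng = random.Random(seed)
    data = []
    for ex in examples:
        if rng.random() < condensed_ratio:
            # Condensed target: skip the chain of thought, answer directly.
            target = ex["short_answer"]
        else:
            # Full target: reasoning trace followed by the final answer.
            target = ex["full_trace"] + "\n" + ex["short_answer"]
        data.append({"prompt": ex["question"], "completion": target})
    return data
```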
Jason Weston (@jaseweston) · 4 days
🍷 Inspired by Sober Reasoning, we conduct robust evaluation, e.g. using 24 seeds for GPQA-Diamond, 16 seeds for MATH-500, and verify the results on different model families (Llama and Qwen) and model sizes (8B to 70B).
- We find that NaturalThoughts outperforms state-of-the-art …
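A sketch of the multi-seed evaluation protocol mentioned above (e.g. 24 seeds for GPQA-Diamond, 16 for MATH-500): report the mean and spread over seeds rather than a single run. `generate_answers` is a hypothetical stub, not an API from the paper.

```python
# Evaluate a model over several random seeds and report mean accuracy
# plus a spread estimate, instead of trusting a single noisy run.
import statistics

def evaluate_with_seeds(model, dataset, num_seeds, generate_answers):
    """generate_answers(model, dataset, seed) -> list of booleans (correct or not)."""
    per_seed_acc = []
    for seed in range(num_seeds):
        correct = generate_answers(model, dataset, seed=seed)
        per_seed_acc.append(sum(correct) / len(correct))
    mean = statistics.mean(per_seed_acc)
    std = statistics.stdev(per_seed_acc) if num_seeds > 1 else 0.0
    return mean, std
```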
Jason Weston (@jaseweston) · 4 days
🌿 Introducing NaturalThoughts 🌿
🎯 Data curation for general reasoning capabilities is still relatively underexplored.
- We systematically compare different metrics for selecting high-quality and diverse reasoning traces in terms of data efficiency in …
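A hypothetical sketch of trace curation in the spirit of the tweet above: score candidate reasoning traces for quality, then greedily keep a diverse subset. The scoring and similarity functions here are stand-ins, not the metrics compared in the paper.

```python
# Select up to `budget` reasoning traces: rank by a quality score and
# drop traces that are too similar to something already selected.

def select_traces(traces, quality_fn, similarity_fn, budget, diversity_threshold=0.8):
    """traces: list of strings; quality_fn(t) -> float; similarity_fn(a, b) -> [0, 1]."""
    ranked = sorted(traces, key=quality_fn, reverse=True)
    selected = []
    for trace in ranked:
        if len(selected) >= budget:
            break
        # Keep only traces sufficiently different from everything kept so far.
        if all(similarity_fn(trace, s) < diversity_threshold for s in selected):
            selected.append(trace)
    return selected
```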
Jason Weston (@jaseweston) · 7 days
Finally, we train jointly on verifiable tasks with rule-based rewards & non-verifiable tasks with reward models.
- This gives improved average results across all tasks compared to optimizing only one objective.
- It also improves non-verifiable evaluations compared to only …
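A sketch of the mixed-reward setup described above: prompts with a known answer get a rule-based reward, prompts without one fall back to a learned reward model's score. The routing, the `reward_model.score` call, and the answer-extraction rule are illustrative assumptions.

```python
# Mixed reward: rule-based check when a reference answer exists (verifiable
# task), otherwise a reward-model score (non-verifiable task).

def extract_final_answer(response):
    # Toy extraction rule: take the last non-empty line as the answer.
    lines = [l.strip() for l in response.splitlines() if l.strip()]
    return lines[-1] if lines else ""

def compute_reward(prompt, response, reference=None, reward_model=None):
    if reference is not None:
        # Verifiable task: exact match against the known answer.
        return 1.0 if extract_final_answer(response) == reference else 0.0
    # Non-verifiable task: score from a learned reward model (hypothetical API).
    return reward_model.score(prompt, response)
```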
Jason Weston (@jaseweston) · 7 days
We find similar results on verifiable tasks as well. Semi-online DPO even performs a little bit better on some tasks. 🧵3/4
Jason Weston (@jaseweston) · 7 days
- Online DPO results in a 59.4% increase in AlpacaEval LC winrate & 56.2% in ArenaHard score compared to standard DPO. Standard DPO is poor due to its offline nature.
- Online DPO achieves comparable performance to online GRPO.
- But more surprisingly, so does semi-online DPO. 🧵2/4
Jason Weston (@jaseweston) · 7 days
🌉 Bridging Offline & Online RL for LLMs 🌉
📝: New paper shows on verifiable & non-verifiable tasks:
- Online DPO & GRPO give similar performance.
- Semi-online (iterative) DPO with sync every s steps (more efficient!) works very well also.
- Offline DPO …
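A minimal sketch of the semi-online (iterative) DPO idea from this thread: sample preference pairs from a snapshot of the policy, take standard DPO updates on them, and re-sync the snapshot every `s` steps. All helpers (`generate_pairs`, `dpo_update`) are hypothetical stubs; with s = 1 this regime collapses to fully online DPO, and never syncing corresponds to offline DPO.

```python
# Semi-online DPO training loop: the generation model is a periodically
# refreshed snapshot of the policy being trained.
import copy

def semi_online_dpo(policy, prompts, total_steps, s, generate_pairs, dpo_update):
    generator = copy.deepcopy(policy)          # frozen snapshot used for sampling
    for step in range(total_steps):
        if step % s == 0:
            generator = copy.deepcopy(policy)  # periodic sync: the "semi-online" part
        batch = generate_pairs(generator, prompts)  # (prompt, chosen, rejected) triples
        dpo_update(policy, batch)                   # standard DPO gradient step
    return policy
```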
Jason Weston (@jaseweston) · 11 days
RT @thao_nguyen26: Web data, the “fossil fuel of AI”, is being exhausted. What’s next? 🤔 We propose Recycling the Web to break the data wall…
Jason Weston (@jaseweston) · 14 days
Reasoning, Attention & Memory Workshop @ COLM. Submission Deadline: June 23, 2025 -- Today!
Jason Weston (@jaseweston) · 2 months
🚨 Announcing RAM 2 workshop @ COLM25 - call for papers 🚨
- 10 years on, we present the sequel to the classic RAM 🐏 (Reasoning, Attention, Memory) workshop that took place in 2015 at the cusp of major change in the area. Now in 2025 we reflect on what's happened and discuss the …
Jason Weston (@jaseweston) · 1 month
RT @chhaviyadav_: Upon graduation, I paused to reflect on what my PhD had truly taught me. Was it just how to write papers, respond to brut…
Jason Weston (@jaseweston) · 1 month
RT @YifeiZhou02: 📢 New Preprint: Self-Challenging Agent (SCA) 📢. It’s costly to scale agent tasks with reliable verifiers. In SCA, the key…
Jason Weston (@jaseweston) · 1 month
RT @tesatory: The idea of challenging yourself has a long history, e.g. our Asymmetric Self-Play paper
Jason Weston (@jaseweston) · 1 month
SCA also demonstrates strong potential for scaling: we found that scaling the number of self-synthesized tasks is more effective than scaling the number of trajectories for out-of-distribution generalization. Thanks for reading! Check the paper for more: 🧵4/4
Jason Weston (@jaseweston) · 1 month
During RL training, SCA generates solutions & assigns rewards using the self-synthesized evaluation function. Experiments show it doubles performance over the previous SoTA method on self-improvement. It also achieves significant improvements in the distillation setting. 🧵3/4
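A sketch of the RL step described in the tweet above: the agent samples a solution for a self-generated task, and that task's own self-synthesized verification function assigns the reward. The function and field names are illustrative assumptions, not the paper's code.

```python
# One RL step where the reward comes from the task's self-synthesized verifier.

def sca_rl_step(policy, task, sample_solution, rl_update):
    """task: dict with 'instruction' and 'verify' (a callable returning True/False)."""
    trajectory = sample_solution(policy, task["instruction"])   # agent attempts the task
    reward = 1.0 if task["verify"](trajectory) else 0.0         # self-generated verifier as reward
    rl_update(policy, trajectory, reward)                       # any policy-gradient style update
    return reward
```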
Jason Weston (@jaseweston) · 1 month
SCA creates new tasks in our “Code-as-Task” formalism. It gathers info by interacting with the environment & then generates a task instruction, verification function, example solution and failure cases. It then tests these by code execution to keep only high-quality tasks. 🧵2/4
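A hypothetical sketch of the "Code-as-Task" filtering step described above: a proposed task bundles an instruction, a verification function, an example solution, and failure cases, and it is kept only if the executed verifier accepts the example solution and rejects every failure case. The dict layout is an assumption for illustration.

```python
# Keep a self-proposed task only if its verifier behaves sensibly under execution.

def keep_task(task):
    """task: dict with 'verify' (callable), 'example_solution', 'failure_cases'."""
    try:
        if not task["verify"](task["example_solution"]):
            return False                       # verifier must accept the reference solution
        return all(not task["verify"](bad) for bad in task["failure_cases"])
    except Exception:
        return False                           # tasks whose code errors out are discarded

def filter_tasks(proposed_tasks):
    return [t for t in proposed_tasks if keep_task(t)]
```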
Jason Weston (@jaseweston) · 1 month
🚨 Self-Challenging Language Model Agents 🚨
📝: A new paradigm to train LLM agents to use different tools with challenging self-generated data ONLY: Self-challenging agents (SCA) both propose new tasks and solve them, using self-generated verifiers to …
Jason Weston (@jaseweston) · 1 month
Interesting work! Also provides additional evidence that our ScPO (Self-Consistency Preference Optimization) direction (Maj vote-based rewards, see fig below) works quite well -- without any labels. @ArchikiPrasad
Stella Li (@StellaLisy) · 1 month
🤯 We cracked RLVR with… Random Rewards?!
Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work ⁉️ Here's why: 🧵
Blogpost:
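A sketch of a majority-vote (self-consistency) reward in the spirit of the ScPO direction mentioned above: sample several answers per prompt, treat the most frequent final answer as the pseudo-label, and reward agreement with it, with no gold labels used anywhere. `sample_answers` is a hypothetical generation stub.

```python
# Label-free rewards from self-consistency: agree with the majority answer, get 1.
from collections import Counter

def majority_vote_rewards(model, prompt, sample_answers, n_samples=8):
    answers = sample_answers(model, prompt, n=n_samples)   # list of final answers (strings)
    majority, _ = Counter(answers).most_common(1)[0]        # most frequent answer = pseudo-label
    # Reward 1.0 for samples agreeing with the majority, 0.0 otherwise.
    return [(a, 1.0 if a == majority else 0.0) for a in answers]
```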
Jason Weston (@jaseweston) · 2 months
RT @kchonyc: oh… what a memory! thanks, @jaseweston et al., for organizing this sequel! looking back to my slide deck then… i was righ…