Swarnadeep Saha
@swarnaNLP
Followers
2K
Following
1K
Media
57
Statuses
636
Research Scientist @AIatMeta (FAIR) working on Reasoning. Past: @Google PhD fellow @uncnlp. Gooner.
Seattle, Washington
Joined May 2014
Progress of AI is bottlenecked by the quality of evaluation, motivating the need for powerful and generalist LLM judges that can think and reason. Here's our latest paper, J1, on how to train such Thinking-LLM-Judges with RL. 🧵👇
🚨 New paper 🚨 J1: Incentivizing Thinking in LLM-as-a-Judge via RL - Converts the judgment task into a verifiable one for both verifiable and non-verifiable prompts. Uses only synthetic pairwise data - Optimizes thoughts, scores, and judgments using GRPO - Outperforms all
2
2
60
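The J1 announcement above hinges on making judging verifiable: because the synthetic pairs are constructed so the better response is known, the judge's final verdict can be checked and turned into an RL reward. A minimal sketch of that idea; the `[[A]]`/`[[B]]` verdict format and the 0/1 scoring are my assumptions, not the paper's exact recipe.

```python
# Sketch: turn a pairwise judgment into a verifiable 0/1 reward for RL (e.g. GRPO).
# The "[[A]]"/"[[B]]" parsing convention is an assumption, not J1's exact format.
import re

def verdict_reward(judge_output: str, gold_preferred: str) -> float:
    """Return 1.0 if the judge's final verdict matches the known-better response."""
    match = re.search(r"\[\[(A|B)\]\]", judge_output)
    if match is None:
        return 0.0  # unparseable verdicts earn no reward
    return 1.0 if match.group(1) == gold_preferred else 0.0

# Example: a synthetic pair where response "A" is known to be better by construction.
rollout = "...thinking about both answers... Final verdict: [[A]]"
print(verdict_reward(rollout, gold_preferred="A"))  # 1.0
```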
🌶️SPICE: Self-Play in Corpus Environments🌶️ 📄: https://t.co/ZGBBlV1Vnc - Challenger creates tasks based on *corpora* - Reasoner solves them - Both trained together -> automatic curriculum! 🔥 Outperforms standard (ungrounded) self-play. Grounding fixes hallucination & lack of
7
55
318
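The SPICE tweet describes a challenger/reasoner loop grounded in a corpus. Below is a rough skeleton of what one round of such corpus-grounded self-play could look like; the object methods (`propose_task`, `solve`, `update`) and the simple adversarial reward split are placeholders, not the SPICE implementation.

```python
# Skeleton of one round of corpus-grounded self-play in the spirit of the tweet.
# Challenger and reasoner are placeholder objects; the reward scheme shown here
# (reasoner rewarded for solving, challenger for stumping it) is an assumption.
import random

def self_play_round(corpus, challenger, reasoner):
    doc = random.choice(corpus)                       # ground the task in real text
    task, reference = challenger.propose_task(doc)    # challenger writes a question + answer key
    attempt = reasoner.solve(task)                    # reasoner tries to solve it
    solved = attempt.strip() == reference.strip()     # grounded check against the document
    reasoner.update(reward=1.0 if solved else 0.0)
    challenger.update(reward=0.0 if solved else 1.0)  # challenger rewarded for being challenging
```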
Yi Lin is a brilliant researcher with an abundance of knowledge in different aspects of LLM/VLM training. Hire him!
Tough week! I also got impacted less than 3 months after joining. Ironically, I just landed some new RL infra features the day before. Life moves on. My past work spans RL, PEFT, Quantization, and Multimodal LLMs. If your team is working on these areas, I'd love to connect.
0
4
24
I was impacted by Meta layoffs today. As a Research Scientist working on LLM post-training (reward models, DPO/GRPO) & automated evaluation pipelines, I've focused on understanding why/where models fail & how to make them better. I'm looking for opportunities; please reach out!
123
236
3K
Hybrid Reinforcement (HERO): When Reward Is Sparse, It's Better to Be Dense 🦸‍♂️ 💪 📄: https://t.co/VAXtSC4GGp - HERO bridges 0-1 verifiable rewards and dense reward models into one 'hybrid' RL method - Tackles the brittleness of binary signals and the noise of pure reward
4
53
325
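The HERO tweet is about combining a sparse 0-1 verifier signal with a dense reward-model score. A toy illustration of one way such a hybrid reward could be shaped; the additive form and weighting below are assumptions, not HERO's actual formulation.

```python
# Toy "hybrid" reward: keep the trustworthy 0-1 verifier signal and add a small
# dense reward-model term for shaping. Weighting is illustrative, not HERO's.
def hybrid_reward(verifier_correct: bool, rm_score: float, alpha: float = 0.1) -> float:
    """rm_score is assumed to be normalized to [0, 1]."""
    sparse = 1.0 if verifier_correct else 0.0
    return sparse + alpha * rm_score  # dense term is small, so it cannot flip the verifier

print(hybrid_reward(True, 0.8))   # 1.08
print(hybrid_reward(False, 0.9))  # 0.09 -- still below any verified-correct rollout
```

The small dense term preserves the verifier's ranking while still giving gradient signal among rollouts that all fail (or all pass) verification.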
Hope all attendees enjoyed the workshop as much as we did in organizing it!
Was super fun to organize this workshop!! Thanks everyone: speakers, panelists, audience. https://t.co/ccZzIXFgTY
0
0
14
🚨 "Think the right amount" for improving both reasoning accuracy and efficiency! --> Large reasoning models under-adapt = underthink on hard problems and overthink on easy ones --> ✨TRAAC✨ is an online RL, difficulty-adaptive, attention-based compression method that prunes
🚨 Excited to announce TRAAC, an online difficulty-adaptive, attention-based method that handles the tradeoff of under & overthinking in reasoning models to improve both accuracy and efficiency. Underthinking: Models terminate reasoning too early on harder problems, leading
1
17
78
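TRAAC is described above as difficulty-adaptive, attention-based compression of the reasoning trace. Here is a cartoon-level sketch of that idea: keep the most-attended steps, and keep more of them when the problem looks hard. The scoring, keep-ratio schedule, and step granularity are all illustrative assumptions, not TRAAC's algorithm.

```python
# Illustrative only: prune reasoning steps with low attention scores, keeping more
# steps when the problem is estimated to be hard.
import numpy as np

def compress_trace(steps, attn_scores, difficulty):
    """steps: list of reasoning steps; attn_scores: importance per step; difficulty in [0, 1]."""
    keep_ratio = 0.3 + 0.6 * difficulty             # easy -> prune aggressively, hard -> keep most
    k = max(1, int(round(keep_ratio * len(steps))))
    keep_idx = np.argsort(attn_scores)[-k:]         # indices of the k most-attended steps
    return [steps[i] for i in sorted(keep_idx)]     # preserve original step order

steps = ["restate", "plan", "case 1", "case 2", "arithmetic", "final check"]
scores = np.array([0.05, 0.30, 0.10, 0.25, 0.20, 0.10])
print(compress_trace(steps, scores, difficulty=0.2))  # heavily pruned for an easy problem
```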
(🧵) Today, we release Meta Code World Model (CWM), a 32-billion-parameter dense LLM that enables novel research on improving code generation through agentic reasoning and planning with world models. https://t.co/BJSUCh2vtg
60
311
2K
Turns out we can use RLVR to teach a model to aggregate multiple solutions. Check out our new work on parallel test-time scaling! 👇
New Test-time scaling method 📄: https://t.co/yqWvOMZpwq - Use RL to train an LLM solution aggregator - Reasons, reviews, reconciles, and synthesizes a final solution -> Much better than existing techniques! - Simple new method. Strong results across 4 math benchmarks. 🧵1/5
0
0
19
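The aggregator tweet describes sampling several solutions in parallel and training a second model with RLVR to reason over them and synthesize one final answer. A sketch of that flow, with `generate` and `check_answer` as hypothetical helpers rather than the paper's API.

```python
# Sketch of parallel test-time scaling with a learned aggregator: sample several
# candidates, then have an aggregator review, reconcile, and synthesize a final
# solution. The prompt format and helper callables are hypothetical stand-ins.
def aggregate(problem, generate, check_answer, n_candidates=4):
    candidates = [generate(f"Solve: {problem}") for _ in range(n_candidates)]
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    final = generate(
        f"Problem: {problem}\nCandidate solutions:\n{numbered}\n"
        "Review, reconcile, and synthesize a single final solution."
    )
    # During RLVR training, the aggregator's reward is simply whether the
    # synthesized answer verifies as correct.
    reward = 1.0 if check_answer(problem, final) else 0.0
    return final, reward
```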
Post-training with RL causes diversity collapse!! We found a way to directly incorporate semantic diversity as an additional reward that improves both quality and diversity of outputs. 👇
Diversity Aware RL (DARLING) 📄: https://t.co/MH0tui34Cb - Jointly optimizes for quality & diversity using a learned partition function - Outperforms standard RL in quality AND diversity metrics, e.g. higher pass@1/p@k - Works for both non-verifiable & verifiable tasks 🧵1/5
0
0
10
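DARLING adds a semantic-diversity term to the RL reward via a learned partition function. The sketch below swaps that learned component for a crude embedding-distance proxy, purely to show the shape of a quality-times-diversity reward; it is not the paper's method.

```python
# Cartoon version of a diversity-aware reward: a rollout scores high only if it is
# both good and different from the other rollouts in its group. The embedding
# proxy replaces DARLING's learned partition function for illustration only.
import numpy as np

def diversity_bonus(embeddings: np.ndarray, i: int) -> float:
    """Mean cosine distance between rollout i and the other rollouts in its group."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ e[i]
    others = np.delete(sims, i)
    return float(1.0 - others.mean())

def diversity_aware_reward(quality: float, embeddings: np.ndarray, i: int) -> float:
    # Multiplicative combination: quality alone is not enough to score high.
    return quality * diversity_bonus(embeddings, i)
```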
Excited to share that MAgICoRe has been accepted to #EMNLP2025 main! 🎉 Our work identifies 3 key challenges in LLM refinement for reasoning: 1) Over-correction on easy problems 2) Failure to localize and fix its own errors 3) Too few refinement iterations for harder problems
Aggregation & refinement improve LLM reasoning, but aggregation saturates, while refinement has 3 issues: 1) over-correction for easy problems 2) fails to localize+fix its own errors 3) insufficient number of refinement iterations for hard problems 🚨Multi-Agent, Iterative,
0
40
97
Great @AIatMeta paper. Builds a single test that shows when LLMs think too much or too little, then scores both. It targets a gap: reasoning models ramble on easy questions while fast models miss steps on hard ones. They release a benchmark called OptimalThinkingBench with
4
13
60
Thank you for building on our overthinking and underthinking research! OptimalThinkingBench provides exactly what the field needs - a unified framework to measure the sweet spot between excessive and insufficient reasoning. The finding that current methods improve one aspect
🤔Introducing OptimalThinkingBench 🤔 📄: https://t.co/aufQVJp8aC - Thinking LLMs use a lot of tokens & overthink; non-thinking LLMs underthink & underperform. - We introduce a benchmark which scores models in the quest to find the best mix. - OptimalThinkingBench reports the F1
2
2
19
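The OptimalThinkingBench tweet cuts off at "reports the F1"; the cut-off sentence suggests a single score combining an overthinking-side measure on easy queries with an underthinking-side measure on hard queries. A tiny worked example of such a harmonic-mean combination, with made-up sub-scores; the exact sub-score definitions are assumptions.

```python
# Combine an "doesn't overthink easy queries" score and a "doesn't underthink
# hard queries" score with an F1 (harmonic mean). Sub-scores below are made up.
def f1(a: float, b: float) -> float:
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)

overthinking_score = 0.70   # e.g. efficiency/accuracy on simple queries
underthinking_score = 0.55  # e.g. accuracy on reasoning-heavy queries
print(f1(overthinking_score, underthinking_score))  # ~0.616
```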
Got a new efficient/optimally-thinking LLM? Does your model answer simple queries quickly and spend compute on the harder ones? Test it on our new benchmark, OptimalThinkingBench! 👇 Work led by the amazing @PranjalAggarw16 during their internship!
🤔Introducing OptimalThinkingBench 🤔 📄: https://t.co/aufQVJp8aC - Thinking LLMs use a lot of tokens & overthink; non-thinking LLMs underthink & underperform. - We introduce a benchmark which scores models in the quest to find the best mix. - OptimalThinkingBench reports the F1
0
10
79
...is today a good day for new paper posts? 🤔Learning to Reason for Factuality 🤔 📄: https://t.co/ss09xKGcAm - New reward func for GRPO training of long CoTs for *factuality* - Design stops reward hacking by favoring precision, detail AND quality - Improves base model across
1
50
386
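The factuality tweet says the reward is designed to favor precision, detail, and quality together, so the model cannot hack it by making fewer (or vaguer) claims. A toy reward with that property; the components and weights are illustrative, not the paper's function.

```python
# Toy factuality-shaped reward: precision, detail, and quality are multiplied so
# that shrinking the response cannot buy a higher score. Illustrative only.
def factuality_reward(num_claims: int, num_supported: int, quality: float) -> float:
    if num_claims == 0:
        return 0.0                           # refusing to make claims earns nothing
    precision = num_supported / num_claims   # fraction of claims that are supported
    detail = min(1.0, num_claims / 20.0)     # saturating credit for saying more
    return precision * detail * quality

print(factuality_reward(num_claims=10, num_supported=9, quality=0.8))  # 0.36
```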
We worked on a whole line of research on this: - Self-Rewarding LMs (use self as a Judge in semi-online DPO): https://t.co/YZRuFphh0H - Thinking LLMs (learn CoTs with a Judge with semi-online DPO): https://t.co/WVpz7mF7qH *poster at ICML this week!!* - Mix verifiable &
arxiv.org
LLMs are typically trained to answer user questions or follow instructions similarly to how human experts respond. However, in the standard alignment framework they lack the basic ability of...
0
23
199
I'm gonna be at #ICML2025 next week to present EvalPlanner (Thursday between 4.30-7 pm). Please reach out if you'd like to talk about reward models, reasoning, synthetic data, and generally the research we're doing in FAIR.
💭 Introducing EvalPlanner - a method to train a Thinking-LLM-as-a-Judge that learns to generate planning & reasoning CoTs for evaluation. Strong performance on RewardBench, RM-Bench, JudgeBench & FollowBenchEval. Paper 📄: https://t.co/5xySXdlFx9
0
6
63
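EvalPlanner, as described in the quoted tweet, trains a judge to first produce an evaluation plan and then reasoning that executes it. A sketch of what such a two-stage judging call could look like at inference time; `generate` is a hypothetical text-generation callable and the prompts are not the paper's.

```python
# Sketch of plan-then-execute judging: draft an evaluation plan for the
# instruction, then reason through that plan over both responses.
def plan_then_judge(generate, instruction, response_a, response_b):
    plan = generate(
        f"Instruction: {instruction}\n"
        "Write a step-by-step evaluation plan for judging a response to this instruction."
    )
    reasoning = generate(
        f"Instruction: {instruction}\nPlan:\n{plan}\n"
        f"Response A:\n{response_a}\nResponse B:\n{response_b}\n"
        "Execute the plan step by step, then conclude with the better response: [[A]] or [[B]]."
    )
    return plan, reasoning
```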
Check out our new paper where we compared offline and (Semi-)Online DPO with GRPO for post-training LLMs. This led to some interesting findings! 👇
Bridging Offline & Online RL for LLMs 📄: https://t.co/G12TS6Z84n New paper shows on verifiable & non-verifiable tasks: - Online DPO & GRPO give similar performance. - Semi-online (iterative) DPO with sync every s steps (more efficient!) works very well also. - Offline DPO
1
1
8
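The paper tweet contrasts offline, semi-online (sync every s steps), and online DPO. A skeleton of the semi-online loop under those definitions: s = 1 roughly recovers the online setting, while very large s approaches the offline one. `snapshot`, `generate_preference_pairs`, and `dpo_update` are placeholder helpers, not a library API.

```python
# Skeleton of "semi-online" DPO: preference pairs are sampled from a frozen
# snapshot of the policy, which is re-synced only every s steps.
def semi_online_dpo(policy, prompts, s: int, num_steps: int):
    generator = snapshot(policy)                 # frozen copy used for sampling
    for step in range(num_steps):
        if step % s == 0:
            generator = snapshot(policy)         # periodic re-sync with the trained policy
        batch = generate_preference_pairs(generator, prompts)
        dpo_update(policy, batch)                # standard DPO update on the fresh-ish pairs
    return policy
```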
3. J1 Introduces a novel training approach for LLMs to act as evaluators (LLM-as-a-Judge) by explicitly incentivizing thoughtful reasoning during judgment. https://t.co/K19KD6cKlG
🚨 New paper 🚨 J1: Incentivizing Thinking in LLM-as-a-Judge via RL - Converts the judgment task into a verifiable one for both verifiable and non-verifiable prompts. Uses only synthetic pairwise data - Optimizes thoughts, scores, and judgments using GRPO - Outperforms all
1
1
13
For people who don't like Claude's behavior here (and I think it's totally valid to disagree with it), I encourage you to describe your own recommended policy for what agentic models should do when users ask them to help commit heinous crimes. Your options are (1) actively try to
123
41
714
Evaluation of LLMs is difficult due to judge models using limited reasoning and suffering from biases. This paper proposes J1, a method using reinforcement learning to train LLM judges for improved thinking and reduced bias. Methods 🧠: Convert judgment tasks, even
0
1
10