Swarnadeep Saha
@swarnaNLP
Followers
2K
Following
1K
Media
57
Statuses
636
Research Scientist @AIatMeta (FAIR) working on Reasoning. Past: @Google PhD fellow @uncnlp. Gooner.
Seattle, Washington
Joined May 2014
Progress of AI is bottlenecked by the quality of evaluation, motivating the need for powerful and generalist LLM judges that can think and reason. Here's our latest paper, J1, on how to train such Thinking-LLM-Judges with RL. 🧵👇
🚨 New paper 🚨 J1: Incentivizing Thinking in LLM-as-a-Judge via RL - Converts the judgment task into a verifiable one for both verifiable and non-verifiable prompts. Uses only synthetic pairwise data - Optimizes thoughts, scores, and judgments using GRPO - Outperforms all
2
2
60
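The J1 announcement above hinges on making judging verifiable: because the synthetic pairs are constructed so the better response is known, the judge's final verdict can be checked and turned into an RL reward. A minimal sketch of that idea; the `[[A]]`/`[[B]]` verdict format and the 0/1 scoring are my assumptions, not the paper's exact recipe.

```python
# Sketch: turn a pairwise judgment into a verifiable 0/1 reward for RL (e.g. GRPO).
# The "[[A]]"/"[[B]]" parsing convention is an assumption, not J1's exact format.
import re

def verdict_reward(judge_output: str, gold_preferred: str) -> float:
    """Return 1.0 if the judge's final verdict matches the known-better response."""
    match = re.search(r"\[\[(A|B)\]\]", judge_output)
    if match is None:
        return 0.0  # unparseable verdicts earn no reward
    return 1.0 if match.group(1) == gold_preferred else 0.0

# Example: a synthetic pair where response "A" is known to be better by construction.
rollout = "...thinking about both answers... Final verdict: [[A]]"
print(verdict_reward(rollout, gold_preferred="A"))  # 1.0
```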
🌶️SPICE: Self-Play in Corpus Environments🌶️ 📄: https://t.co/ZGBBlV1Vnc - Challenger creates tasks based on *corpora* - Reasoner solves them - Both trained together -> automatic curriculum! 🔥 Outperforms standard (ungrounded) self-play. Grounding fixes hallucination & lack of
7
55
318
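The SPICE tweet describes a challenger/reasoner loop grounded in a corpus. Below is a rough skeleton of what one round of such corpus-grounded self-play could look like; the object methods (`propose_task`, `solve`, `update`) and the simple adversarial reward split are placeholders, not the SPICE implementation.

```python
# Skeleton of one round of corpus-grounded self-play in the spirit of the tweet.
# Challenger and reasoner are placeholder objects; the reward scheme shown here
# (reasoner rewarded for solving, challenger for stumping it) is an assumption.
import random

def self_play_round(corpus, challenger, reasoner):
    doc = random.choice(corpus)                       # ground the task in real text
    task, reference = challenger.propose_task(doc)    # challenger writes a question + answer key
    attempt = reasoner.solve(task)                    # reasoner tries to solve it
    solved = attempt.strip() == reference.strip()     # grounded check against the document
    reasoner.update(reward=1.0 if solved else 0.0)
    challenger.update(reward=0.0 if solved else 1.0)  # challenger rewarded for being challenging
```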
Yi Lin is a brilliant researcher with an abundance of knowledge in different aspects of LLM/VLM training. Hire him!
Tough week! I also got impacted less than 3 months after joining. Ironically, I just landed some new RL infra features the day before. Life moves on. My past work spans RL, PEFT, Quantization, and Multimodal LLMs. If your team is working on these areas, I'd love to connect.
0
4
24
I was impacted by Meta layoffs today. As a Research Scientist working on LLM post-training (reward models, DPO/GRPO) & automated evaluation pipelines, I've focused on understanding why/where models fail & how to make them better. I'm looking for opportunities; please reach out!
123
236
3K
Hybrid Reinforcement (HERO): When Reward Is Sparse, It's Better to Be Dense 🦸‍♂️ 💪 📄: https://t.co/VAXtSC4GGp - HERO bridges 0-1 verifiable rewards and dense reward models into one 'hybrid' RL method - Tackles the brittleness of binary signals and the noise of pure reward
4
53
325
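The HERO tweet is about combining a sparse 0-1 verifier signal with a dense reward-model score. A toy illustration of one way such a hybrid reward could be shaped; the additive form and weighting below are assumptions, not HERO's actual formulation.

```python
# Toy "hybrid" reward: keep the trustworthy 0-1 verifier signal and add a small
# dense reward-model term for shaping. Weighting is illustrative, not HERO's.
def hybrid_reward(verifier_correct: bool, rm_score: float, alpha: float = 0.1) -> float:
    """rm_score is assumed to be normalized to [0, 1]."""
    sparse = 1.0 if verifier_correct else 0.0
    return sparse + alpha * rm_score  # dense term is small, so it cannot flip the verifier

print(hybrid_reward(True, 0.8))   # 1.08
print(hybrid_reward(False, 0.9))  # 0.09 -- still below any verified-correct rollout
```

The small dense term preserves the verifier's ranking while still giving gradient signal among rollouts that all fail (or all pass) verification.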
Hope all attendees enjoyed the workshop as much as we did in organizing it!
Was super fun to organize this workshop!! Thanks everyone: speakers, panelists, audience. https://t.co/ccZzIXFgTY
0
0
14
🚨 "Think the right amount" for improving both reasoning accuracy and efficiency! --> Large reasoning models under-adapt = underthink on hard problems and overthink on easy ones --> ✨TRAAC✨ is an online RL, difficulty-adaptive, attention-based compression method that prunes
🚨 Excited to announce TRAAC, an online difficulty-adaptive, attention-based method that handles the tradeoff of under & overthinking in reasoning models to improve both accuracy and efficiency. Underthinking: Models terminate reasoning too early on harder problems, leading
1
17
78
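TRAAC is described above as difficulty-adaptive, attention-based compression of the reasoning trace. Here is a cartoon-level sketch of that idea: keep the most-attended steps, and keep more of them when the problem looks hard. The scoring, keep-ratio schedule, and step granularity are all illustrative assumptions, not TRAAC's algorithm.

```python
# Illustrative only: prune reasoning steps with low attention scores, keeping more
# steps when the problem is estimated to be hard.
import numpy as np

def compress_trace(steps, attn_scores, difficulty):
    """steps: list of reasoning steps; attn_scores: importance per step; difficulty in [0, 1]."""
    keep_ratio = 0.3 + 0.6 * difficulty             # easy -> prune aggressively, hard -> keep most
    k = max(1, int(round(keep_ratio * len(steps))))
    keep_idx = np.argsort(attn_scores)[-k:]         # indices of the k most-attended steps
    return [steps[i] for i in sorted(keep_idx)]     # preserve original step order

steps = ["restate", "plan", "case 1", "case 2", "arithmetic", "final check"]
scores = np.array([0.05, 0.30, 0.10, 0.25, 0.20, 0.10])
print(compress_trace(steps, scores, difficulty=0.2))  # heavily pruned for an easy problem
```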
(🧵) Today, we release Meta Code World Model (CWM), a 32-billion-parameter dense LLM that enables novel research on improving code generation through agentic reasoning and planning with world models. https://t.co/BJSUCh2vtg
60
311
2K
Turns out we can use RLVR to teach a model to aggregate multiple solutions. Check out our new work on parallel test-time scaling! 👇
New Test-time scaling method 📄: https://t.co/yqWvOMZpwq - Use RL to train an LLM solution aggregator - Reasons, reviews, reconciles, and synthesizes a final solution -> Much better than existing techniques! - Simple new method. Strong results across 4 math benchmarks. 🧵1/5
0
0
19
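The aggregator tweet describes sampling several solutions in parallel and training a second model with RLVR to reason over them and synthesize one final answer. A sketch of that flow, with `generate` and `check_answer` as hypothetical helpers rather than the paper's API.

```python
# Sketch of parallel test-time scaling with a learned aggregator: sample several
# candidates, then have an aggregator review, reconcile, and synthesize a final
# solution. The prompt format and helper callables are hypothetical stand-ins.
def aggregate(problem, generate, check_answer, n_candidates=4):
    candidates = [generate(f"Solve: {problem}") for _ in range(n_candidates)]
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    final = generate(
        f"Problem: {problem}\nCandidate solutions:\n{numbered}\n"
        "Review, reconcile, and synthesize a single final solution."
    )
    # During RLVR training, the aggregator's reward is simply whether the
    # synthesized answer verifies as correct.
    reward = 1.0 if check_answer(problem, final) else 0.0
    return final, reward
```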
Post-training with RL causes diversity collapse!! We found a way to directly incorporate semantic diversity as an additional reward that improves both quality and diversity of outputs. 👇
Diversity Aware RL (DARLING) 📄: https://t.co/MH0tui34Cb - Jointly optimizes for quality & diversity using a learned partition function - Outperforms standard RL in quality AND diversity metrics, e.g. higher pass@1/p@k - Works for both non-verifiable & verifiable tasks 🧵1/5
0
0
10
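DARLING adds a semantic-diversity term to the RL reward via a learned partition function. The sketch below swaps that learned component for a crude embedding-distance proxy, purely to show the shape of a quality-times-diversity reward; it is not the paper's method.

```python
# Cartoon version of a diversity-aware reward: a rollout scores high only if it is
# both good and different from the other rollouts in its group. The embedding
# proxy replaces DARLING's learned partition function for illustration only.
import numpy as np

def diversity_bonus(embeddings: np.ndarray, i: int) -> float:
    """Mean cosine distance between rollout i and the other rollouts in its group."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ e[i]
    others = np.delete(sims, i)
    return float(1.0 - others.mean())

def diversity_aware_reward(quality: float, embeddings: np.ndarray, i: int) -> float:
    # Multiplicative combination: quality alone is not enough to score high.
    return quality * diversity_bonus(embeddings, i)
```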
Excited to share that MAgICoRe has been accepted to #EMNLP2025 main! 🎉 Our work identifies 3 key challenges in LLM refinement for reasoning: 1) Over-correction on easy problems 2) Failure to localize and fix its own errors 3) Too few refinement iterations for harder problems
Aggregation & refinement improve LLM reasoning, but aggregation saturates, while refinement has 3 issues: 1) over-correction for easy problems 2) fails to localize+fix its own errors 3) insufficient number of refinement iterations for hard problems 🚨Multi-Agent, Iterative,
0
40
97
Great @AIatMeta paper. Builds a single test that shows when LLMs think too much or too little, then scores both. It targets a gap: reasoning models ramble on easy questions while fast models miss steps on hard ones. They release a benchmark called OptimalThinkingBench with
4
13
60
Thank you for building on our overthinking and underthinking research! OptimalThinkingBench provides exactly what the field needs - a unified framework to measure the sweet spot between excessive and insufficient reasoning. The finding that current methods improve one aspect
🤔Introducing OptimalThinkingBench 🤔 📄: https://t.co/aufQVJp8aC - Thinking LLMs use a lot of tokens & overthink; non-thinking LLMs underthink & underperform. - We introduce a benchmark which scores models in the quest to find the best mix. - OptimalThinkingBench reports the F1
2
2
19
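The OptimalThinkingBench tweet cuts off at "reports the F1"; the cut-off sentence suggests a single score combining an overthinking-side measure on easy queries with an underthinking-side measure on hard queries. A tiny worked example of such a harmonic-mean combination, with made-up sub-scores; the exact sub-score definitions are assumptions.

```python
# Combine an "doesn't overthink easy queries" score and a "doesn't underthink
# hard queries" score with an F1 (harmonic mean). Sub-scores below are made up.
def f1(a: float, b: float) -> float:
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)

overthinking_score = 0.70   # e.g. efficiency/accuracy on simple queries
underthinking_score = 0.55  # e.g. accuracy on reasoning-heavy queries
print(f1(overthinking_score, underthinking_score))  # ~0.616
```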
Got a new efficient/optimally-thinking LLM? Does your model answer simple queries quickly and spend compute on the harder ones? Test it on our new benchmark, OptimalThinkingBench! 👇 Work led by the amazing @PranjalAggarw16 during their internship!
🤔Introducing OptimalThinkingBench 🤔 📄: https://t.co/aufQVJp8aC - Thinking LLMs use a lot of tokens & overthink; non-thinking LLMs underthink & underperform. - We introduce a benchmark which scores models in the quest to find the best mix. - OptimalThinkingBench reports the F1
0
10
79
...is today a good day for new paper posts? 🤔Learning to Reason for Factuality 🤔 📄: https://t.co/ss09xKGcAm - New reward func for GRPO training of long CoTs for *factuality* - Design stops reward hacking by favoring precision, detail AND quality - Improves base model across
1
50
386
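The factuality tweet says the reward is designed to favor precision, detail, and quality together, so the model cannot hack it by making fewer (or vaguer) claims. A toy reward with that property; the components and weights are illustrative, not the paper's function.

```python
# Toy factuality-shaped reward: precision, detail, and quality are multiplied so
# that shrinking the response cannot buy a higher score. Illustrative only.
def factuality_reward(num_claims: int, num_supported: int, quality: float) -> float:
    if num_claims == 0:
        return 0.0                           # refusing to make claims earns nothing
    precision = num_supported / num_claims   # fraction of claims that are supported
    detail = min(1.0, num_claims / 20.0)     # saturating credit for saying more
    return precision * detail * quality

print(factuality_reward(num_claims=10, num_supported=9, quality=0.8))  # 0.36
```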
We worked on a whole line of research on this: - Self-Rewarding LMs (use self as a Judge in semi-online DPO): https://t.co/YZRuFphh0H - Thinking LLMs (learn CoTs with a Judge with semi-online DPO): https://t.co/WVpz7mF7qH *poster at ICML this week!!* - Mix verifiable &
arxiv.org
LLMs are typically trained to answer user questions or follow instructions similarly to how human experts respond. However, in the standard alignment framework they lack the basic ability of...
0
23
199
I'm gonna be at #ICML2025 next week to present EvalPlanner (Thursday between 4.30-7 pm). Please reach out if you'd like to talk about reward models, reasoning, synthetic data, and generally the research we're doing in FAIR.
💭 Introducing EvalPlanner - a method to train a Thinking-LLM-as-a-Judge that learns to generate planning & reasoning CoTs for evaluation. Strong performance on RewardBench, RM-Bench, JudgeBench & FollowBenchEval. Paper 📄: https://t.co/5xySXdlFx9
0
6
63
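EvalPlanner, as described in the quoted tweet, trains a judge to first produce an evaluation plan and then reasoning that executes it. A sketch of what such a two-stage judging call could look like at inference time; `generate` is a hypothetical text-generation callable and the prompts are not the paper's.

```python
# Sketch of plan-then-execute judging: draft an evaluation plan for the
# instruction, then reason through that plan over both responses.
def plan_then_judge(generate, instruction, response_a, response_b):
    plan = generate(
        f"Instruction: {instruction}\n"
        "Write a step-by-step evaluation plan for judging a response to this instruction."
    )
    reasoning = generate(
        f"Instruction: {instruction}\nPlan:\n{plan}\n"
        f"Response A:\n{response_a}\nResponse B:\n{response_b}\n"
        "Execute the plan step by step, then conclude with the better response: [[A]] or [[B]]."
    )
    return plan, reasoning
```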
Check out our new paper where we compared offline and (Semi-)Online DPO with GRPO for post-training LLMs. This led to some interesting findings! 👇
Bridging Offline & Online RL for LLMs 📄: https://t.co/G12TS6Z84n New paper shows on verifiable & non-verifiable tasks: - Online DPO & GRPO give similar performance. - Semi-online (iterative) DPO with sync every s steps (more efficient!) works very well also. - Offline DPO
1
1
8
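The paper tweet contrasts offline, semi-online (sync every s steps), and online DPO. A skeleton of the semi-online loop under those definitions: s = 1 roughly recovers the online setting, while very large s approaches the offline one. `snapshot`, `generate_preference_pairs`, and `dpo_update` are placeholder helpers, not a library API.

```python
# Skeleton of "semi-online" DPO: preference pairs are sampled from a frozen
# snapshot of the policy, which is re-synced only every s steps.
def semi_online_dpo(policy, prompts, s: int, num_steps: int):
    generator = snapshot(policy)                 # frozen copy used for sampling
    for step in range(num_steps):
        if step % s == 0:
            generator = snapshot(policy)         # periodic re-sync with the trained policy
        batch = generate_preference_pairs(generator, prompts)
        dpo_update(policy, batch)                # standard DPO update on the fresh-ish pairs
    return policy
```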
3. J1 Introduces a novel training approach for LLMs to act as evaluators (LLM-as-a-Judge) by explicitly incentivizing thoughtful reasoning during judgment. https://t.co/K19KD6cKlG
🚨 New paper 🚨 J1: Incentivizing Thinking in LLM-as-a-Judge via RL - Converts the judgment task into a verifiable one for both verifiable and non-verifiable prompts. Uses only synthetic pairwise data - Optimizes thoughts, scores, and judgments using GRPO - Outperforms all
1
1
13
For people who don't like Claude's behavior here (and I think it's totally valid to disagree with it), I encourage you to describe your own recommended policy for what agentic models should do when users ask them to help commit heinous crimes. Your options are (1) actively try to
123
41
714
Evaluation of LLMs is difficult due to judge models using limited reasoning and suffering from biases. This paper proposes J1, a method using reinforcement learning to train LLM judges for improved thinking and reduced bias. Methods 🧠: Convert judgment tasks, even
0
1
10