swarnaNLP Profile Banner
Swarnadeep Saha Profile
Swarnadeep Saha

@swarnaNLP

Followers: 2K · Following: 1K · Media: 57 · Statuses: 636

Research Scientist @AIatMeta (FAIR) working on Reasoning. Past: @Google PhD fellow @uncnlp. Gooner.

Seattle, Washington
Joined May 2014
@swarnaNLP
Swarnadeep Saha
6 months
Progress of AI is bottlenecked by the quality of evaluation, motivating the need for powerful and generalist LLM judges that can think and reason. Here's our latest paper, J1, on how to train such Thinking-LLM-Judges with RL. 🧵👇
@jaseweston
Jason Weston
6 months
🚨 New paper 🚨 J1: Incentivizing Thinking in LLM-as-a-Judge via RL - Converts judgement task into a verifiable one for both verifiable and non-verifiable prompts. Uses only synthetic pairwise data - Optimizes thoughts, scores, and judgments using GRPO - Outperforms all
2
2
60
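To make the idea above concrete, here is a minimal sketch of a verifiable pairwise-judgment reward: when the preferred response in a synthetic pair is known in advance, the judge's verdict can be checked directly and used as a 0/1 reward for GRPO-style training. The verdict format and function below are illustrative assumptions, not J1's actual interface.

```python
# Illustrative sketch (not the J1 implementation): score a thinking-judge's
# verdict on a synthetic pairwise example whose preferred response is known.
import re

def pairwise_judgment_reward(judge_output: str, preferred: str) -> float:
    """Return 1.0 if the judge's final verdict names the known-better response.

    Assumes (hypothetically) that the judge ends its chain of thought with a
    line like 'Verdict: A' or 'Verdict: B'.
    """
    match = re.search(r"Verdict:\s*([AB])", judge_output)
    if match is None:
        return 0.0  # unparseable verdicts earn no reward
    return 1.0 if match.group(1) == preferred else 0.0

# Toy usage: both orderings of each pair would appear in training data so
# that position bias cannot be exploited.
example = {
    "judge_output": "Response A cites the source correctly ... Verdict: A",
    "preferred": "A",
}
print(pairwise_judgment_reward(**example))  # 1.0
```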
@jaseweston
Jason Weston
12 days
๐ŸŒถ๏ธSPICE: Self-Play in Corpus Environments๐ŸŒถ๏ธ ๐Ÿ“: https://t.co/ZGBBlV1Vnc - Challenger creates tasks based on *corpora* - Reasoner solves them - Both trained together โš”๏ธ -> automatic curriculum! ๐Ÿ”ฅ Outperforms standard (ungrounded) self-play Grounding fixes hallucination & lack of
7
55
318
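A rough sketch of the corpus-grounded self-play loop described in the SPICE tweet above, with all model calls stubbed out; the Challenger/Reasoner interfaces and the reward shaping are assumptions for illustration, not the paper's implementation.

```python
# Illustrative self-play skeleton (assumed interfaces, not the SPICE code):
# a Challenger writes tasks grounded in corpus passages, a Reasoner solves
# them, and both collect rewards that any RL trainer could consume.
import random

def self_play_round(corpus, challenger, reasoner, n_attempts=8):
    passage = random.choice(corpus)

    # Grounding the task in a passage gives a checkable reference answer
    # and curbs hallucinated or unanswerable questions.
    task, reference = challenger.make_task(passage)

    attempts = [reasoner.solve(task) for _ in range(n_attempts)]
    pass_rate = sum(a == reference for a in attempts) / n_attempts

    # Reasoner is rewarded for correctness.
    reasoner_reward = pass_rate

    # Challenger is rewarded most when the task is neither trivial nor
    # impossible for the current Reasoner -> an automatic curriculum.
    challenger_reward = 1.0 - 2.0 * abs(pass_rate - 0.5)

    return reasoner_reward, challenger_reward
```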
@swarnaNLP
Swarnadeep Saha
17 days
Yi Lin is a brilliant researcher with an abundance of knowledge across different aspects of LLM/VLM training. Hire him!
@yilin_sung
Yi Lin Sung
17 days
Tough week! I also got impacted less than 3 months after joining. Ironically, I just landed some new RL infra features the day before. Life moves on. My past work spans RL, PEFT, Quantization, and Multimodal LLMs. If your team is working on these areas, I'd love to connect.
0
4
24
@MimansaJ
Mimansa Jaiswal
18 days
I was impacted by Meta layoffs today. As a Research Scientist working on LLM post-training (reward models, DPO/GRPO) & automated evaluation pipelines, I've focused on understanding why/where models fail & how to make them better. I'm looking for opportunities; please reach out!
@suchenzang
Susan Zhang
18 days
👀
123
236
3K
@jaseweston
Jason Weston
27 days
Hybrid Reinforcement (HERO): When Reward Is Sparse, It's Better to Be Dense 🦸‍♂️ 💪 📝: https://t.co/VAXtSC4GGp - HERO bridges 0–1 verifiable rewards and dense reward models into one 'hybrid' RL method - Tackles the brittleness of binary signals and the noise of pure reward
4
53
325
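One way to picture the 'hybrid' reward HERO describes above: keep the 0/1 verifiable signal as the dominant term and let a dense reward-model score order samples within each verifier bucket. The exact interpolation below is an assumed formulation, not the paper's.

```python
# Illustrative hybrid reward (assumed formulation, not HERO's exact one):
# the binary verifier decides the coarse bucket, and a group-normalized
# reward-model score breaks ties within each bucket.

def hybrid_rewards(verifier_bits, rm_scores, dense_weight=0.3):
    """verifier_bits: 0/1 verifiable outcomes for a group of sampled responses.
    rm_scores: dense reward-model scores for the same responses."""
    lo, hi = min(rm_scores), max(rm_scores)
    span = (hi - lo) or 1.0
    normalized = [(s - lo) / span for s in rm_scores]  # group-relative, in [0, 1]

    # Verified samples always stay above unverified ones; the dense term
    # only shapes the ordering inside each bucket.
    return [v + dense_weight * n for v, n in zip(verifier_bits, normalized)]

print(hybrid_rewards([1, 1, 0, 0], [0.2, 0.9, 0.8, 0.1]))
# [1.0375, 1.3, 0.2625, 0.0]
```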
@swarnaNLP
Swarnadeep Saha
28 days
Hope all attendees enjoyed the workshop as much as we enjoyed organizing it!
@jaseweston
Jason Weston
28 days
Was super fun to organize this workshop!! Thanks everyone: speakers, panelists, audience. https://t.co/ccZzIXFgTY
0
0
14
@mohitban47
Mohit Bansal
1 month
๐Ÿšจ "Think the right amount" for improving both reasoning accuracy and efficiency! --> Large reasoning models under-adapt = underthink on hard problems and overthink on easy ones --> โœจTRAACโœจ is an online RL, difficulty-adaptive, attention-based compression method that prunes
@joykiratsingh
Joykirat
1 month
🚨 Excited to announce TRAAC, an online difficulty-adaptive, attention-based method that handles the tradeoff of under & overthinking in reasoning models to improve both accuracy and efficiency. Underthinking ❌: Models terminate reasoning too early on harder problems, leading
1
17
78
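A toy sketch of difficulty-adaptive, attention-based compression as described in the TRAAC thread above: score each reasoning step by an importance signal (e.g. attention mass) and keep a larger fraction of steps when the problem is estimated to be hard. The scoring and budget mapping are illustrative assumptions.

```python
# Toy sketch (assumed mechanics, not the TRAAC implementation): prune a
# reasoning trace to a difficulty-adaptive budget using per-step importance
# scores, e.g. attention mass aggregated over each step's tokens.

def compress_trace(steps, step_scores, estimated_difficulty):
    """steps: reasoning steps in order; step_scores: importance per step;
    estimated_difficulty: value in [0, 1], higher = harder problem."""
    # Harder problems keep a larger fraction of the trace.
    keep_fraction = 0.3 + 0.7 * estimated_difficulty
    budget = max(1, round(keep_fraction * len(steps)))

    # Keep the highest-scoring steps but preserve their original order so
    # the compressed chain of thought still reads coherently.
    ranked = sorted(range(len(steps)), key=lambda i: step_scores[i], reverse=True)
    kept = sorted(ranked[:budget])
    return [steps[i] for i in kept]

trace = ["restate problem", "try x=2", "dead end", "factor polynomial", "answer"]
print(compress_trace(trace, [0.1, 0.05, 0.02, 0.5, 0.9], estimated_difficulty=0.2))
# ['factor polynomial', 'answer']
```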
@syhw
Gabriel Synnaeve
2 months
(🧵) Today, we release Meta Code World Model (CWM), a 32-billion-parameter dense LLM that enables novel research on improving code generation through agentic reasoning and planning with world models. https://t.co/BJSUCh2vtg
60
311
2K
@swarnaNLP
Swarnadeep Saha
2 months
Turns out we can use RLVR to teach a model to aggregate multiple solutions. Check out our new work on parallel test-time scaling! 👇
@jaseweston
Jason Weston
2 months
🌀New Test-time scaling method 🌀 📝: https://t.co/yqWvOMZpwq - Use RL to train an LLM solution aggregator – Reasons, reviews, reconciles, and synthesizes a final solution -> Much better than existing techniques! - Simple new method. Strong results across 4 math benchmarks. 🧵1/5
0
0
19
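A minimal sketch of the setup in the aggregation thread above: sample several candidate solutions, have an aggregator model reason over them and synthesize a final answer, and use that answer's verifiable correctness as the RL reward. Prompt wording and model interfaces are assumptions.

```python
# Illustrative sketch of aggregation as an RLVR task (assumed interfaces):
# only the aggregator's final answer is checked against the reference,
# which is the verifiable signal used to train it.

def aggregation_episode(problem, reference_answer, solver, aggregator, n=4):
    candidates = [solver.generate(problem) for _ in range(n)]

    prompt = (
        f"Problem: {problem}\n"
        + "\n".join(f"Candidate {i + 1}: {c}" for i, c in enumerate(candidates))
        + "\nReview, reconcile, and synthesize one final solution."
    )
    final_answer = aggregator.generate(prompt)

    # Verifiable reward: exact match here; in practice a math checker would
    # handle answers that are equivalent up to formatting.
    reward = 1.0 if final_answer.strip() == reference_answer.strip() else 0.0
    return prompt, final_answer, reward
```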
@swarnaNLP
Swarnadeep Saha
2 months
Post-training with RL causes diversity collapse!! We found a way to directly incorporate semantic diversity as an additional reward that improves both quality and diversity of outputs. 👇
@jaseweston
Jason Weston
2 months
🌀Diversity Aware RL (DARLING)🌀 📝: https://t.co/MH0tui34Cb - Jointly optimizes for quality & diversity using a learned partition function - Outperforms standard RL in quality AND diversity metrics, e.g. higher pass@1/p@k - Works for both non-verifiable & verifiable tasks 🧵1/5
0
0
10
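A rough sketch of adding a semantic-diversity term to the per-sample reward, as the DARLING thread above describes; the cluster-count diversity measure below stands in for the paper's learned partition function and is only an assumption.

```python
# Illustrative diversity-aware reward (assumed form): responses in a group
# are partitioned into semantic clusters, and each response's quality score
# is scaled up when its cluster is rare within the group.
from collections import Counter

def diversity_aware_rewards(quality_scores, cluster_ids, diversity_weight=1.0):
    """quality_scores: per-sample quality rewards for one prompt's group.
    cluster_ids: semantic-cluster label per sample (a stand-in for a learned
    partition function)."""
    counts = Counter(cluster_ids)
    rewards = []
    for quality, cluster in zip(quality_scores, cluster_ids):
        diversity = 1.0 / counts[cluster]  # rarer clusters earn a larger bonus
        rewards.append(quality * (1.0 + diversity_weight * diversity))
    return rewards

# Three samples collapse onto one semantic cluster, one explores another mode.
print(diversity_aware_rewards([0.8, 0.8, 0.7, 0.6], ["a", "a", "a", "b"]))
```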
@cyjustinchen
Justin Chih-Yao Chen
3 months
Excited to share that MAgICoRe has been accepted to #EMNLP2025 main! 🎉 Our work identifies 3 key challenges in LLM refinement for reasoning: 1) Over-correction on easy problems 2) Failure to localize and fix its own errors 3) Too few refinement iterations for harder problems
@cyjustinchen
Justin Chih-Yao Chen
1 year
Aggregation & refinement improve LLM reasoning, but aggregation saturates, while refinement has 3 issues: 1) over-correction for easy problems 2) fails to localize+fix its own errors 3) insufficient number of refinement iterations for hard problems 🚨Multi-Agent, Iterative,
0
40
97
@rohanpaul_ai
Rohan Paul
3 months
Great @AIatMeta paper. Builds a single test that shows when LLMs think too much or too little, then scores both. It targets a gap: reasoning models ramble on easy questions while fast models miss steps on hard ones. They release a benchmark called OptimalThinkingBench with
4
13
60
@tuzhaopeng
Zhaopeng Tu
3 months
Thank you for building on our overthinking and underthinking research! OptimalThinkingBench provides exactly what the field needs - a unified framework to measure the sweet spot between excessive and insufficient reasoning. The finding that current methods improve one aspect
@jaseweston
Jason Weston
3 months
🤖Introducing OptimalThinkingBench 🤖 📝: https://t.co/aufQVJp8aC - Thinking LLMs use a lot of tokens & overthink; non-thinking LLMs underthink & underperform. - We introduce a benchmark which scores models in the quest to find the best mix. - OptimalThinkingBench reports the F1
2
2
19
@swarnaNLP
Swarnadeep Saha
3 months
Got a new efficient/optimally-thinking LLM? Does your model answer simple queries quickly and spend compute on the harder ones? Test it on our new benchmark, OptimalThinkingBench! 👇 Work led by the amazing @PranjalAggarw16 during his internship!
@jaseweston
Jason Weston
3 months
🤖Introducing OptimalThinkingBench 🤖 📝: https://t.co/aufQVJp8aC - Thinking LLMs use a lot of tokens & overthink; non-thinking LLMs underthink & underperform. - We introduce a benchmark which scores models in the quest to find the best mix. - OptimalThinkingBench reports the F1
0
10
79
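A small sketch of how the single F1-style number mentioned above can combine the two failure modes: one sub-score for not overthinking easy queries, one for not underthinking hard ones, aggregated by a harmonic mean so neither side can be ignored. The exact sub-scores in OptimalThinkingBench may be defined differently.

```python
# Illustrative F1-style summary (assumed sub-scores): one score rewards not
# overthinking easy queries, the other rewards solving hard ones, and the
# harmonic mean punishes models that optimize only one side.

def optimal_thinking_f1(overthinking_score: float, underthinking_score: float) -> float:
    """Both inputs are assumed to lie in [0, 1]; higher is better."""
    if overthinking_score + underthinking_score == 0:
        return 0.0
    return (2 * overthinking_score * underthinking_score
            / (overthinking_score + underthinking_score))

# A model that is concise on easy queries but weak on hard ones.
print(round(optimal_thinking_f1(0.9, 0.4), 3))  # 0.554
```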
@jaseweston
Jason Weston
3 months
...is today a good day for new paper posts? 🤖Learning to Reason for Factuality 🤖 📝: https://t.co/ss09xKGcAm - New reward func for GRPO training of long CoTs for *factuality* - Design stops reward hacking by favoring precision, detail AND quality - Improves base model across
1
50
386
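One hedged way to read the reward design sketched in the factuality tweet above: combine precision, detail, and quality so that inflating any single term cannot raise the reward on its own. The min-based combination below is an assumption, not the paper's actual reward.

```python
# Illustrative combined factuality reward (assumed form, not the paper's):
# taking the minimum of the normalized terms means inflating one signal
# (e.g. padding with trivial detail) cannot raise the reward on its own.

def factuality_reward(precision: float, detail: float, quality: float) -> float:
    """All inputs assumed normalized to [0, 1], e.g. precision = fraction of
    supported claims, detail = claim count relative to a target, quality = a
    general response-quality score."""
    return min(precision, detail, quality)

print(factuality_reward(precision=0.9, detail=0.4, quality=0.8))  # 0.4
```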
@jaseweston
Jason Weston
4 months
We worked on a whole line of research on this: - Self-Rewarding LMs (use self as a Judge in semi-online DPO): https://t.co/YZRuFphh0H - Thinking LLMs (learn CoTs with a Judge with semi-online DPO): https://t.co/WVpz7mF7qH *poster at ICML this week!!* - Mix verifiable &
arxiv.org
LLMs are typically trained to answer user questions or follow instructions similarly to how human experts respond. However, in the standard alignment framework they lack the basic ability of...
@Grad62304977
Grad
4 months
Still surprised this doesn't start reward hacking
0
23
199
@swarnaNLP
Swarnadeep Saha
4 months
I'm gonna be at #ICML2025 next week to present EvalPlanner (Thursday between 4.30-7 pm). Please reach out if you'd like to talk about reward models, reasoning, synthetic data, and generally the research we're doing in FAIR.
@jaseweston
Jason Weston
9 months
💭🔎 Introducing EvalPlanner – a method to train a Thinking-LLM-as-a-Judge that learns to generate planning & reasoning CoTs for evaluation. Strong performance on RewardBench, RM-Bench, JudgeBench & FollowBenchEval. Paper 📄: https://t.co/5xySXdlFx9
0
6
63
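A minimal sketch of the plan-then-reason judging flow described in the EvalPlanner tweet above: the judge first drafts an evaluation plan, then executes it as a reasoning chain that ends in a verdict. The prompts and the two-call interface are illustrative assumptions.

```python
# Illustrative plan-then-reason judge (assumed prompts and interface, not
# EvalPlanner's): the evaluation plan is generated first, then executed as a
# reasoning chain that ends in a verdict.

def plan_then_judge(judge_model, instruction, response_a, response_b):
    plan = judge_model.generate(
        "Write a step-by-step plan for how to evaluate two responses to this "
        f"instruction:\n{instruction}"
    )
    verdict = judge_model.generate(
        f"Instruction:\n{instruction}\n\nResponse A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\nFollow the evaluation plan step by "
        f"step, then end with 'Verdict: A' or 'Verdict: B'.\n\nPlan:\n{plan}"
    )
    return plan, verdict
```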
@swarnaNLP
Swarnadeep Saha
4 months
Check out our new paper where we compared offline and (Semi-)Online DPO with GRPO for post-training LLMs. This led to some interesting findings! 👇
@jaseweston
Jason Weston
4 months
🌉 Bridging Offline & Online RL for LLMs 🌉 📝: https://t.co/G12TS6Z84n New paper shows on verifiable & non-verifiable tasks: - Online DPO & GRPO give similar performance. - Semi-online (iterative) DPO with sync every s steps (more efficient!) works very well also. - Offline DPO
1
1
8
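A skeleton of the semi-online idea from the thread above: preference pairs are generated by a snapshot of the policy that is refreshed only every s steps, so training interpolates between offline (never refresh) and fully online (refresh every step) DPO. Model, judge, and trainer calls are stubbed out as assumptions.

```python
# Illustrative semi-online DPO loop (assumed structure, not the paper's code):
# preference pairs come from a snapshot of the policy that is refreshed only
# every `sync_every` steps.
import copy

def semi_online_dpo(policy, prompts, judge, dpo_step, sync_every=64, total_steps=1024):
    generator = copy.deepcopy(policy)  # snapshot used to sample pairs

    for step in range(total_steps):
        if step % sync_every == 0:
            # sync_every=1 recovers online DPO; a very large value
            # approaches the offline setting.
            generator = copy.deepcopy(policy)

        prompt = prompts[step % len(prompts)]
        a, b = generator.sample(prompt), generator.sample(prompt)

        # A judge (or a verifier, for verifiable tasks) labels the pair.
        chosen, rejected = (a, b) if judge.prefers(prompt, a, b) else (b, a)

        dpo_step(policy, prompt, chosen, rejected)
    return policy
```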
@dair_ai
DAIR.AI
6 months
3. J1 Introduces a novel training approach for LLMs to act as evaluators (LLM-as-a-Judge) by explicitly incentivizing thoughtful reasoning during judgment. https://t.co/K19KD6cKlG
@jaseweston
Jason Weston
6 months
🚨 New paper 🚨 J1: Incentivizing Thinking in LLM-as-a-Judge via RL - Converts judgement task into a verifiable one for both verifiable and non-verifiable prompts. Uses only synthetic pairwise data - Optimizes thoughts, scores, and judgments using GRPO - Outperforms all
1
1
13
@johnschulman2
John Schulman
6 months
For people who don't like Claude's behavior here (and I think it's totally valid to disagree with it), I encourage you to describe your own recommended policy for what agentic models should do when users ask them to help commit heinous crimes. Your options are (1) actively try to
123
41
714
@rohanpaul_ai
Rohan Paul
6 months
Evaluation of LLMs is difficult due to judge models using limited reasoning and suffering from biases. This paper proposes J1, a method using reinforcement learning to train LLM judges for improved thinking and reduced bias. Methods 🔧: → Convert judgment tasks, even
0
1
10