
Swarnadeep Saha
@swarnaNLP
Followers: 2K · Following: 1K · Media: 57 · Statuses: 622
Research Scientist @AIatMeta (FAIR) working on Reasoning. Past: @Google PhD fellow @uncnlp. Gooner.
Seattle, Washington
Joined May 2014
Progress in AI is bottlenecked by the quality of evaluation, motivating the need for powerful and generalist LLM judges that can think and reason. Here's our latest paper, J1, on how to train such Thinking-LLM-Judges with RL.
New paper: J1: Incentivizing Thinking in LLM-as-a-Judge via RL.
- Converts the judgment task into a verifiable one for both verifiable and non-verifiable prompts, using only synthetic pairwise data.
- Optimizes thoughts, scores, and judgments using GRPO.
- Outperforms all …
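For context on how a pairwise judgment becomes a verifiable RL target, here is a minimal sketch (not the paper's code; the function name and the "Verdict: A/B" output format are assumptions): with synthetic pairs, the better response is known by construction, so the judge's final verdict can be scored exactly and used as a reward for GRPO-style training while the thought trace stays free-form.

```python
import re


def judge_verdict_reward(judge_output: str, known_better: str) -> float:
    """Score a Thinking-Judge rollout against synthetic pairwise data.

    `known_better` is 'A' or 'B', known by construction of the synthetic
    pair, which is what makes the judgment task verifiable. The thought
    trace is left free-form; only the final verdict line is checked.
    """
    match = re.search(r"verdict:\s*([AB])\b", judge_output, re.IGNORECASE)
    if match is None:
        return 0.0  # unparseable judgments receive no reward
    return 1.0 if match.group(1).upper() == known_better.upper() else 0.0


# Example rollout: free-form thinking followed by a parsable verdict.
rollout = "Thinking: A answers every part of the prompt and cites its source. Verdict: A"
print(judge_verdict_reward(rollout, known_better="A"))  # -> 1.0
```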
RT @jaseweston: Is today a good day for new paper posts? Learning to Reason for Factuality. - New reward f…
RT @jaseweston: We worked on a whole line of research on this: - Self-Rewarding LMs (use self as a Judge in semi-online DPO): https://t.co…
arxiv.org: LLMs are typically trained to answer user questions or follow instructions similarly to how human experts respond. However, in the standard alignment framework they lack the basic ability of...
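For readers unfamiliar with the self-rewarding setup mentioned in the retweet above, here is a minimal sketch of how a model can act as its own judge to build DPO preference pairs; the helper names (`generate`, `self_judge_score`) are illustrative placeholders, not the paper's code.

```python
def build_self_rewarded_pairs(model, prompts, *, num_candidates,
                              generate, self_judge_score):
    """Construct (prompt, chosen, rejected) triples for DPO where the same
    model both generates the candidates and scores them as a judge."""
    pairs = []
    for prompt in prompts:
        candidates = [generate(model, prompt) for _ in range(num_candidates)]
        # The model judges its own outputs, e.g. via an LLM-as-a-Judge prompt.
        scored = sorted(candidates,
                        key=lambda c: self_judge_score(model, prompt, c))
        pairs.append((prompt, scored[-1], scored[0]))  # best vs. worst
    return pairs
```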
I'm gonna be at #ICML2025 next week to present EvalPlanner (Thursday between 4:30-7 pm). Please reach out if you'd like to talk about reward models, reasoning, synthetic data, and generally the research we're doing in FAIR.
Introducing EvalPlanner: a method to train a Thinking-LLM-as-a-Judge that learns to generate planning & reasoning CoTs for evaluation. Strong performance on RewardBench, RM-Bench, JudgeBench & FollowBenchEval. Paper:
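To make the "planning & reasoning CoTs for evaluation" idea concrete, here is a minimal sketch of a plan-then-execute judge prompt in the spirit of EvalPlanner; the template text and function name are assumptions for illustration, not the paper's actual prompt.

```python
# Hypothetical prompt: the judge first drafts an evaluation plan for the
# specific instruction, then reasons through it, and only then gives a verdict.
JUDGE_TEMPLATE = """You are comparing two responses to the same instruction.

Instruction:
{instruction}

Response A:
{response_a}

Response B:
{response_b}

Step 1 (Plan): list the evaluation criteria that matter for this instruction.
Step 2 (Reason): check both responses against each criterion from your plan.
Step 3 (Verdict): end with exactly one line, "Verdict: A" or "Verdict: B".
"""


def build_judge_prompt(instruction: str, response_a: str, response_b: str) -> str:
    """Fill the plan-then-reason template for one pairwise comparison."""
    return JUDGE_TEMPLATE.format(
        instruction=instruction,
        response_a=response_a,
        response_b=response_b,
    )
```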
Check out our new paper where we compared offline and (semi-)online DPO with GRPO for post-training LLMs. This led to some interesting findings!
Bridging Offline & Online RL for LLMs. New paper shows, on verifiable & non-verifiable tasks:
- Online DPO & GRPO give similar performance.
- Semi-online (iterative) DPO with a sync every s steps (more efficient!) also works very well.
- Offline DPO …
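As a rough picture of the semi-online setting described above, here is a sketch of an iterative DPO loop where the generation copy of the model is refreshed every `sync_every` optimizer steps; all helper names (`sample_pair`, `label_preference`, `dpo_step`) are placeholders, not the released training code.

```python
def semi_online_dpo(policy, generator, prompts, *, sync_every, num_steps,
                    sample_pair, label_preference, dpo_step):
    """Iterative DPO that interpolates between offline and online training.

    sync_every=1 approximates fully online DPO (generate from the current
    policy every step); sync_every=num_steps degenerates to offline DPO
    (one fixed generation model). `policy` and `generator` are assumed to
    be torch nn.Modules sharing the same architecture.
    """
    for step in range(num_steps):
        prompt = prompts[step % len(prompts)]
        # Responses come from the (possibly stale) generator copy.
        y1, y2 = sample_pair(generator, prompt)
        chosen, rejected = label_preference(prompt, y1, y2)
        dpo_step(policy, prompt, chosen, rejected)
        # Periodic sync: copy the latest policy weights into the generator.
        if (step + 1) % sync_every == 0:
            generator.load_state_dict(policy.state_dict())
    return policy
```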
RT @dair_ai: 3. J1. Introduces a novel training approach for LLMs to act as evaluators (LLM-as-a-Judge) by explicitly incentivizing thought…
RT @johnschulman2: For people who don't like Claude's behavior here (and I think it's totally valid to disagree with it), I encourage you t…
RT @rohanpaul_ai: Evaluation of LLMs is difficult due to judge models using limited reasoning and suffering from biases. This paper propos…
RT @TheTuringPost: The freshest research of the week. Our top 9:
- Beyond 'Aha!'
- J1: Incentivizing Thinking in LLM-as-a-Judge via Rein…
We're organizing the RAM 2 workshop at COLM 2025 (10 years after the first edition at NeurIPS 2015). Check out our Call for Papers on topics in Reasoning, Attention, and Memory.
Announcing RAM 2 workshop @ COLM25 - call for papers. 10 years on, we present the sequel to the classic RAM (Reasoning, Attention, Memory) workshop that took place in 2015 at the cusp of major change in the area. Now in 2025 we reflect on what's happened and discuss the …
RT @chenxi_jw: Presenting new work: Thinking LLM-as-a-Judge via RL! It's been great fun working with @swarnaNLP, @jaseweston, @uralik1 and…
RT @NathanThinks: excellent work by @jaseweston & team, extending our "Generative Reward Models" work with RL (GRPO) to optimize LLM reasoni…
Excited to share that EvalPlanner is accepted to #ICML2025! To make meaningful progress in AI, we need strong evaluators, and specifically those that can reason. Stay tuned for more updates, as we continue to make progress in this space!
Introducing EvalPlanner: a method to train a Thinking-LLM-as-a-Judge that learns to generate planning & reasoning CoTs for evaluation. Strong performance on RewardBench, RM-Bench, JudgeBench & FollowBenchEval. Paper:
RT @SomnathBrc: How can we perfectly erase concepts from LLMs? Our method, Perfect Erasure Functions (PEF), erases concepts from LLM repre…
RT @tesatory: Ten years ago in 2015 we published a paper called End-to-End Memory Networks. Looking back, this pa…
RT @jaseweston: Multi-Token Attention. Attention is critical for LLMs, but its weights are computed by single…