Swarnadeep Saha

@swarnaNLP

Followers: 2K
Following: 1K
Media: 57
Statuses: 622

Research Scientist @AIatMeta (FAIR) working on Reasoning. Past: @Google PhD fellow @uncnlp. Gooner.

Seattle, Washington
Joined May 2014
@swarnaNLP
Swarnadeep Saha
3 months
Progress of AI is bottlenecked by the quality of evaluation, motivating the need for powerful and generalist LLM judges that can think and reason. Here's our latest paper, J1, on how to train such Thinking-LLM-Judges with RL. 🧵👇
@jaseweston
Jason Weston
3 months
🚨 New paper 🚨 J1: Incentivizing Thinking in LLM-as-a-Judge via RL.
- Converts the judgement task into a verifiable one for both verifiable and non-verifiable prompts. Uses only synthetic pairwise data.
- Optimizes thoughts, scores, and judgments using GRPO.
- Outperforms all…
2
3
59
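The J1 setup described above hinges on making judgment verifiable: because the preference pairs are synthetic, the better response is known in advance, so a judge's final verdict can be scored exactly and used as an RL reward. A minimal sketch of that idea follows; it is illustrative only, not the paper's code, and the "Verdict: A/B" output format is an assumption.

```python
import re

def verdict_reward(judge_output: str, gold_preferred: str) -> float:
    """Reward 1.0 if the judge's final verdict matches the known-better response.

    Assumes the judge ends its chain of thought with a line like 'Verdict: A'
    or 'Verdict: B'; gold_preferred is 'A' or 'B' from the synthetic pair.
    """
    match = re.search(r"Verdict:\s*([AB])", judge_output, flags=re.IGNORECASE)
    if match is None:
        return 0.0  # unparsable judgments earn no reward
    return 1.0 if match.group(1).upper() == gold_preferred else 0.0
```

Under GRPO, a group of sampled judgments for the same pair would each be scored this way, with advantages computed relative to the group mean.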
@swarnaNLP
Swarnadeep Saha
1 day
RT @jaseweston: Is today a good day for new paper posts? 🤖 Learning to Reason for Factuality 🤖 📝: - New reward f…
0
46
0
@swarnaNLP
Swarnadeep Saha
28 days
I'm gonna be at #ICML2025 next week to present EvalPlanner (Thursday between 4.30-7 pm). Please reach out if you'd like to talk about reward models, reasoning, synthetic data, and generally the research we're doing in FAIR.
@jaseweston
Jason Weston
6 months
💭🔎 Introducing EvalPlanner – a method to train a Thinking-LLM-as-a-Judge that learns to generate planning & reasoning CoTs for evaluation. Strong performance on RewardBench, RM-Bench, JudgeBench & FollowBenchEval. Paper 📄:
0
6
63
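EvalPlanner's core move, per the tweets above, is to have the judge draft an evaluation plan and then reason through it before rendering a verdict. Below is a rough two-stage sketch of that structure; the prompt wording, the two-call decomposition, and the generate() callable are assumptions for illustration, not the paper's implementation.

```python
def judge_with_plan(generate, instruction: str, response_a: str, response_b: str) -> str:
    """Plan-then-execute judging: draft evaluation criteria first, then apply them."""
    plan_prompt = (
        "You are judging two responses to an instruction.\n"
        f"Instruction: {instruction}\n"
        "First, write an evaluation plan: the criteria and steps you will use."
    )
    plan = generate(plan_prompt)  # stage 1: an unconstrained evaluation plan

    execute_prompt = (
        f"Instruction: {instruction}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        f"Evaluation plan:\n{plan}\n"
        "Follow the plan step by step, then end with 'Verdict: A' or 'Verdict: B'."
    )
    return generate(execute_prompt)  # stage 2: execute the plan and give a verdict
```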
@swarnaNLP
Swarnadeep Saha
1 month
Check out our new paper where we compared offline and (Semi-)Online DPO with GRPO for post-training LLMs. This led to some interesting findings! 👇
@jaseweston
Jason Weston
1 month
🌉 Bridging Offline & Online RL for LLMs 🌉
📝: New paper shows on verifiable & non-verifiable tasks:
- Online DPO & GRPO give similar performance.
- Semi-online (iterative) DPO with sync every s steps (more efficient!) works very well also.
- Offline DPO…
1
1
8
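The "semi-online" variant in the tweet above sits between offline DPO (preference data generated once, up front) and fully online DPO (data generated from the current policy at every step): rollouts come from a snapshot of the policy that is re-synced only every s steps. Here is a hedged sketch of that loop, with policy, sample_pair, label_pair, and dpo_step as assumed helpers rather than any particular library's API.

```python
import copy

def semi_online_dpo(policy, prompts, sample_pair, label_pair, dpo_step,
                    total_steps: int, sync_every: int):
    """Iterative DPO where the rollout model is re-synced with the policy every `sync_every` steps."""
    generator = copy.deepcopy(policy)  # frozen snapshot used to generate responses
    for step in range(total_steps):
        if step % sync_every == 0:
            generator = copy.deepcopy(policy)  # periodic sync: the "semi-online" part
        prompt = prompts[step % len(prompts)]
        resp_a, resp_b = sample_pair(generator, prompt)        # rollouts from the snapshot
        chosen, rejected = label_pair(prompt, resp_a, resp_b)  # e.g. a reward model or verifier
        dpo_step(policy, prompt, chosen, rejected)             # one DPO gradient update
    return policy
```

Setting sync_every=1 recovers fully online DPO; never re-syncing after the initial snapshot is the offline case.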
@swarnaNLP
Swarnadeep Saha
2 months
RT @dair_ai: 3. J1. Introduces a novel training approach for LLMs to act as evaluators (LLM-as-a-Judge) by explicitly incentivizing thought…
0
1
0
@swarnaNLP
Swarnadeep Saha
3 months
RT @johnschulman2: For people who don't like Claude's behavior here (and I think it's totally valid to disagree with it), I encourage you t…
0
41
0
@swarnaNLP
Swarnadeep Saha
3 months
RT @rohanpaul_ai: Evaluation of LLMs is difficult due to judge models using limited reasoning and suffering from biases. This paper propos…
0
1
0
@swarnaNLP
Swarnadeep Saha
3 months
RT @TheTuringPost: The freshest research of the week. Our top 9: ▪️ Beyond 'Aha!' ▪️ J1: Incentivizing Thinking in LLM-as-a-Judge via Rein…
0
10
0
@swarnaNLP
Swarnadeep Saha
3 months
We're organizing the RAM 2 workshop at COLM 2025 (10 years after the first edition at NeurIPS 2015). Check out our Call for Papers on topics in Reasoning, Attention, and Memory.
@jaseweston
Jason Weston
3 months
🚨Announcing RAM 2 workshop @ COLM25 - call for papers🚨
- 10 years on, we present the sequel to the classic RAM🐏 (Reasoning, Attention, Memory) workshop that took place in 2015 at the cusp of major change in the area. Now in 2025 we reflect on what's happened and discuss the…
0
0
5
@swarnaNLP
Swarnadeep Saha
3 months
RT @chenxi_jw: Presenting new work: Thinking LLM-as-a-Judge via RL! It's been great fun working with @swarnaNLP, @jaseweston, @uralik1 and…
0
1
0
@swarnaNLP
Swarnadeep Saha
3 months
RT @NathanThinks: excellent work by @jaseweston & team - extending our "Generative Reward Models" work with RL (GRPO) to optimize LLM reasoni…
0
11
0
@swarnaNLP
Swarnadeep Saha
3 months
Check out our paper for more analysis and ablations, including:
- score distribution of Pointwise-J1 models
- different reward schemes
- different seed thinking prompts
- reward + thought lengths
Fun Fact: Even before the project started, we knew what we wanted to call it 😀
0
0
1
@swarnaNLP
Swarnadeep Saha
3 months
Next, Pointwise-J1 at 8B + 70B scales:
1️⃣ Mitigates position bias
2️⃣ Improves position-consistent accuracy
3️⃣ Reduces ties in pairwise judgments
Finally, test-time scaling of J1 leads to further improvements, for both Pairwise and Pointwise models, at both scales.
1
0
0
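"Position-consistent accuracy" in the tweet above is typically measured by querying a pairwise judge with both response orderings and crediting an example only when the two verdicts agree and match the gold preference; disagreements across orderings are exactly the position-bias and tie cases. An illustrative sketch, where the judge callable and the 'A'/'B' convention are assumptions:

```python
def position_consistent_accuracy(judge, examples) -> float:
    """Fraction of examples judged correctly under both response orderings.

    `judge(prompt, first, second)` is assumed to return 'A' (prefers first)
    or 'B' (prefers second); each example is (prompt, resp_a, resp_b, gold).
    """
    correct = 0
    for prompt, resp_a, resp_b, gold in examples:
        v1 = judge(prompt, resp_a, resp_b)               # original order
        v2 = judge(prompt, resp_b, resp_a)               # swapped order
        v2_unswapped = {"A": "B", "B": "A"}.get(v2, v2)  # map back to original labels
        if v1 == v2_unswapped == gold:                   # consistent AND correct
            correct += 1
    return correct / len(examples)
```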
@swarnaNLP
Swarnadeep Saha
3 months
We tested J1 on 5 benchmarks w/ verifiable + non-verifiable + multilingual instructions at 8B + 70B scales. First, Pairwise-J1 outperforms:
1️⃣ open + closed LLM judges
2️⃣ SOTA scalar + generative RMs
3️⃣ R1-distilled-Llama + o1-mini
4️⃣ a much larger R1 on non-verifiable tasks
1
0
1
@swarnaNLP
Swarnadeep Saha
3 months
๐Ÿง‘โ€๐Ÿณ J1 Recipe:. -- Generate synthetic preference pairs as training data for both verifiable+non-verifiable tasks. -- Train Pairwise-J1 using GRPO with verdict correctness+consistency rewards. -- Train Pointwise-J1 using GRPO with distant pairwise supervision + score-based rewards
1
0
2
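For the Pointwise-J1 step in the recipe above, the pairwise label acts only as distant supervision: each response is scored independently, and the reward checks that the known-better response receives the higher score. A loose sketch under that reading; the "Score: N" output format and the helper names are assumptions, not the paper's implementation.

```python
import re
from typing import Optional

def parse_score(judge_output: str) -> Optional[float]:
    """Pull a numeric score out of a judgment assumed to end with 'Score: <number>'."""
    m = re.search(r"Score:\s*(\d+(?:\.\d+)?)", judge_output)
    return float(m.group(1)) if m else None

def pointwise_pair_reward(output_for_chosen: str, output_for_rejected: str) -> float:
    """Reward 1.0 when the independently scored chosen response outranks the rejected one."""
    s_chosen = parse_score(output_for_chosen)
    s_rejected = parse_score(output_for_rejected)
    if s_chosen is None or s_rejected is None:
        return 0.0  # unparsable scores earn nothing
    return 1.0 if s_chosen > s_rejected else 0.0
```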
@swarnaNLP
Swarnadeep Saha
3 months
Excited to share that EvalPlanner is accepted to #ICML2025! To make meaningful progress in AI, we need strong evaluators, and specifically those that can reason. Stay tuned for more updates, as we continue to make progress in this space! 😀
@jaseweston
Jason Weston
6 months
💭🔎 Introducing EvalPlanner – a method to train a Thinking-LLM-as-a-Judge that learns to generate planning & reasoning CoTs for evaluation. Strong performance on RewardBench, RM-Bench, JudgeBench & FollowBenchEval. Paper 📄:
5
11
82
@swarnaNLP
Swarnadeep Saha
3 months
RT @SomnathBrc: How can we perfectly erase concepts from LLMs? Our method, Perfect Erasure Functions (PEF), erases concepts from LLM repre…
0
35
0
@swarnaNLP
Swarnadeep Saha
4 months
RT @tesatory: Ten years ago in 2015 we published a paper called End-to-End Memory Networks (. Looking back, this pa…
0
120
0
@swarnaNLP
Swarnadeep Saha
4 months
RT @jaseweston: 🚨Multi-Token Attention🚨 📝: Attention is critical for LLMs, but its weights are computed by single…
0
148
0