Swarnadeep Saha

@swarnaNLP

Followers: 2K
Following: 1K
Media: 57
Statuses: 622

Research Scientist @AIatMeta (FAIR) working on Reasoning. Past: @Google PhD fellow @uncnlp. Gooner.

Seattle, Washington
Joined May 2014
@swarnaNLP
Swarnadeep Saha
3 months
Progress of AI is bottlenecked by the quality of evaluation, motivating the need for powerful and generalist LLM judges that can think and reason. Here's our latest paper, J1, on how to train such Thinking-LLM-Judges with RL. 🧵👇
@jaseweston
Jason Weston
3 months
🚨 New paper 🚨 J1: Incentivizing Thinking in LLM-as-a-Judge via RL.
- Converts the judgement task into a verifiable one for both verifiable and non-verifiable prompts. Uses only synthetic pairwise data.
- Optimizes thoughts, scores, and judgments using GRPO.
- Outperforms all…
2
3
59
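The J1 setup described above hinges on making judgment verifiable: because the preference pairs are synthetic, the better response is known in advance, so a judge's final verdict can be scored exactly and used as an RL reward. A minimal sketch of that idea follows; it is illustrative only, not the paper's code, and the "Verdict: A/B" output format is an assumption.

```python
import re

def verdict_reward(judge_output: str, gold_preferred: str) -> float:
    """Reward 1.0 if the judge's final verdict matches the known-better response.

    Assumes the judge ends its chain of thought with a line like 'Verdict: A'
    or 'Verdict: B'; gold_preferred is 'A' or 'B' from the synthetic pair.
    """
    match = re.search(r"Verdict:\s*([AB])", judge_output, flags=re.IGNORECASE)
    if match is None:
        return 0.0  # unparsable judgments earn no reward
    return 1.0 if match.group(1).upper() == gold_preferred else 0.0
```

Under GRPO, a group of sampled judgments for the same pair would each be scored this way, with advantages computed relative to the group mean.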
@swarnaNLP
Swarnadeep Saha
1 day
RT @jaseweston: Is today a good day for new paper posts? 🤖 Learning to Reason for Factuality 🤖 📝: - New reward f…
0
46
0
@swarnaNLP
Swarnadeep Saha
28 days
I'm gonna be at #ICML2025 next week to present EvalPlanner (Thursday between 4.30-7 pm). Please reach out if you'd like to talk about reward models, reasoning, synthetic data, and generally the research we're doing in FAIR.
@jaseweston
Jason Weston
6 months
💭🔎 Introducing EvalPlanner – a method to train a Thinking-LLM-as-a-Judge that learns to generate planning & reasoning CoTs for evaluation. Strong performance on RewardBench, RM-Bench, JudgeBench & FollowBenchEval. Paper 📄:
0
6
63
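EvalPlanner's core move, per the tweets above, is to have the judge draft an evaluation plan and then reason through it before rendering a verdict. Below is a rough two-stage sketch of that structure; the prompt wording, the two-call decomposition, and the generate() callable are assumptions for illustration, not the paper's implementation.

```python
def judge_with_plan(generate, instruction: str, response_a: str, response_b: str) -> str:
    """Plan-then-execute judging: draft evaluation criteria first, then apply them."""
    plan_prompt = (
        "You are judging two responses to an instruction.\n"
        f"Instruction: {instruction}\n"
        "First, write an evaluation plan: the criteria and steps you will use."
    )
    plan = generate(plan_prompt)  # stage 1: an unconstrained evaluation plan

    execute_prompt = (
        f"Instruction: {instruction}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        f"Evaluation plan:\n{plan}\n"
        "Follow the plan step by step, then end with 'Verdict: A' or 'Verdict: B'."
    )
    return generate(execute_prompt)  # stage 2: execute the plan and give a verdict
```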
@swarnaNLP
Swarnadeep Saha
1 month
Check out our new paper where we compared offline and (Semi-)Online DPO with GRPO for post-training LLMs. This led to some interesting findings! 👇
@jaseweston
Jason Weston
1 month
🌉 Bridging Offline & Online RL for LLMs 🌉
📝: New paper shows on verifiable & non-verifiable tasks:
- Online DPO & GRPO give similar performance.
- Semi-online (iterative) DPO with sync every s steps (more efficient!) works very well also.
- Offline DPO…
1
1
8
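The "semi-online" variant in the tweet above sits between offline DPO (preference data generated once, up front) and fully online DPO (data generated from the current policy at every step): rollouts come from a snapshot of the policy that is re-synced only every s steps. Here is a hedged sketch of that loop, with policy, sample_pair, label_pair, and dpo_step as assumed helpers rather than any particular library's API.

```python
import copy

def semi_online_dpo(policy, prompts, sample_pair, label_pair, dpo_step,
                    total_steps: int, sync_every: int):
    """Iterative DPO where the rollout model is re-synced with the policy every `sync_every` steps."""
    generator = copy.deepcopy(policy)  # frozen snapshot used to generate responses
    for step in range(total_steps):
        if step % sync_every == 0:
            generator = copy.deepcopy(policy)  # periodic sync: the "semi-online" part
        prompt = prompts[step % len(prompts)]
        resp_a, resp_b = sample_pair(generator, prompt)        # rollouts from the snapshot
        chosen, rejected = label_pair(prompt, resp_a, resp_b)  # e.g. a reward model or verifier
        dpo_step(policy, prompt, chosen, rejected)             # one DPO gradient update
    return policy
```

Setting sync_every=1 recovers fully online DPO; never re-syncing after the initial snapshot is the offline case.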
@swarnaNLP
Swarnadeep Saha
2 months
RT @dair_ai: 3. J1. Introduces a novel training approach for LLMs to act as evaluators (LLM-as-a-Judge) by explicitly incentivizing thought…
0
1
0
@swarnaNLP
Swarnadeep Saha
3 months
RT @johnschulman2: For people who don't like Claude's behavior here (and I think it's totally valid to disagree with it), I encourage you t…
0
41
0
@swarnaNLP
Swarnadeep Saha
3 months
RT @rohanpaul_ai: Evaluation of LLMs is difficult due to judge models using limited reasoning and suffering from biases. This paper propos…
0
1
0
@swarnaNLP
Swarnadeep Saha
3 months
RT @TheTuringPost: The freshest research of the week. Our top 9: ▪️ Beyond 'Aha!' ▪️ J1: Incentivizing Thinking in LLM-as-a-Judge via Rein…
0
10
0
@swarnaNLP
Swarnadeep Saha
3 months
We're organizing the RAM 2 workshop at COLM 2025 (10 years after the first edition at NeurIPS 2015). Check out our Call for Papers on topics in Reasoning, Attention, and Memory.
@jaseweston
Jason Weston
3 months
🚨Announcing RAM 2 workshop @ COLM25 - call for papers🚨
- 10 years on, we present the sequel to the classic RAM🐏 (Reasoning, Attention, Memory) workshop that took place in 2015 at the cusp of major change in the area. Now in 2025 we reflect on what's happened and discuss the…
0
0
5
@swarnaNLP
Swarnadeep Saha
3 months
RT @chenxi_jw: Presenting new work: Thinking LLM-as-a-Judge via RL! It's been great fun working with @swarnaNLP, @jaseweston, @uralik1 and…
0
1
0
@swarnaNLP
Swarnadeep Saha
3 months
RT @NathanThinks: excellent work by @jaseweston & team - extending our "Generative Reward Models" work with RL (GRPO) to optimize LLM reasoni…
0
11
0
@swarnaNLP
Swarnadeep Saha
3 months
Check out our paper for more analysis and ablations, including:
- score distribution of Pointwise-J1 models
- different reward schemes
- different seed thinking prompts
- reward + thought lengths
Fun Fact: Even before the project started, we knew what we wanted to call it 😀
0
0
1
@swarnaNLP
Swarnadeep Saha
3 months
Next, Pointwise-J1 at 8B + 70B scales:
1️⃣ Mitigates position bias
2️⃣ Improves position-consistent accuracy
3️⃣ Reduces ties in pairwise judgments
Finally, test-time scaling of J1 leads to further improvements, for both Pairwise and Pointwise models, at both scales.
1
0
0
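"Position-consistent accuracy" in the tweet above is typically measured by querying a pairwise judge with both response orderings and crediting an example only when the two verdicts agree and match the gold preference; disagreements across orderings are exactly the position-bias and tie cases. An illustrative sketch, where the judge callable and the 'A'/'B' convention are assumptions:

```python
def position_consistent_accuracy(judge, examples) -> float:
    """Fraction of examples judged correctly under both response orderings.

    `judge(prompt, first, second)` is assumed to return 'A' (prefers first)
    or 'B' (prefers second); each example is (prompt, resp_a, resp_b, gold).
    """
    correct = 0
    for prompt, resp_a, resp_b, gold in examples:
        v1 = judge(prompt, resp_a, resp_b)               # original order
        v2 = judge(prompt, resp_b, resp_a)               # swapped order
        v2_unswapped = {"A": "B", "B": "A"}.get(v2, v2)  # map back to original labels
        if v1 == v2_unswapped == gold:                   # consistent AND correct
            correct += 1
    return correct / len(examples)
```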
@swarnaNLP
Swarnadeep Saha
3 months
We tested J1 on 5 benchmarks w/ verifiable + non-verifiable + multilingual instructions at 8B + 70B scales. First, Pairwise-J1 outperforms:
1️⃣ open + closed LLM judges
2️⃣ SOTA scalar + generative RMs
3️⃣ R1-distilled-Llama + o1-mini
4️⃣ a much larger R1 on non-verifiable tasks
1
0
1
@swarnaNLP
Swarnadeep Saha
3 months
๐Ÿง‘โ€๐Ÿณ J1 Recipe:. -- Generate synthetic preference pairs as training data for both verifiable+non-verifiable tasks. -- Train Pairwise-J1 using GRPO with verdict correctness+consistency rewards. -- Train Pointwise-J1 using GRPO with distant pairwise supervision + score-based rewards
1
0
2
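For the Pointwise-J1 step in the recipe above, the pairwise label acts only as distant supervision: each response is scored independently, and the reward checks that the known-better response receives the higher score. A loose sketch under that reading; the "Score: N" output format and the helper names are assumptions, not the paper's implementation.

```python
import re
from typing import Optional

def parse_score(judge_output: str) -> Optional[float]:
    """Pull a numeric score out of a judgment assumed to end with 'Score: <number>'."""
    m = re.search(r"Score:\s*(\d+(?:\.\d+)?)", judge_output)
    return float(m.group(1)) if m else None

def pointwise_pair_reward(output_for_chosen: str, output_for_rejected: str) -> float:
    """Reward 1.0 when the independently scored chosen response outranks the rejected one."""
    s_chosen = parse_score(output_for_chosen)
    s_rejected = parse_score(output_for_rejected)
    if s_chosen is None or s_rejected is None:
        return 0.0  # unparsable scores earn nothing
    return 1.0 if s_chosen > s_rejected else 0.0
```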
@swarnaNLP
Swarnadeep Saha
3 months
Excited to share that EvalPlanner is accepted to #ICML2025! To make meaningful progress in AI, we need strong evaluators, and specifically those that can reason. Stay tuned for more updates, as we continue to make progress in this space! 😀
@jaseweston
Jason Weston
6 months
💭🔎 Introducing EvalPlanner – a method to train a Thinking-LLM-as-a-Judge that learns to generate planning & reasoning CoTs for evaluation. Strong performance on RewardBench, RM-Bench, JudgeBench & FollowBenchEval. Paper 📄:
5
11
82
@swarnaNLP
Swarnadeep Saha
3 months
RT @SomnathBrc: How can we perfectly erase concepts from LLMs? Our method, Perfect Erasure Functions (PEF), erases concepts from LLM repre…
0
35
0
@swarnaNLP
Swarnadeep Saha
4 months
RT @tesatory: Ten years ago in 2015 we published a paper called End-to-End Memory Networks (. Looking back, this pa…
0
120
0
@swarnaNLP
Swarnadeep Saha
4 months
RT @jaseweston: 🚨Multi-Token Attention🚨 📝: Attention is critical for LLMs, but its weights are computed by single…
0
148
0