Nitish Joshi
@nitishjoshi23
Followers: 1K · Following: 5K · Media: 12 · Statuses: 196
Research Scientist @GoogleDeepmind. Prev - PhD from NYU, undergrad - IIT Bombay. Birding @nitishbird
California, USA
Joined June 2018
            
           New research with @AdtRaghunathan, Nicholas Carlini and Anthropic! We built ImpossibleBench to measure reward hacking in LLM coding agents 🤖, by making benchmark tasks impossible and seeing whether models game tests or follow specs. (1/9) 
          
                
11 replies · 60 reposts · 441 likes
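The recipe in the tweet is concrete enough to sketch. Below is a toy illustration of the idea as I read it; the function names, the add() task, and the contradictory-assertion mutation are all mine, not the ImpossibleBench harness.

```python
# Toy sketch of "make the task impossible, then see who still passes".
# Everything here (the add() task, make_impossible, judge) is illustrative,
# not the ImpossibleBench codebase. Requires pytest on PATH.
import subprocess
import tempfile
from pathlib import Path

ORIGINAL_TESTS = """\
from solution import add

def test_add():
    assert add(2, 2) == 4
"""

def make_impossible(tests: str) -> str:
    # Append an assertion contradicting an existing one: no
    # spec-following implementation can now satisfy the suite.
    return tests + "\ndef test_conflicting():\n    assert add(2, 2) == 5\n"

def judge(agent_solution: str) -> str:
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "solution.py").write_text(agent_solution)
        (workdir / "test_solution.py").write_text(make_impossible(ORIGINAL_TESTS))
        passed = subprocess.run(["pytest", "-q"], cwd=workdir,
                                capture_output=True).returncode == 0
    # A green run on an unsatisfiable suite can only mean the tests were
    # gamed (in the real benchmark the agent can, e.g., edit the test file).
    return "gamed tests" if passed else "followed spec"

print(judge("def add(a, b):\n    return a + b\n"))  # -> "followed spec"
```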
              
             Reward hacking means the model is making less effort than expected: it finds the answer long before its fake CoT is finished. TRACE uses this idea to detect hacking when CoT monitoring fails. Work led by @XinpengWang_ @nitishjoshi23 and @rico_angell👇 
Quoted tweet: ‼️Your model may be secretly exploiting your imperfect reward function without telling you in the CoT! How to detect such 'implicit' reward hacking if the model is hiding it?🧐 We introduce TRACE🕵, a method based on a simple premise: hacking is easier than solving the actual …
            
                
2 replies · 11 reposts · 129 likes
              
             Monitoring CoT may be insufficient to detect reward hacking. We develop a very simple method to detect such implicit reward hacking - truncate CoT, force predict answer, and use the AUC of the %CoT vs expected reward curve as a measure. Last project of my PhD! 
Quoted tweet: ‼️Your model may be secretly exploiting your imperfect reward function without telling you in the CoT! How to detect such 'implicit' reward hacking if the model is hiding it?🧐 We introduce TRACE🕵, a method based on a simple premise: hacking is easier than solving the actual …
            
                
0 replies · 4 reposts · 18 likes
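The measurement described above translates almost line-for-line into code. A minimal sketch, assuming hypothetical `force_answer` (a model call on a truncated CoT) and `reward` (a grader) callables; the paper's real interfaces will differ.

```python
# Minimal sketch of truncate-CoT -> force-answer -> AUC. force_answer and
# reward are hypothetical stand-ins for a model call and a grader.
import numpy as np

def trace_score(cot_tokens, force_answer, reward, n_points=11):
    fractions = np.linspace(0.0, 1.0, n_points)
    r = np.array([
        reward(force_answer(cot_tokens[: int(f * len(cot_tokens))]))
        for f in fractions
    ], dtype=float)
    # Trapezoidal area under the %CoT-vs-expected-reward curve. Near 1.0
    # means reward was already high with almost no CoT consumed -- the
    # "found the answer long before the fake CoT finished" signature.
    return float(np.sum((r[1:] + r[:-1]) / 2.0 * np.diff(fractions)))
```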
              
             Maybe don't use an LLM for _everything_? Last summer, I got to fiddle again with content diversity @AdobeResearch @Adobe and we showed that agentic pipelines that mix LLM-prompt steps with principled techniques can yield better, more personalized summaries 
          
                
1 reply · 14 reposts · 62 likes
              
             An advanced version of Gemini with Deep Think has officially achieved gold medal-level performance at the International Mathematical Olympiad. 🥇 It solved 5️⃣ out of 6️⃣ exceptionally difficult problems, involving algebra, combinatorics, geometry and number theory. Here’s how 🧵 
          
                
156 replies · 741 reposts · 4K likes
              
Honored to get the outstanding position paper award at @icmlconf :) Come attend my talk and poster tomorrow on human-centered considerations for a safer and better future of work. I will be recruiting PhD students at @stonybrooku @sbucompsc coming fall. Please get in touch. 
Quoted tweet: Very excited for a new #ICML2025 position paper accepted as oral w @mbodhisattwa & @TuhinChakr! 😎 What are the longitudinal harms of AI development? We use economic theories to highlight AI’s intertemporal impacts on livelihoods & its role in deepening labor-market inequality. 
            
                
8 replies · 13 reposts · 77 likes
              
             📢 today's scaling laws often don't work for predicting downstream task performance. For some pretraining setups, smooth and predictable scaling is the exception, not the rule. a quick read about scaling law fails: 📜  https://t.co/SDBD5uIdbp  🧵1/5👇 
          
                
4 replies · 39 reposts · 280 likes
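For context, the extrapolation that the thread says often fails downstream looks roughly like this: fit a saturating power law to small-scale runs, then read off a prediction at larger scale. The data points below are invented; pretraining loss usually follows such fits, downstream accuracy frequently does not.

```python
# Fit loss = a * N^(-b) + c on small models and extrapolate (made-up data).
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

n_params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])   # model sizes (invented)
loss = np.array([4.2, 3.8, 3.4, 3.1, 2.9])       # observed losses (invented)

(a, b, c), _ = curve_fit(power_law, n_params, loss, p0=(10.0, 0.1, 2.0))
# The thread's point: this works for loss far more often than it does for
# downstream task metrics, which can jump or plateau unpredictably.
print(f"extrapolated loss at 1e10 params: {power_law(1e10, a, b, c):.2f}")
```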
              
             Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas. 
          
                
12 replies · 191 reposts · 633 likes
              
             What causes jailbreaks to transfer between LLMs? We find that jailbreak strength and model representation similarity predict transferability, and we can engineer model similarity to improve transfer. Details in🧵 
          
                
3 replies · 13 reposts · 54 likes
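One standard way to operationalize "model representation similarity" is linear CKA over hidden states collected on shared prompts. I don't know that this is the paper's exact metric, so treat it as a stand-in:

```python
# Linear CKA (Kornblith et al., 2019) between two models' activations on
# the same prompts -- a common similarity metric, used here as a stand-in
# for whatever the paper actually computes.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """X: (n, d1), Y: (n, d2) hidden states for the same n prompts."""
    X = X - X.mean(axis=0)  # center each feature
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return float(cross / (np.linalg.norm(X.T @ X, "fro")
                          * np.linalg.norm(Y.T @ Y, "fro")))
```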
              
             LLMs won’t tell you how to make fake IDs—but will reveal the layouts/materials of IDs and make realistic photos if asked separately. 💥Such decomposition attacks reach 87% success across QA, text-to-image, and agent settings! 🛡️Our monitoring method defends with 93% success! 🧵 
          
                
2 replies · 12 reposts · 28 likes
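The defense hinted at here presumably has to look across requests rather than at each one alone. A minimal sketch of that session-level idea, with a hypothetical `is_harmful` classifier; this is my framing, not the paper's method:

```python
# Session-level monitoring sketch: a single benign-looking query can
# complete a harmful plan, so score the combined intent of the whole
# session. is_harmful is a hypothetical classifier callable.
from typing import Callable, Sequence

def monitor(session: Sequence[str], new_query: str,
            is_harmful: Callable[[str], bool]) -> bool:
    combined = " ".join([*session, new_query])
    # Flag if the conversation as a whole reads as a harmful request,
    # even when each individual query passes in isolation.
    return is_harmful(combined)
```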
              
             Nice to see folks studying biases in RLHF / preference tuning all the way down to the datasets. I think many of the biases are mostly irreducible human biases that can't be solved within current training regimes, just mitigated. 
Quoted tweet: Ever wondered what makes language models generate overly verbose, vague, or sycophantic responses? Our new paper investigates these and other idiosyncratic biases in preference models, and presents a simple post-training recipe to mitigate them! Thread below 🧵↓ 
            
                
0 replies · 9 reposts · 72 likes
              
             Ever wondered what makes language models generate overly verbose, vague, or sycophantic responses? Our new paper investigates these and other idiosyncratic biases in preference models, and presents a simple post-training recipe to mitigate them! Thread below 🧵↓ 
          
                
1 reply · 23 reposts · 76 likes
              
             What does it mean for #LLM output to be novel? In work w/ @jcyhc_ai, @JanePan_, @valeriechen_, @hhexiy we argue it needs to be both original and high quality. While prompting tricks trade one for the other, better models (scaling/post-training) can shift the novelty frontier 🧵 
          
                
2 replies · 29 reposts · 83 likes
              
Reasoning models overthink, generating multiple answers during reasoning. Is it because they can’t tell which ones are right? No! We find that while reasoning models encode strong correctness signals during chain-of-thought, they may not use them optimally. 🧵 below 
          
                
10 replies · 80 reposts · 388 likes
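A common way to test whether such a "correctness signal" is encoded is a linear probe on hidden states taken where the model proposes an answer mid-CoT. The sketch below uses random stand-in data, not the paper's setup:

```python
# Linear probe for an encoded correctness signal (random stand-in data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden = rng.normal(size=(1000, 256))       # activations at candidate answers
correct = rng.integers(0, 2, size=1000)     # was that candidate right?

X_tr, X_te, y_tr, y_te = train_test_split(hidden, correct, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# On real activations, held-out accuracy well above chance would mean
# correctness is linearly decodable even while the model keeps generating
# alternative answers.
print(f"held-out probe accuracy: {probe.score(X_te, y_te):.2f}")
```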
              
Excited to release R2E-Gym
- 🔥 8.1K executable environments using synthetic data
- 🧠 Hybrid verifiers for enhanced inference-time scaling
- 📈 51% success rate on SWE-Bench Verified
- 🤗 Open Source Data + Models + Trajectories
1/ 
          
                
16 replies · 63 reposts · 260 likes
              
             First set of Llama 4!! 
Quoted tweet: Introducing our first set of Llama 4 models! We’ve been hard at work doing a complete re-design of the Llama series. I’m so excited to share it with the world today and mark another major milestone for the Llama herd as we release the *first* open source models in the Llama 4 …
            
                
0 replies · 4 reposts · 22 likes
              
             My first paper @AnthropicAI is out! We show that Chains-of-Thought often don’t reflect models’ true reasoning—posing challenges for safety monitoring. It’s been an incredible 6 months pushing the frontier toward safe AGI with brilliant colleagues. Huge thanks to the team! 🙏 
Quoted tweet: New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues. 
            
                
32 replies · 88 reposts · 1K likes
              
             2018: Saliency maps give plausible interpretations of random weights, triggering skepticism and catalyzing the mechinterp cultural movement, which now advocates for SAEs. 2025: SAEs give plausible interpretations of random weights, triggering skepticism and ... 
          
                
9 replies · 23 reposts · 301 likes
              
When benchmarks talk, do LLMs listen? Our new paper shows that evaluating code LLMs with interactive feedback significantly affects model performance compared to standard static benchmarks! Work w/ @RyanShar01, @jacob_pfau, @atalwalkar, @hhexiy, and @valeriechen_! [1/6] 
          
                
2 replies · 17 reposts · 54 likes
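For the static-vs-interactive contrast, here is a sketch of the interactive loop with hypothetical `generate` and `run_tests` interfaces (not the paper's harness): the model sees the previous turn's test output and tries again.

```python
# Interactive evaluation loop sketch: output from failed tests is fed back
# to the model. generate and run_tests are hypothetical interfaces.
def interactive_eval(task: str, generate, run_tests, max_turns: int = 3) -> bool:
    feedback = ""
    for _ in range(max_turns):
        code = generate(task, feedback)   # model sees prior test output
        ok, feedback = run_tests(code)    # (passed?, captured test output)
        if ok:
            return True
    return False
# A static benchmark is the max_turns=1 special case with no feedback.
```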
              
Remember this study about how LLM-generated research ideas were rated to be more novel than expert-written ones? We find a large fraction of such LLM-generated proposals (≥ 24%) to be skillfully plagiarized, bypassing inbuilt plagiarism checks and unsuspecting experts. A 🧵 
Quoted tweet: Automating AI research is exciting! But can LLMs actually produce novel, expert-level research ideas? After a year-long study, we obtained the first statistically significant conclusion: LLM-generated ideas are more novel than ideas written by expert human researchers. 
            
                
32 replies · 252 reposts · 2K likes
              