Nitish Joshi
@nitishjoshi23
Followers: 1K · Following: 5K · Media: 12 · Statuses: 196
Research Scientist @GoogleDeepmind. Prev - PhD from NYU, undergrad - IIT Bombay. Birding @nitishbird
California, USA
Joined June 2018
            
           New research with @AdtRaghunathan, Nicholas Carlini and Anthropic! We built ImpossibleBench to measure reward hacking in LLM coding agents 🤖, by making benchmark tasks impossible and seeing whether models game tests or follow specs. (1/9) 
          
                
11 replies · 60 reposts · 441 likes
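The recipe in the tweet is concrete enough to sketch. Below is a toy illustration of the idea as I read it; the function names, the add() task, and the contradictory-assertion mutation are all mine, not the ImpossibleBench harness.

```python
# Toy sketch of "make the task impossible, then see who still passes".
# Everything here (the add() task, make_impossible, judge) is illustrative,
# not the ImpossibleBench codebase. Requires pytest on PATH.
import subprocess
import tempfile
from pathlib import Path

ORIGINAL_TESTS = """\
from solution import add

def test_add():
    assert add(2, 2) == 4
"""

def make_impossible(tests: str) -> str:
    # Append an assertion contradicting an existing one: no
    # spec-following implementation can now satisfy the suite.
    return tests + "\ndef test_conflicting():\n    assert add(2, 2) == 5\n"

def judge(agent_solution: str) -> str:
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "solution.py").write_text(agent_solution)
        (workdir / "test_solution.py").write_text(make_impossible(ORIGINAL_TESTS))
        passed = subprocess.run(["pytest", "-q"], cwd=workdir,
                                capture_output=True).returncode == 0
    # A green run on an unsatisfiable suite can only mean the tests were
    # gamed (in the real benchmark the agent can, e.g., edit the test file).
    return "gamed tests" if passed else "followed spec"

print(judge("def add(a, b):\n    return a + b\n"))  # -> "followed spec"
```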
              
             Reward hacking means the model is making less effort than expected: it finds the answer long before its fake CoT is finished. TRACE uses this idea to detect hacking when CoT monitoring fails. Work led by @XinpengWang_ @nitishjoshi23 and @rico_angell👇 
Quoted tweet: ‼️Your model may be secretly exploiting your imperfect reward function without telling you in the CoT! How to detect such 'implicit' reward hacking if the model is hiding it?🧐 We introduce TRACE🕵, a method based on a simple premise: hacking is easier than solving the actual …
            
                
2 replies · 11 reposts · 129 likes
              
             Monitoring CoT may be insufficient to detect reward hacking. We develop a very simple method to detect such implicit reward hacking - truncate CoT, force predict answer, and use the AUC of the %CoT vs expected reward curve as a measure. Last project of my PhD! 
Quoted tweet: ‼️Your model may be secretly exploiting your imperfect reward function without telling you in the CoT! How to detect such 'implicit' reward hacking if the model is hiding it?🧐 We introduce TRACE🕵, a method based on a simple premise: hacking is easier than solving the actual …
            
                
0 replies · 4 reposts · 18 likes
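The measurement described above translates almost line-for-line into code. A minimal sketch, assuming hypothetical `force_answer` (a model call on a truncated CoT) and `reward` (a grader) callables; the paper's real interfaces will differ.

```python
# Minimal sketch of truncate-CoT -> force-answer -> AUC. force_answer and
# reward are hypothetical stand-ins for a model call and a grader.
import numpy as np

def trace_score(cot_tokens, force_answer, reward, n_points=11):
    fractions = np.linspace(0.0, 1.0, n_points)
    r = np.array([
        reward(force_answer(cot_tokens[: int(f * len(cot_tokens))]))
        for f in fractions
    ], dtype=float)
    # Trapezoidal area under the %CoT-vs-expected-reward curve. Near 1.0
    # means reward was already high with almost no CoT consumed -- the
    # "found the answer long before the fake CoT finished" signature.
    return float(np.sum((r[1:] + r[:-1]) / 2.0 * np.diff(fractions)))
```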
              
             Maybe don't use an LLM for _everything_? Last summer, I got to fiddle again with content diversity @AdobeResearch @Adobe and we showed that agentic pipelines that mix LLM-prompt steps with principled techniques can yield better, more personalized summaries 
          
                
1 reply · 14 reposts · 62 likes
              
             An advanced version of Gemini with Deep Think has officially achieved gold medal-level performance at the International Mathematical Olympiad. 🥇 It solved 5️⃣ out of 6️⃣ exceptionally difficult problems, involving algebra, combinatorics, geometry and number theory. Here’s how 🧵 
          
                
156 replies · 741 reposts · 4K likes
              
Honored to get the outstanding position paper award at @icmlconf :) Come attend my talk and poster tomorrow on human-centered considerations for a safer and better future of work. I will be recruiting PhD students at @stonybrooku @sbucompsc coming fall. Please get in touch. 
Quoted tweet: Very excited for a new #ICML2025 position paper accepted as oral w @mbodhisattwa & @TuhinChakr! 😎 What are the longitudinal harms of AI development? We use economic theories to highlight AI’s intertemporal impacts on livelihoods & its role in deepening labor-market inequality. 
            
                
8 replies · 13 reposts · 77 likes
              
             📢 today's scaling laws often don't work for predicting downstream task performance. For some pretraining setups, smooth and predictable scaling is the exception, not the rule. a quick read about scaling law fails: 📜  https://t.co/SDBD5uIdbp  🧵1/5👇 
          
                
4 replies · 39 reposts · 280 likes
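For context, the extrapolation that the thread says often fails downstream looks roughly like this: fit a saturating power law to small-scale runs, then read off a prediction at larger scale. The data points below are invented; pretraining loss usually follows such fits, downstream accuracy frequently does not.

```python
# Fit loss = a * N^(-b) + c on small models and extrapolate (made-up data).
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

n_params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])   # model sizes (invented)
loss = np.array([4.2, 3.8, 3.4, 3.1, 2.9])       # observed losses (invented)

(a, b, c), _ = curve_fit(power_law, n_params, loss, p0=(10.0, 0.1, 2.0))
# The thread's point: this works for loss far more often than it does for
# downstream task metrics, which can jump or plateau unpredictably.
print(f"extrapolated loss at 1e10 params: {power_law(1e10, a, b, c):.2f}")
```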
              
             Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas. 
          
                
12 replies · 191 reposts · 633 likes
              
             What causes jailbreaks to transfer between LLMs? We find that jailbreak strength and model representation similarity predict transferability, and we can engineer model similarity to improve transfer. Details in🧵 
          
                
3 replies · 13 reposts · 54 likes
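One standard way to operationalize "model representation similarity" is linear CKA over hidden states collected on shared prompts. I don't know that this is the paper's exact metric, so treat it as a stand-in:

```python
# Linear CKA (Kornblith et al., 2019) between two models' activations on
# the same prompts -- a common similarity metric, used here as a stand-in
# for whatever the paper actually computes.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """X: (n, d1), Y: (n, d2) hidden states for the same n prompts."""
    X = X - X.mean(axis=0)  # center each feature
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return float(cross / (np.linalg.norm(X.T @ X, "fro")
                          * np.linalg.norm(Y.T @ Y, "fro")))
```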
              
             LLMs won’t tell you how to make fake IDs—but will reveal the layouts/materials of IDs and make realistic photos if asked separately. 💥Such decomposition attacks reach 87% success across QA, text-to-image, and agent settings! 🛡️Our monitoring method defends with 93% success! 🧵 
          
                
2 replies · 12 reposts · 28 likes
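The defense hinted at here presumably has to look across requests rather than at each one alone. A minimal sketch of that session-level idea, with a hypothetical `is_harmful` classifier; this is my framing, not the paper's method:

```python
# Session-level monitoring sketch: a single benign-looking query can
# complete a harmful plan, so score the combined intent of the whole
# session. is_harmful is a hypothetical classifier callable.
from typing import Callable, Sequence

def monitor(session: Sequence[str], new_query: str,
            is_harmful: Callable[[str], bool]) -> bool:
    combined = " ".join([*session, new_query])
    # Flag if the conversation as a whole reads as a harmful request,
    # even when each individual query passes in isolation.
    return is_harmful(combined)
```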
              
             Nice to see folks studying biases in RLHF / preference tuning all the way down to the datasets. I think many of the biases are mostly irreducible human biases that can't be solved within current training regimes, just mitigated. 
Quoted tweet: Ever wondered what makes language models generate overly verbose, vague, or sycophantic responses? Our new paper investigates these and other idiosyncratic biases in preference models, and presents a simple post-training recipe to mitigate them! Thread below 🧵↓ 
            
                
0 replies · 9 reposts · 72 likes
              
             Ever wondered what makes language models generate overly verbose, vague, or sycophantic responses? Our new paper investigates these and other idiosyncratic biases in preference models, and presents a simple post-training recipe to mitigate them! Thread below 🧵↓ 
          
                
1 reply · 23 reposts · 76 likes
              
             What does it mean for #LLM output to be novel? In work w/ @jcyhc_ai, @JanePan_, @valeriechen_, @hhexiy we argue it needs to be both original and high quality. While prompting tricks trade one for the other, better models (scaling/post-training) can shift the novelty frontier 🧵 
          
                
2 replies · 29 reposts · 83 likes
              
Reasoning models overthink, generating multiple answers during reasoning. Is it because they can’t tell which ones are right? No! We find that while reasoning models encode strong correctness signals during chain-of-thought, they may not use them optimally. 🧵 below 
          
                
10 replies · 80 reposts · 388 likes
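A common way to test whether such a "correctness signal" is encoded is a linear probe on hidden states taken where the model proposes an answer mid-CoT. The sketch below uses random stand-in data, not the paper's setup:

```python
# Linear probe for an encoded correctness signal (random stand-in data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden = rng.normal(size=(1000, 256))       # activations at candidate answers
correct = rng.integers(0, 2, size=1000)     # was that candidate right?

X_tr, X_te, y_tr, y_te = train_test_split(hidden, correct, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# On real activations, held-out accuracy well above chance would mean
# correctness is linearly decodable even while the model keeps generating
# alternative answers.
print(f"held-out probe accuracy: {probe.score(X_te, y_te):.2f}")
```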
              
Excited to release R2E-Gym
- 🔥 8.1K executable environments using synthetic data
- 🧠 Hybrid verifiers for enhanced inference-time scaling
- 📈 51% success rate on SWE-Bench Verified
- 🤗 Open Source Data + Models + Trajectories
1/ 
          
                
16 replies · 63 reposts · 260 likes
              
             First set of Llama 4!! 
Quoted tweet: Introducing our first set of Llama 4 models! We’ve been hard at work doing a complete re-design of the Llama series. I’m so excited to share it with the world today and mark another major milestone for the Llama herd as we release the *first* open source models in the Llama 4 …
            
                
0 replies · 4 reposts · 22 likes
              
             My first paper @AnthropicAI is out! We show that Chains-of-Thought often don’t reflect models’ true reasoning—posing challenges for safety monitoring. It’s been an incredible 6 months pushing the frontier toward safe AGI with brilliant colleagues. Huge thanks to the team! 🙏 
Quoted tweet: New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues. 
            
                
32 replies · 88 reposts · 1K likes
              
             2018: Saliency maps give plausible interpretations of random weights, triggering skepticism and catalyzing the mechinterp cultural movement, which now advocates for SAEs. 2025: SAEs give plausible interpretations of random weights, triggering skepticism and ... 
          
                
9 replies · 23 reposts · 301 likes
              
When benchmarks talk, do LLMs listen? Our new paper shows that evaluating code LLMs with interactive feedback significantly affects model performance compared to standard static benchmarks! Work w/ @RyanShar01, @jacob_pfau, @atalwalkar, @hhexiy, and @valeriechen_! [1/6] 
          
                
2 replies · 17 reposts · 54 likes
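For the static-vs-interactive contrast, here is a sketch of the interactive loop with hypothetical `generate` and `run_tests` interfaces (not the paper's harness): the model sees the previous turn's test output and tries again.

```python
# Interactive evaluation loop sketch: output from failed tests is fed back
# to the model. generate and run_tests are hypothetical interfaces.
def interactive_eval(task: str, generate, run_tests, max_turns: int = 3) -> bool:
    feedback = ""
    for _ in range(max_turns):
        code = generate(task, feedback)   # model sees prior test output
        ok, feedback = run_tests(code)    # (passed?, captured test output)
        if ok:
            return True
    return False
# A static benchmark is the max_turns=1 special case with no feedback.
```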
              
Remember this study about how LLM-generated research ideas were rated to be more novel than expert-written ones? We find a large fraction of such LLM-generated proposals (≥ 24%) to be skillfully plagiarized, bypassing inbuilt plagiarism checks and unsuspecting experts. A 🧵 
Quoted tweet: Automating AI research is exciting! But can LLMs actually produce novel, expert-level research ideas? After a year-long study, we obtained the first statistically significant conclusion: LLM-generated ideas are more novel than ideas written by expert human researchers. 
            
                
32 replies · 252 reposts · 2K likes
              