Hao Peng
@haopeng_uiuc
Assistant Professor @ UIUC CS | PhD from UW | Formerly @allen_ai, @GoogleDeepMind, @Google
Joined October 2020
Followers: 638 · Following: 44 · Media: 0 · Statuses: 41
            
A lot is said about LLMs’ counterfactual reasoning, but do they truly possess the cognitive skills it requires? Introducing Executable Counterfactuals, a code framework that (1) shows frontier models lack these skills and (2) offers a testbed for improvement via Reinforcement Learning
          
                
Replies: 3 · Reposts: 21 · Likes: 55
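A minimal sketch of what a code-based counterfactual query can look like; this toy example and its names are my own assumptions, not the framework's actual tasks:

def pipeline(x: int) -> int:
    # factual program: the "world" the model observes
    y = x + 3
    return y * 2

factual = pipeline(5)          # observed outcome: 16
# Counterfactual query: what would the outcome have been had x been 10?
counterfactual = pipeline(10)  # 26
print(factual, counterfactual)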
              
🧩New blog: From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones. Do LLMs learn new skills through RL, or just activate existing patterns? Answer: RL teaches the powerful meta-skill of composition when properly incentivized. 🔗:  https://t.co/4Ud8qsYrOT
          
          
                
Replies: 13 · Reposts: 91 · Likes: 429
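As a toy illustration of the f(g(x)) framing above (placeholder skills, not the blog's actual tasks), composing two existing skills yields a new one:

def g(text: str) -> str:       # "old skill" 1: normalize
    return text.strip().lower()

def f(text: str) -> int:       # "old skill" 2: count words
    return len(text.split())

def f_of_g(text: str) -> int:  # composed "new skill" f(g(x))
    return f(g(text))

print(f_of_g("  Hello World  "))  # 2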
              
So many works talk about entropy, but what is the **mechanism** of entropy in RL for LLMs? 🤔 Our work gives a principled understanding, as well as two tricks that keep entropy **controlled** 🧵
          
                
Replies: 4 · Reposts: 18 · Likes: 131
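For context, a hedged sketch (assuming PyTorch) of the quantity in question and one standard way to keep it controlled, an entropy bonus added to the policy loss; the thread's two specific tricks are not reproduced here:

import torch
import torch.nn.functional as F

def policy_entropy(logits: torch.Tensor) -> torch.Tensor:
    # per-position entropy of the next-token distribution, shape (batch, seq)
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)

logits = torch.randn(2, 8, 32000, requires_grad=True)  # placeholder policy logits
pg_loss = torch.tensor(0.0)                            # stand-in policy-gradient loss
loss = pg_loss - 0.01 * policy_entropy(logits).mean()  # bonus discourages entropy collapse
loss.backward()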
              
Can entropy minimization alone improve LLM performance? And how far can it go without any labeled data? This work answers both: yes, and surprisingly far 🐮 At inference, EM can beat GPT-4o, Claude 3 Opus, & Gemini 1.5 Pro on challenging scientific coding w/o any data/model update
          
                
Replies: 12 · Reposts: 64 · Likes: 406
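One hedged reading of "EM at inference", sketched below: score candidate generations by their own predictive entropy and keep the most confident one. This is an assumption about the mechanism, not the paper's exact recipe; assuming PyTorch.

import torch
import torch.nn.functional as F

def avg_entropy(logits: torch.Tensor) -> float:
    # mean per-token entropy of one candidate generation, logits: (seq, vocab)
    log_p = F.log_softmax(logits, dim=-1)
    return float(-(log_p.exp() * log_p).sum(dim=-1).mean())

# candidates: (text, logits) pairs from any sampler (hypothetical placeholder data)
candidates = [("answer A", torch.randn(12, 32000)),
              ("answer B", torch.randn(12, 32000))]
best = min(candidates, key=lambda c: avg_entropy(c[1]))  # keep the most confident one
print(best[0])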
              
             🚨 Paper Alert: “RL Finetunes Small Subnetworks in Large Language Models” From DeepSeek V3 Base to DeepSeek R1 Zero, a whopping 86% of parameters were NOT updated during RL training 😮😮 And this isn’t a one-off. The pattern holds across RL algorithms and models. 🧵A Deep Dive 
          
                
Replies: 17 · Reposts: 131 · Likes: 886
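A back-of-the-envelope sketch of how a figure like that 86% can be measured: diff a base checkpoint against its RL-finetuned counterpart and count parameters that never moved (assuming PyTorch state dicts; the tolerance is a free choice, not the paper's):

import torch

def frac_unchanged(base_sd: dict, tuned_sd: dict, atol: float = 0.0) -> float:
    unchanged, total = 0, 0
    for name, p_base in base_sd.items():
        same = torch.isclose(p_base, tuned_sd[name], atol=atol, rtol=0.0)
        unchanged += int(same.sum())
        total += same.numel()
    return unchanged / total

# frac_unchanged(base_model.state_dict(), rl_model.state_dict())  # ~0.86 per the tweet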
              
             💡We find that models “think” 💭 in English (or in general, their dominant language) when processing distinct non-English or even non-language data types 🤯 like texts in other languages, arithmetic expressions, code, visual inputs, & audio inputs ‼️ 🧵⬇️  https://t.co/IfatE7GL1q 
          
          
                
Replies: 8 · Reposts: 66 · Likes: 301
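One common way to probe this kind of claim is a logit-lens pass: decode intermediate hidden states into vocabulary space and inspect which language dominates layer by layer. A sketch below uses a HuggingFace GPT-2-style model as a stand-in; this is not necessarily the paper's method.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Le chat est sur le", return_tensors="pt")  # non-English prompt
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    # project each layer's final position through the LM head and decode the top token
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    print(layer, repr(tok.decode(logits.argmax(-1))))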
              
             🚨 I’m on the job market this year! 🚨 I’m completing my @uwcse Ph.D. (2025), where I identify and tackle key LLM limitations like hallucinations by developing new models—Retrieval-Augmented LMs—to build more reliable real-world AI systems. Learn more in the thread! 🧵 
          
                
Replies: 26 · Reposts: 119 · Likes: 820
              
I'm on the academic job market! I develop autonomous systems for programming, research-level question answering, finding security vulnerabilities, and other useful + challenging tasks. I do this by building frontier-pushing benchmarks and agents that do well on them. See you at NeurIPS!
          
                
Replies: 9 · Reposts: 39 · Likes: 230
              
Wanna train PRMs, but process labels (annotated manually or automatically) sound too expensive😖? Introducing Implicit PRM🚀: get process rewards for free by training an ORM on cheaper response-level data, with a simple parameterization and no additional cost💰!
          
                
Replies: 3 · Reposts: 48 · Likes: 212
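A hedged sketch of how I read the parameterization: treat the outcome reward as a log-likelihood ratio against a reference model, so per-step (process) rewards fall out of prefix log-probs for free. Beta and the exact form below are assumptions, not the paper's code; assuming PyTorch.

import torch

def implicit_process_rewards(logp_policy: torch.Tensor,
                             logp_ref: torch.Tensor,
                             beta: float = 0.05) -> torch.Tensor:
    # logp_*: per-token log-probs of the same response under the trained ORM
    # (policy) and the reference model, shape (seq_len,)
    per_token_ratio = logp_policy - logp_ref
    prefix_ratio = torch.cumsum(per_token_ratio, dim=0)  # log-ratio of each prefix
    return beta * prefix_ratio                            # implicit reward at each step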
              
Curious whether video generation models (like #SORA) qualify as world models? We conduct a systematic study to answer this question by investigating whether a video gen model is able to learn physical laws. There are three key messages to take home: 1⃣The model generalises
          
                
Replies: 41 · Reposts: 210 · Likes: 1K
              
What if LLMs could cite the pre-training source(s) supporting their parametric knowledge? Wouldn't this dramatically improve verifiability and trustworthiness? We aimed to answer this during my internship @allen_ai Paper:  https://t.co/ZyXz99chAd  To be presented at #COLM Thread👇👇
          
            
arxiv.org: Large language models (LLMs) learn a vast amount of knowledge during pretraining, but they are often oblivious to the source(s) of such knowledge. We investigate the problem of intrinsic source...
            
                
Replies: 3 · Reposts: 15 · Likes: 108
              
             🎯 Introducing SOLO, a single Transformer architecture for unified vision-language modeling. SOLO accepts both raw image patches (in pixels) and texts as inputs, without using a separate pre-trained vision encoder. Paper:  https://t.co/7fGF8RlSSw  Code:  https://t.co/zjXHRV9ckB 
          
          
                
Replies: 14 · Reposts: 53 · Likes: 236
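A minimal sketch of the encoder-free idea: linearly project raw image patches and feed them into the same sequence as text embeddings. Dimensions and layers below are illustrative, not SOLO's actual configuration; assuming PyTorch.

import torch
import torch.nn as nn

d_model, patch = 1024, 32
patch_proj = nn.Linear(3 * patch * patch, d_model)     # raw pixels -> token embeddings
text_embed = nn.Embedding(32000, d_model)

pixels = torch.rand(1, 3, 256, 256)
patches = pixels.unfold(2, patch, patch).unfold(3, patch, patch)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)

img_tokens = patch_proj(patches)                        # (1, 64, d_model)
txt_tokens = text_embed(torch.tensor([[1, 5, 42]]))     # (1, 3, d_model)
sequence = torch.cat([img_tokens, txt_tokens], dim=1)   # one Transformer sees both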
              
             Language models excel at undergraduate exams, but how do they fare in research? SciCode challenges models with real research coding problems. Even the best models solve less than 5%. Very proud of @MinyangTian1 and @luyu_gao for leading the charge! 
           SciCode is our new benchmark that challenges LMs to code solutions for scientific problems from advanced papers. The challenges were crafted by PhDs; ~10% of our benchmark is based on Nobel-winning research. GPT-4 and Sonnet 3.5 get <5% ACC.  https://t.co/OtNadtSICO  🧵 1/6 
            
                
Replies: 0 · Reposts: 0 · Likes: 11
              
I'm joining UIUC @UofIllinois this fall as an Assistant Professor in the iSchool, with an affiliation in Computer Science! My research passion lies at the intersection of NLP and the medical domain. I'm recruiting students for 2025! Check more info:  https://t.co/pRTwWR5bFd.
          
          
                
Replies: 24 · Reposts: 29 · Likes: 368
              
From Claude 100K to Gemini 10M, we are in the era of long-context language models. Why and how can a language model utilize information at any location within a long context? We discover retrieval heads, a special type of attention head responsible for long-context factuality
          
                
Replies: 22 · Reposts: 169 · Likes: 846
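A hedged sketch of one way to score a candidate retrieval head: measure how much of its attention, at the answer-generation positions, lands on the needle span hidden in a long context. The paper's actual detection metric may differ; assuming PyTorch.

import torch

def retrieval_score(attn: torch.Tensor, needle: slice, answer: slice) -> float:
    # attn: (seq_len, seq_len) attention weights of one head on one example
    mass_on_needle = attn[answer, needle].sum()
    total_mass = attn[answer, :].sum()
    return float(mass_on_needle / total_mass)

# attn = torch.rand(4096, 4096).softmax(-1)                    # placeholder weights
# print(retrieval_score(attn, slice(1000, 1020), slice(4000, 4096)))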
              
             Want to train an aligned LM in a new language 🌏 but don’t have preference data for training the reward model (RM)? 💡 Just use a RM for another language: it often works well, sometimes even BETTER than if you had a RM in your target language! 🤯  https://t.co/Rlw3U5B4Ih 
          
          
                
Replies: 3 · Reposts: 37 · Likes: 152
              
             SWE-agent is our new system for autonomously solving issues in GitHub repos. It gets similar accuracy to Devin on SWE-bench, takes 93 seconds on avg + it's open source! We designed a new agent-computer interface to make it easy for GPT-4 to edit+run code  https://t.co/CTzMxDiouH 
          
          
                
Replies: 63 · Reposts: 417 · Likes: 2K
              
             Very proud of Eurus. A huge shoutout to @lifan__yuan and @charlesfornlp for leading this! 
           Introducing 🚀Eurus, a suite of state-of-the-art LLM reasoning generalists powered by a new member of Ultra-Series, UltraInteract🎉! Particularly, Eurus-70B beats GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 12 tests (mostly OOD) covering five tasks! 
            
                
Replies: 0 · Reposts: 2 · Likes: 20
              
             Very proud of Eurus. A huge shoutout to @lifan__yuan and @charlesfornlp for leading this! 
           This is a joint work with @charlesfornlp, @wanghanbin95, @stingning, @xingyaow_, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, and advisors Bowen Zhou, @haopeng_nlp, @zibuyu9, Maosong Sun. cc @TsinghuaNLP @uiuc_nlp
            
          
                
Replies: 0 · Reposts: 0 · Likes: 5
              
Frontier models all have at least 100K context length; Gemini 1.5 even has a 1M context. What about research and open source? Introducing Long Context Data Engineering, a data-driven method achieving the first 128K-context open-source model matching GPT-4-level Needle-in-a-Haystack performance
          
                
Replies: 8 · Reposts: 66 · Likes: 459
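For reference, a toy Needle-in-a-Haystack style probe like the one this line of work evaluates against: hide a fact at a random depth in long filler text and check whether the model retrieves it. The "generate" call below is a hypothetical stand-in for any long-context model API.

import random

def make_probe(target_chars: int = 400_000) -> tuple[str, str]:
    needle = "The secret passphrase is BLUE-ELEPHANT-42."
    filler = "Lorem ipsum dolor sit amet. " * (target_chars // 28)
    pos = random.randint(0, len(filler))
    haystack = filler[:pos] + " " + needle + " " + filler[pos:]
    return haystack, "What is the secret passphrase?"

# haystack, question = make_probe()
# answer = generate(haystack + "\n" + question)   # hypothetical model call
# print("BLUE-ELEPHANT-42" in answer)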