Wenqi Shi
@WenqiShi0106
Followers: 296 · Following: 970 · Media: 11 · Statuses: 149
Assistant Professor @UTSWMedCenter | Ph.D. @GeorgiaTech | LLMs | Agent | RAG | EHRs | Clinical Decision Support | Pediatric Healthcare
Dallas, TX
Joined November 2023
            
How can we systematically enhance LLMs for complex medical coding tasks? Introducing MedAgentGym, an interactive gym-style platform designed specifically for training LLM agents in coding-based medical reasoning! Comprehensive code-based medical reasoning…
          
                
Replies: 9 · Reposts: 20 · Likes: 132
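The tweet describes a gym-style interaction loop, so a minimal sketch helps make that concrete: an environment that executes agent-written code and returns an observation and a reward. All names here (MedCodingEnv, llm_propose_code, the grader) are hypothetical stand-ins, not MedAgentGym's actual API.

```python
import subprocess
import sys
import tempfile

class MedCodingEnv:
    """Toy gym-style environment: the agent submits Python code; the env runs
    it and returns (observation, reward, done). Sandboxing/timeouts omitted."""

    def __init__(self, task_prompt, grade):
        self.task_prompt = task_prompt
        self.grade = grade  # callable: stdout -> reward in [0, 1]

    def reset(self):
        return self.task_prompt

    def step(self, code):
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
        proc = subprocess.run([sys.executable, f.name], capture_output=True, text=True)
        obs = proc.stdout + proc.stderr
        reward = self.grade(proc.stdout) if proc.returncode == 0 else 0.0
        return obs, reward, reward == 1.0

def llm_propose_code(task, feedback):
    # Stand-in for an LLM call; a trained agent would condition on feedback.
    return 'print("mean_hr=72.0")'

env = MedCodingEnv("Compute the mean heart rate from vitals.csv and print it.",
                   grade=lambda out: 1.0 if "mean_hr" in out else 0.0)  # placeholder grader
task, feedback = env.reset(), ""
for _ in range(5):  # interaction budget
    obs, reward, done = env.step(llm_propose_code(task, feedback))
    feedback = obs  # execution output becomes the next turn's feedback
    if done:
        break
print("final reward:", reward)
```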
              
New Google paper trains LLM judges to use small bits of code alongside their reasoning, so their decisions become precise: judging stops being guesswork and becomes checkable. Text-only judges often miscount, miss structure rules, or accept shaky logic that a simple program would…
          
                
Replies: 3 · Reposts: 17 · Likes: 161
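A minimal sketch of what "code alongside reasoning" buys a judge. In the paper's setting the judge model would write the check itself; the fixed rubric and the eval-based harness below are only my illustration.

```python
def judge(response: str) -> bool:
    """Toy rubric: the response must contain exactly three comma-separated
    items. A text-only judge can miscount; an executed check cannot."""
    # In the paper's setting an LLM would WRITE this check; here it is fixed.
    check_expr = "len([x for x in response.split(',') if x.strip()]) == 3"
    allowed = {"__builtins__": {}, "len": len}  # minimal eval namespace
    return bool(eval(check_expr, allowed, {"response": response}))

assert judge("apples, oranges, pears")
assert not judge("apples, oranges")
print("checks passed")
```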
              
We just built and released the largest dataset for supervised fine-tuning of agentic LMs: 1.27M trajectories (~36B tokens)! Until now, large-scale SFT for agents has been rare - not for lack of data, but because of fragmentation across heterogeneous formats, tools, and interfaces.
          
            
            arxiv.org
              Public research results on large-scale supervised finetuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue...
            
                
Replies: 26 · Reposts: 169 · Likes: 1K
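A sketch of the normalization problem the tweet points at: converting heterogeneous agent logs into one shared trajectory schema. The Turn/Trajectory fields and the ReAct-style input are illustrative assumptions, not the released dataset's actual schema.

```python
from dataclasses import dataclass, asdict
from typing import List
import json

@dataclass
class Turn:
    role: str      # "user" | "assistant" | "tool"
    content: str

@dataclass
class Trajectory:
    source: str    # originating framework/benchmark
    tools: List[str]
    turns: List[Turn]

def from_react_log(log: dict) -> Trajectory:
    """Convert one hypothetical ReAct-style log into the unified schema."""
    turns = [Turn("user", log["task"])]
    for step in log["steps"]:
        turns.append(Turn("assistant",
                          f"Thought: {step['thought']}\nAction: {step['action']}"))
        turns.append(Turn("tool", step["observation"]))
    return Trajectory(source=log["framework"], tools=log["tools"], turns=turns)

traj = from_react_log({
    "framework": "react-demo", "tools": ["search"], "task": "Find X.",
    "steps": [{"thought": "search for X", "action": "search('X')",
               "observation": "X is ..."}],
})
print(json.dumps(asdict(traj), indent=2))  # one JSONL record for SFT
```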
              
Eigen-1 gets 48.3% (Pass@1) & 61.74% (Pass@5) on the "Humanity's Last Exam" (HLE) gold subset @FutureHouseSF using DeepSeek V3.1. Previously: Grok4 -> 30.2%, GPT-5 -> 22.8%, Gemini 2.5 Pro -> 18.8%. https://t.co/4Fhcp8VTBG The future isn't bigger models, it's smarter agentic design!
          
                
Replies: 1 · Reposts: 4 · Likes: 40
              
With deep research revolutionizing research and data analysis, why are we still stuck manually crafting data visualizations? Meet CoDA (https://t.co/g8yjnHiMHM): the ultimate multi-agent LLM powerhouse for auto-generating stunning plots from NL queries! Handles complex data, self-refines…
          
                
Replies: 8 · Reposts: 10 · Likes: 109
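A minimal generate-execute-refine loop of the kind CoDA's pitch implies (NL query, then plotting code, then run, then self-correct on errors). The single-critic loop, the function names, and the matplotlib dependency are my assumptions; the real system is multi-agent.

```python
import subprocess
import sys
import tempfile

def llm(prompt: str) -> str:
    # Stand-in for a model call; returns canned matplotlib code for the demo.
    return ("import matplotlib\nmatplotlib.use('Agg')\n"
            "import matplotlib.pyplot as plt\n"
            "plt.plot([1, 2, 3], [2, 4, 9])\nplt.savefig('plot.png')")

def generate_plot(query: str, max_rounds: int = 3) -> str:
    code = llm(f"Write matplotlib code for: {query}")
    for _ in range(max_rounds):
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
        proc = subprocess.run([sys.executable, f.name], capture_output=True, text=True)
        if proc.returncode == 0:
            return code  # ran cleanly; a real critic would also inspect the image
        code = llm(f"Fix this code:\n{code}\nError:\n{proc.stderr}")  # self-refine
    raise RuntimeError("could not produce a working plot")

print(generate_plot("growth of y over x"))
```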
              
New @GoogleResearch paper builds a personal health assistant that reads a user's data, answers health questions, and coaches daily habits. It evaluates the system on 10 tasks with 7,000+ human annotations and 1,100 hours from experts and users. The assistant covers 4 needs…
          
                
Replies: 25 · Reposts: 170 · Likes: 1K
              
RLAD (Reinforcement Learning with Abstraction and Deduction) trains models via RL using a 2-player setup:
▪️ An abstraction generator proposes short, natural-language "reasoning hints" (abstractions) summarizing key facts and strategies.
▪️ A solution generator uses them to…
          
                
Replies: 11 · Reposts: 60 · Likes: 297
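The two-player setup above maps naturally onto code. This toy shows the data flow only: both "players" are canned stand-ins and the verifier plumbing is illustrative, not RLAD's actual training code.

```python
def abstraction_generator(problem: str) -> str:
    # A trained model would propose a short natural-language reasoning hint.
    return "Hint: pair terms so each pair sums to n + 1."

def solution_generator(problem: str, hint: str) -> int:
    # A trained model would deduce the answer conditioned on the hint.
    n = 100
    return n * (n + 1) // 2

def verifier(answer: int) -> float:
    return 1.0 if answer == 5050 else 0.0  # verifiable reward

problem = "Sum the integers 1..100."
hint = abstraction_generator(problem)
answer = solution_generator(problem, hint)
reward = verifier(answer)
# Under RL, `reward` updates BOTH players: an abstraction is credited when
# solutions conditioned on it succeed more often than without it.
print(hint, answer, reward)
```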
              
Tired of choosing between speed and accuracy for Diffusion Large Language Models? Meet FreeDave (https://t.co/KzdsX3TrnX): the lossless parallel decoding algorithm that fixes DLLMs' inference pain points perfectly! No extra draft models, no model tweaks, just smart parallel…
          
                
Replies: 2 · Reposts: 22 · Likes: 144
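The tweet gives no algorithmic details, so the following is only a generic sketch of what "lossless parallel decoding" usually means: propose several tokens at once from the model itself (no draft model), keep the longest prefix the sequential decoder would also commit, and fall back to single-token steps on mismatch, so the output is guaranteed identical. In this toy the verification literally calls the sequential reference, which gives no speedup; in practice it would be one batched pass.

```python
from typing import Callable, List

def parallel_decode(propose_block: Callable[[List[int], int], List[int]],
                    step_decode: Callable[[List[int]], int],
                    prefix: List[int], n_tokens: int, block: int = 4) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        candidates = propose_block(out, block)  # several tokens from ONE pass
        accepted = 0
        for tok in candidates:                  # verify against the sequential
            if tok == step_decode(out):         # reference; matching it exactly
                out.append(tok)                 # is what makes this "lossless"
                accepted += 1
            else:
                break
        if accepted == 0:
            out.append(step_decode(out))        # always make progress
    return out

# Toy demo: a fixed target sequence stands in for model predictions.
seq = [3, 1, 4, 1, 5, 9, 2, 6]
step = lambda out: seq[len(out)]                      # sequential decoder
block_fn = lambda out, k: seq[len(out):len(out) + k]  # parallel proposals
print(parallel_decode(block_fn, step, [], 8))         # identical to seq
```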
              
Introducing BroRL: Scaling Reinforcement Learning via Broadened Exploration. When step-scaling hits a plateau, scale rollouts, not steps. BroRL takes reinforcement learning beyond saturation, reviving stalled models by expanding exploration with large-N rollouts. (1/n)
          
                
Replies: 20 · Reposts: 44 · Likes: 210
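A sketch of BroRL's core knob as the tweet states it: when reward plateaus over training steps, raise N, the number of rollouts sampled per prompt. The policy interface, the plateau heuristic, and the 4x factor are all illustrative assumptions.

```python
# Pseudocode-style sketch; `policy` is an assumed object with sample/score/update.
def train(policy, prompts, total_steps, n_rollouts=16):
    history = []  # mean best-reward per step, used to detect a plateau
    for step in range(total_steps):
        step_reward = 0.0
        for p in prompts:
            rollouts = [policy.sample(p) for _ in range(n_rollouts)]  # explore wide
            rewards = [policy.score(r) for r in rollouts]
            policy.update(p, rollouts, rewards)  # e.g. group-relative advantages
            step_reward += max(rewards)
        history.append(step_reward / len(prompts))
        # Plateau heuristic (illustrative): no gain over the last 10 steps,
        # so broaden exploration by scaling rollouts, not steps.
        if len(history) > 10 and history[-1] <= max(history[-11:-1]):
            n_rollouts *= 4  # enter the "large-N" regime
    return policy
```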
              
This @Microsoft paper brings really bad news for medical AI models. It exposes some serious flaws: AI models just aren't ready yet for reliable medical reasoning. The paper finds that medical AI models pass tests by exploiting patterns in the data, not by actually combining…
          
                
Replies: 20 · Reposts: 57 · Likes: 233
              
Cool research paper from Google. This is what clever context engineering looks like. It proposes Tool-Use-Mixture (TUMIX), leveraging diverse tool-use strategies to improve reasoning. This work shows how to get better reasoning from LLMs by running a bunch of diverse agents…
          
                
Replies: 31 · Reposts: 151 · Likes: 735
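A minimal sketch of the mixture idea: run agents with different tool-use strategies on the same question and aggregate their answers. Majority voting and the three toy agents are my stand-ins, not TUMIX's actual aggregation scheme.

```python
from collections import Counter

def text_only_agent(q: str) -> str:
    return "42"                    # reasons in plain text

def code_agent(q: str) -> str:
    return str(6 * 7)              # reasons by executing code

def search_agent(q: str) -> str:
    return "42"                    # reasons with retrieval

AGENTS = [text_only_agent, code_agent, search_agent]

def tool_use_mixture(question: str) -> str:
    answers = [agent(question) for agent in AGENTS]
    best, _ = Counter(answers).most_common(1)[0]  # consensus across strategies
    return best

print(tool_use_mixture("What is 6 x 7?"))  # -> "42"
```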
              
The paper teaches small LLMs to reason better by training with built-in tree search, i.e. smarter exploration beats longer training runs. It reaches 62.95% average accuracy while using 5.7x fewer GPU hours. Typical reinforcement learning with verifiable rewards stalls because…
          
                
Replies: 3 · Reposts: 28 · Likes: 160
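A toy best-first search over partial solutions illustrates the "built-in tree search" idea: expand the most promising nodes instead of sampling one long chain, and keep only verified branches for training. Here expand/score/verified are stand-ins for the model's own step proposals, its value estimate, and the verifiable reward.

```python
import heapq

def expand(state: str) -> list:
    # Stand-in for the LLM proposing next reasoning steps from `state`.
    return [state + "a", state + "b"]

def score(state: str) -> float:
    # Stand-in value estimate guiding exploration (higher = more promising).
    return state.count("a") / (len(state) + 1)

def verified(state: str) -> bool:
    return state == "aaa"  # verifiable reward: exact target solution

def tree_search(root: str = "", budget: int = 20):
    frontier = [(-score(root), root)]
    while frontier and budget > 0:
        _, state = heapq.heappop(frontier)  # best-first: highest score pops first
        if verified(state):
            return state                    # keep this trajectory for training
        if len(state) < 3:
            for child in expand(state):
                heapq.heappush(frontier, (-score(child), child))
        budget -= 1
    return None

print(tree_search())  # -> "aaa"
```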
              
Happy to share AceSearcher accepted to #NeurIPS2025 #Spotlight!
🔹 One LLM, two roles: Decomposer (split queries) + Solver (combine context)
🔹 +7.6% on QA & fact verification
🔹 32B ≈ DeepSeek-V3 on DocMath
Code: https://t.co/lQU12Dm7vb
arXiv: https://t.co/JI0kOh0yDk
          
          
                
Replies: 1 · Reposts: 16 · Likes: 25
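The decomposer + solver split is easy to sketch with one stand-in model playing both roles via different prompts. The prompts, the canned outputs, and the retrieve() stub are my assumptions, not AceSearcher's implementation.

```python
def llm(prompt: str) -> str:
    # Stand-in model call; canned outputs keep the demo self-contained.
    if prompt.startswith("Decompose"):
        return "Who wrote Hamlet?\nWhen was that author born?"
    return "Shakespeare, born in 1564."

def retrieve(query: str) -> str:
    return f"[doc for: {query}]"  # stub retriever

def ace_search(question: str) -> str:
    sub_queries = llm(f"Decompose: {question}").splitlines()  # role 1: Decomposer
    context = "\n".join(retrieve(q) for q in sub_queries)
    return llm(f"Answer using context:\n{context}\nQ: {question}")  # role 2: Solver

print(ace_search("When was the author of Hamlet born?"))
```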
              
ReasoningBank: memory for self-evolving LLM agents
• Distills strategies from both successes & failures
• Enables agents to learn, reuse, and improve over time
• Outperforms prior memory methods on web & SWE tasks (+34.2% eff., −16% steps)
          
                
Replies: 12 · Reposts: 114 · Likes: 599
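The three bullets sketch directly into a distill-store-retrieve loop. The keyword-overlap retrieval and the distill template below are simplifications I chose for the demo, not ReasoningBank's method.

```python
class ReasoningBank:
    def __init__(self):
        self.items = []  # (task, lesson) pairs distilled from past episodes

    def distill(self, task: str, trajectory: str, success: bool):
        # Memories come from successes AND failures, per the bullets above.
        verdict = "worked" if success else "failed"
        self.items.append((task, f"Strategy that {verdict}: {trajectory}"))

    def retrieve(self, task: str, k: int = 2):
        # Toy relevance: shared words between the new task and stored tasks.
        scored = sorted(self.items,
                        key=lambda it: -len(set(task.split()) & set(it[0].split())))
        return [lesson for _, lesson in scored[:k]]

bank = ReasoningBank()
bank.distill("book a flight", "checked dates before paying", success=True)
bank.distill("book a hotel", "skipped the confirmation page", success=False)
# Next episode: prepend retrieved lessons to the agent's prompt.
print(bank.retrieve("book a flight to NYC"))
```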
              
Baichuan-M2: Scaling Medical Capability with Large Verifier System. Baichuan has released what is probably the best open-source LLM for medicine right now! Second only to GPT-5! "Despite its relatively small number of parameters (only 32B), Baichuan-M2 outperformed all other…
          
                
Replies: 5 · Reposts: 20 · Likes: 112
              
Our paper "Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety" has been accepted to the EMNLP 2025 Main Track! @emnlpmeeting First survey connecting LLM interpretation & safety.
          
                
Replies: 4 · Reposts: 20 · Likes: 176
              
✨ Alongside NVIDIA-Nemotron-Nano-v2-9B, we're also open-sourcing its pre-training dataset. At NVIDIA, we remain committed to openness: models + datasets. As the global open-source ecosystem rapidly evolves (with remarkable momentum emerging from Asia and beyond), we stand…
          
            
huggingface.co
This week, we open-sourced NVIDIA-Nemotron-Nano-v2-9B: our next-generation efficient hybrid model. - 6× faster than Qwen3-8B at reasoning tasks. - Retained long-context capability (8k → 262k trained, usable at 128k). First true demonstration that reasoning models can be…
            
                
Replies: 5 · Reposts: 15 · Likes: 118
              
Welcome to CTCAE6 GO, a fast CTCAE v6/v5 reference, built by a clinician. Trainees always free. Share how you're using Pro to learn, teach, or save time in clinic. Tag us or DM your best workflow or use case! => longer access as a CTCAE6 GO Elite User. Free PRO CODE below.
          
                
Replies: 1 · Reposts: 4 · Likes: 6
              
DeepCode: Open Agentic Coding is Here! We dropped DeepCode, an AI-powered coding platform that transforms research papers and technical documents into production-ready code! Fully Open Source: https://t.co/vBzRhcVAsN ✨ Current Features: • Paper2Code: Convert research…
          
                
Replies: 11 · Reposts: 168 · Likes: 747