Jiaxin Huang
@jiaxinhuang0229
Followers 567 · Following 218 · Media 4 · Statuses 37
Assistant professor @WUSTL CSE. LLM, NLP, ML, Data Mining. PhD from @IllinoisCS. Microsoft Research PhD Fellow.
Joined June 2022
Vibe coding with an LLM, but the final vibe is off? 🤔 We analyze why models fail the "vibe check" and what truly matters to users. Key insight: human preference 🧑💻 ≈ functional correctness ✅ + instruction following 🎯. Check out our paper: https://t.co/s5gGME5O9I
2 replies · 17 reposts · 69 likes
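The "≈" above invites a literal reading. A minimal sketch of such a preference proxy, where the weights and both scoring signals are invented for illustration (the paper's actual formulation may differ):

```python
# Hypothetical illustration of "human preference ≈ functional correctness
# + instruction following". Weights and inputs are made up for illustration.

def preference_score(passed_tests: int, total_tests: int,
                     instructions_followed: int, total_instructions: int,
                     w_correct: float = 0.6, w_follow: float = 0.4) -> float:
    """Combine two normalized signals into one preference proxy."""
    correctness = passed_tests / total_tests                  # fraction of tests passed
    following = instructions_followed / total_instructions    # fraction of constraints met
    return w_correct * correctness + w_follow * following

# Example: a response that passes 8/10 tests and follows 3/4 instructions.
print(preference_score(8, 10, 3, 4))  # 0.78
```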
Thrilled to share this exciting work, R-Zero, from my student @ChengsongH31219, where an LLM learns to reason from zero human-curated data! The framework co-evolves a "Challenger" that proposes difficult tasks and a "Solver" that solves them. Check out more details in the …
🚀🚀 Excited to share our paper R-Zero: Self-Evolving Reasoning LLM from Zero Data! How do you train an LLM without data? R-Zero teaches large language models to reason starting with nothing but a base model. No data required!!! Paper: https://t.co/z4tCJFTXUG Code: …
1 reply · 4 reposts · 23 likes
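A toy sketch of the Challenger–Solver co-evolution loop described above, with a stand-in arithmetic task generator and solver in place of the actual LLMs; the difficulty-update rule is an assumption, not R-Zero's:

```python
# Toy co-evolution loop: Challenger proposes tasks, Solver attempts them,
# and success/failure steers task difficulty. Both roles are stand-ins.
import random

def challenger_propose(difficulty: float) -> tuple[str, int]:
    """Toy task generator: an addition problem scaled by difficulty."""
    a = random.randint(1, int(10 ** (1 + 3 * difficulty)))
    b = random.randint(1, 100)
    return f"What is {a} + {b}?", a + b

def solver_answer(question: str) -> int:
    """Toy solver: parse and add (a real Solver would be an LLM)."""
    parts = question.removeprefix("What is ").removesuffix("?").split(" + ")
    return int(parts[0]) + int(parts[1])

difficulty = 0.1
for step in range(5):
    question, gold = challenger_propose(difficulty)
    correct = solver_answer(question) == gold
    # Co-evolution signal: if the Solver succeeds, push the Challenger to
    # propose harder tasks; if it fails, ease off.
    difficulty = min(1.0, difficulty + 0.1) if correct else max(0.0, difficulty - 0.1)
    print(step, question, correct, round(difficulty, 2))
```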
We @Zai_org are thrilled to open-source GLM-4.1V-9B-Thinking, a VLM that can think with long CoTs. SoTA among <10B VLMs, comparable to Qwen-2.5-VL-72B on 18 tasks. One RL to rule them all!
Details:
- Tech report: https://t.co/sxsKy2xP2P
- Code: https://t.co/O8WXX7vK0F
3 replies · 9 reposts · 30 likes
Excited to share our #ICML25 paper (led by @weizhepei) on accelerating LLM decoding!
⚡️ AdaDecode predicts tokens early from intermediate layers
🙅♂️ No drafter model needed
🪶 Just lightweight LM heads
✨ Output consistency with standard autoregressive decoding
Thread👇
⚠️ New #ICML2025 paper! Want faster, accurate LLM decoding? Check out AdaDecode! 🚀
⚙️ Adaptive token prediction at intermediate layers w/o a full forward pass!
🎯 Identical output to standard decoding!
🧩 No draft model: just a lightweight LM head (0.2% of model size)!
🧵[1/n]
1 reply · 5 reposts · 34 likes
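A rough sketch of the decoding pattern the thread describes: draft a token from an intermediate layer with a lightweight LM head, then let the remaining layers check it, so the emitted token always matches standard decoding. All weights here are random toys, and the real method's verification scheduling may differ:

```python
# Toy early-exit decoding with verification against the full stack.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN, LAYERS, EXIT_AT = 50, 16, 8, 4

# Toy transformer stack: one weight matrix per layer.
layer_weights = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.1 for _ in range(LAYERS)]
full_head = rng.standard_normal((HIDDEN, VOCAB))    # the model's real LM head
early_head = rng.standard_normal((HIDDEN, VOCAB))   # lightweight head at layer EXIT_AT

def forward(h, start, stop):
    for w in layer_weights[start:stop]:
        h = np.tanh(h @ w)
    return h

h0 = rng.standard_normal(HIDDEN)
h_mid = forward(h0, 0, EXIT_AT)
draft = int(np.argmax(h_mid @ early_head))          # cheap early prediction
h_top = forward(h_mid, EXIT_AT, LAYERS)
final = int(np.argmax(h_top @ full_head))           # what standard decoding emits
# Keep the draft only if the full model agrees, guaranteeing output
# identical to standard autoregressive decoding.
token = draft if draft == final else final
print(draft, final, token)
```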
🚀🚀 Excited to share our new work on speculative decoding by @shrangoh! We tackle a key limitation of draft models, which predict worse tokens at later positions, and present PosS, which generates high-quality drafts!
New Research Released! 🚀 PosS: Position Specialist Generates Better Draft for Speculative Decoding. Is your LLM fast enough? PosS consistently improves over current speculative decoding methods by using position-specialized draft layers to generate high-quality drafts! 🔖 Paper: …
1 reply · 3 reposts · 10 likes
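A back-of-envelope sketch of why position specialization could help: in speculative decoding, drafts are accepted consecutively from the first position, so late-position quality decay compounds. The acceptance rates below are invented numbers:

```python
# Hypothetical per-position acceptance rates: a shared draft model whose
# quality decays with draft position vs. position-specialized draft layers.
shared =      [0.80, 0.65, 0.50, 0.38]
specialists = [0.80, 0.74, 0.68, 0.62]

def expected_accepted(rates):
    """Expected accepted draft tokens when acceptance must be consecutive
    from position 0 (standard speculative decoding)."""
    total, p = 0.0, 1.0
    for r in rates:
        p *= r          # probability all drafts up to this position pass
        total += p
    return total

print(expected_accepted(shared))       # fewer tokens accepted per step
print(expected_accepted(specialists))  # more tokens accepted per step
```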
We enabled OLMoTrace for Tülu 3 models! 🤠 Matched spans are shorter than for OLMo models, because we can only search Tülu's post-training data (the base model is Llama). Yet we thought it'd still bring some value. Try it yourself on the Ai2 playground -- https://t.co/xGDdIR99De
2 replies · 16 reposts · 47 likes
🚀 Introducing RAST: Reasoning Activation via Small Model Transfer!
✨ RAST adjusts key "reasoning tokens" at decoding time using insights from smaller RL-tuned models: no full RL tuning for large models!
⚡ Efficient & performant, 🧠 scalable & easy, 📉 up to 50% less GPU memory!
3 replies · 21 reposts · 117 likes
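One common way to realize this kind of decoding-time transfer is logit arithmetic between a small RL-tuned model and its base counterpart; a hedged sketch with toy logits (the paper's exact adjustment rule may differ):

```python
# Toy decoding-time transfer: steer a large base model with the logit
# difference between a small RL-tuned model and its small base counterpart.
import numpy as np

def transfer_logits(large_base, small_rl, small_base, alpha=1.0):
    # The small pair tells us how RL tuning shifts the distribution;
    # apply that shift to the large model's logits.
    return large_base + alpha * (small_rl - small_base)

large_base = np.array([2.0, 1.0, 0.5, 0.1, 0.0, -1.0])
small_base = np.array([1.5, 1.2, 0.4, 0.3, 0.1, -0.5])
small_rl   = np.array([1.0, 2.5, 0.4, 0.3, 0.1, -0.5])  # RL boosts a "reasoning token"

adjusted = transfer_logits(large_base, small_rl, small_base)
print(np.argmax(large_base), np.argmax(adjusted))  # token choice shifts: 0 -> 1
```

The large model never needs RL tuning; only the small pair does, which is where the GPU-memory saving would come from.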
What truly drives reasoning in RLVR? Check out our new paper led by @tianhongzxy for some fascinating insights and analysis!! 🤩
🔥 The debate's been wild: how does the reward in RLVR actually improve LLM reasoning? 🤔
🚀 Introducing our new paper 👇
💡 TL;DR: Just penalizing incorrect rollouts ❌, with no positive reward at all, can boost LLM reasoning, sometimes better than PPO/GRPO!
🧵[1/n]
0 replies · 3 reposts · 27 likes
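A minimal sketch of the negative-only reward rule from the thread, plus the group-normalization step that explains why correct rollouts still end up advantaged; the normalization detail is a GRPO-style assumption, not necessarily the paper's:

```python
# Negative-only reward: penalize incorrect rollouts, give zero (not positive)
# reward to correct ones.

def negative_only_reward(correct: bool) -> float:
    return 0.0 if correct else -1.0

rollouts = [True, False, True, True, False]
rewards = [negative_only_reward(c) for c in rollouts]

# With group-normalized advantages (as in GRPO-style training), correct
# rollouts still receive positive advantage relative to penalized ones,
# so learning signal survives without any positive reward.
mean = sum(rewards) / len(rewards)
advantages = [r - mean for r in rewards]
print(rewards)     # [0.0, -1.0, 0.0, 0.0, -1.0]
print(advantages)  # [0.4, -0.6, 0.4, 0.4, -0.6]
```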
Thrilled to be named to the Forbes 30 Under 30 Asia 2025 list! 🤩 Excited to keep pushing the boundaries of LLMs to tackle real-world challenges 🙌
Just launched: Meet Asia’s Forbes 30 Under 30, Class of 2025 https://t.co/Ry5JDY2rza
#ForbesU30Asia #ForbesUnder30
7 replies · 12 reposts · 122 likes
Our paper was accepted at @icmlconf 2025! If you're working on RL for reasoning, consider adding more logical puzzle data to your training and eval. Share your ideas for logical reasoning tasks for ZebraLogic v2 and interesting RL studies you want to see! Many thanks to my …
If you're interested in LLMs like o1 and R1 for complex reasoning, check out this paper: we show that logical reasoning tasks are ideal for evaluating and understanding their scaling limits. 🦓 ZebraLogic-Bench is a dataset of 1K constraint satisfaction problems (CSPs) …
2 replies · 4 reposts · 96 likes
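For readers unfamiliar with the task family: a constraint satisfaction puzzle in this style can be brute-forced in a few lines. The toy puzzle below is invented, not drawn from ZebraLogic:

```python
# Toy zebra-style CSP: assign houses 1..3 so that Alice is left of Bob
# and Carol is not in house 2. Brute-force over all assignments.
from itertools import permutations

people = ["Alice", "Bob", "Carol"]
for houses in permutations([1, 2, 3]):
    pos = dict(zip(people, houses))
    if pos["Alice"] < pos["Bob"] and pos["Carol"] != 2:
        print(pos)  # every assignment satisfying all constraints
```

Real benchmark instances add enough constraints to pin down a unique solution, which is what makes them cleanly verifiable for RL.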
Sorry to miss ICLR this year — but if you're interested in the 𝐥𝐨𝐧𝐠-𝐜𝐨𝐧𝐭𝐞𝐱𝐭 𝐋𝐋𝐌 𝐯𝐬. 𝐑𝐀𝐆 𝐝𝐞𝐛𝐚𝐭𝐞, don’t miss our poster! My amazing collaborator from Google will be there to chat and share insights. 📍 Hall 3 + Hall 2B #302 🕒 Thu, Apr 24 | 3:00–5:30
Long-Context LLMs Meet RAG: for many long-context LLMs, output quality declines as the number of retrieved passages increases, and the performance loss appears to come from retrieved hard negatives. They propose two ways to improve long-context LLM-based RAG: 1) retrieval …
1 reply · 19 reposts · 89 likes
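The quoted summary cuts off before the proposed fixes, so the snippet below only illustrates the diagnosed problem: as more passages are stuffed into the context, high-scoring but irrelevant "hard negatives" creep in. The passages and the naive top-k builder are invented:

```python
# Toy retrieval pool: (text, retriever score, actually relevant?).
retrieved = [
    ("gold passage answering the query",        0.92, True),
    ("hard negative: on-topic but wrong facts", 0.88, False),
    ("hard negative: lexically similar only",   0.75, False),
    ("easy negative: off-topic",                0.20, False),
]

def build_context(passages, k):
    # Naive top-k by retriever score: hard negatives slip in as k grows.
    return sorted(passages, key=lambda p: p[1], reverse=True)[:k]

for k in (1, 3):
    ctx = build_context(retrieved, k)
    noise = sum(1 for _, _, relevant in ctx if not relevant)
    print(f"k={k}: {noise} hard negatives in context")
```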
I will be presenting our poster for the “Law of the Weakest Link” paper at ICLR today! If you're interested in this topic, feel free to stop by and chat! 📍 Location: Hall 3 + Hall 2B #257 ⏰ Time: Apr 25 | 10:00 AM – 12:30 PM SGT
Excited to share our recent work! We define and benchmark cross capabilities in LLMs, revealing the "Law of the Weakest Link": collaborative performance clusters around the weakest individual capability. 📄 Paper: https://t.co/GjxWmdyQ9Y 🌐 Website: …
0 replies · 8 reposts · 36 likes
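The "Law of the Weakest Link" reduces to a one-line estimate; a toy comparison with invented capability scores:

```python
# Cross-capability performance tracks the minimum of the individual
# capabilities, not their average. Scores are invented for illustration.
individual = {"coding": 0.85, "image_recognition": 0.40}

weakest_link_estimate = min(individual.values())
average_estimate = sum(individual.values()) / len(individual)
print(weakest_link_estimate, average_estimate)  # 0.4 vs. 0.625
```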
🚀 I'll be at #ICLR2025! Our group is presenting:
Apr 25: Reward Calibration in RLHF
Apr 26: Generative Joint Graph Language Modeling
Apr 27/28: Logit Arithmetic Approach for In-Context Learning (SLLM, Reasoning & Planning Workshop)
😆 Let's chat about LLM research, PhD …
0 replies · 12 reposts · 66 likes
Can LVLMs solve crossword puzzles? Our evaluation of over 20 LLMs and LVLMs finds that LVLMs largely lag behind LLMs due to poor vertical word extraction. Reasoning LLMs (like o3-mini) outperform non-reasoning models, benefiting from cross-letter constraints!
🚀 Reasoning models are acing complex math and science problems, but what about the everyday puzzles we solve for fun? We introduce a new benchmark, CrossWordBench, designed to evaluate the reasoning capabilities of both LLMs and LVLMs through the medium of crossword puzzles. We …
0 replies · 0 reposts · 6 likes
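Why is vertical extraction the bottleneck? Reading a down answer means indexing one column across rows, which is trivial in code but apparently hard for an LVLM reading a grid image. A toy grid:

```python
# Across words read along rows; down words read along columns.
grid = [
    list("EAT"),
    list("ARE"),
    list("RED"),
]

across = ["".join(row) for row in grid]      # ['EAT', 'ARE', 'RED']
down = ["".join(col) for col in zip(*grid)]  # ['EAR', 'ARE', 'TED']

print("across:", across)
print("down:  ", down)
```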
LLMs can now trace their outputs to their training data. 🤯 I cover the implications of @allen_ai's new OLMoTrace feature on @thenewstack today. https://t.co/FZdiGwmRQj
thenewstack.io
Ai2’s OLMoTrace uses string matching to reveal the exact sources behind chatbot responses
3 replies · 10 reposts · 38 likes
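A minimal sketch of string-match tracing in this spirit: greedily find long output spans that appear verbatim in a corpus. OLMoTrace itself searches indexed multi-trillion-token training sets; this linear scan is illustration only:

```python
# Greedy longest-span string matching between model output and a tiny corpus.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "to be or not to be that is the question",
]
output = "he said the quick brown fox jumps high"

def matched_spans(output: str, corpus: list[str], min_words: int = 4):
    words, spans, i = output.split(), [], 0
    while i < len(words):
        best = None
        # Try the longest span starting at word i first.
        for j in range(len(words), i + min_words - 1, -1):
            span = " ".join(words[i:j])
            if any(span in doc for doc in corpus):
                best = (span, j)
                break
        if best:
            spans.append(best[0])
            i = best[1]   # skip past the matched span
        else:
            i += 1
    return spans

print(matched_spans(output, corpus))  # ['the quick brown fox jumps']
```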
Thrilled to share our recent work "Efficient Test-Time Scaling via Self-Calibration"! We introduce a smart way to boost LLM efficiency in test-time scaling without sacrificing accuracy 🧠! By using self-calibrated confidence scores, we enable early stopping in Best-of-N and …
🚀🚀 New Research Alert: Efficient Test-Time Scaling via Self-Calibration! ❓ How should we dynamically allocate computational resources in repeated sampling methods? 💡 We propose an efficient test-time scaling method that uses model confidence to dynamically adjust sampling, since …
0 replies · 2 reposts · 13 likes
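A sketch of confidence-gated early stopping in Best-of-N, per the idea above; the sampler is simulated and the threshold is an arbitrary choice, not the paper's calibration procedure:

```python
# Stop sampling once a response's (self-calibrated) confidence clears a
# threshold, instead of always drawing all N samples.
import random

random.seed(0)

def sample_response():
    """Stand-in for (response, calibrated_confidence) from the model."""
    conf = random.random()
    return f"answer@{conf:.2f}", conf

def best_of_n_early_stop(n_max=16, threshold=0.9):
    best, best_conf, used = None, -1.0, 0
    for _ in range(n_max):
        resp, conf = sample_response()
        used += 1
        if conf > best_conf:
            best, best_conf = resp, conf
        if conf >= threshold:   # confident enough: stop early, save compute
            break
    return best, best_conf, used

print(best_of_n_early_stop())  # often returns with used < n_max
```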
🚀 Exciting opportunity for LLM multi-agent researchers: the Agent Society Challenge at WWW 2025! Monetary prizes total $12,000, and top teams will be recommended to publish their results in the WWW Companion proceedings 🥳 More details can be found here: …
0 replies · 1 repost · 14 likes
🤨 Ever wonder why RLHF-trained LLMs are overconfident? 🚀 Check out our new work led by @JixuanLeng, revealing that reward models themselves are biased toward high-confidence responses! 😯 🥳 We introduce two practical solutions (PPO-M & PPO-C) to improve language model …
🚀 RLHF-trained LLMs often show overconfidence in their expressed confidence levels. Our NEW PAPER reveals why: reward models tend to favor highly confident responses, even when those responses are wrong! We introduce two PPO variants, PPO with Calibrated Reward Modeling and PPO with …
0 replies · 2 reposts · 27 likes
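A toy rendering of the bias and one hypothetical correction: let verbalized confidence raise the reward only when the answer is correct. The 0.5 coefficient and the whole penalty form are invented; PPO-M's and PPO-C's actual formulations live in the paper:

```python
# The failure mode: reward drifts upward with stated confidence, even for
# wrong answers. The fix sketched here flips the sign when incorrect.

def biased_reward(base_score: float, stated_confidence: float) -> float:
    # Observed bias: confidence is rewarded unconditionally.
    return base_score + 0.5 * stated_confidence

def calibrated_reward(base_score: float, stated_confidence: float,
                      correct: bool) -> float:
    # Hypothetical correction: confidence only helps when correct.
    sign = 1.0 if correct else -1.0
    return base_score + 0.5 * sign * stated_confidence

print(biased_reward(0.2, 0.95))                #  0.675: overconfidence rewarded
print(calibrated_reward(0.2, 0.95, False))     # -0.275: overconfidence penalized
```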
🔍 Which reward model characteristics best predict RLHF performance? We evaluated RMs & LLM judges on:
- Human preference agreement on Chatbot Arena
- Accuracy in selecting correct code/math answers
- Correlation with Chatbot Arena rankings
Interesting finding: lower-bound …
🔥 New benchmark: Preference Proxy Evaluations (PPE). Can reward models guide RLHF? Can LLM judges replace real human evals? PPE addresses these questions! Highlights:
- Real-world human preferences from Chatbot Arena 💬
- 16,000+ prompts and 32,000+ diverse model responses 🗿
- …
1 reply · 3 reposts · 35 likes
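The three proxy metrics listed above are straightforward to compute; a self-contained sketch with invented data (a tiny Spearman helper stands in for scipy):

```python
# Three ways to score a reward model against ground truth, mirroring the
# thread's list: preference agreement, selection accuracy, rank correlation.

def agreement(rm_prefs, human_prefs):
    """Fraction of pairwise preferences where the RM matches humans."""
    return sum(a == b for a, b in zip(rm_prefs, human_prefs)) / len(rm_prefs)

def selection_accuracy(chosen_idx, correct_idx):
    """How often the RM's top-scored answer is the verifiably correct one."""
    return sum(c == g for c, g in zip(chosen_idx, correct_idx)) / len(chosen_idx)

def spearman(x, y):
    """Rank correlation between RM scores and arena ratings (per model)."""
    n = len(x)
    rank = lambda v: {m: r for r, m in enumerate(sorted(v, key=v.get))}
    rx, ry = rank(x), rank(y)
    d2 = sum((rx[m] - ry[m]) ** 2 for m in x)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(agreement([0, 1, 1, 0], [0, 1, 0, 0]))     # 0.75
print(selection_accuracy([2, 0, 1], [2, 1, 1]))  # ~0.67
print(spearman({"A": 1.0, "B": 0.8, "C": 0.5},
               {"A": 1200, "B": 1150, "C": 1100}))  # 1.0: same ordering
```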
Curious about efficient many-shot ICL with LLMs? Our new paper led by @ChengsongH31219 introduces LARA, which divides & reweights in-context examples to ensure:
✅ Better performance
✅ Improved scalability
✅ No need to access model parameters
✅ Less memory usage
🚀🚀 Excited to share our paper Divide, Reweight, and Conquer: A Logit Arithmetic Approach for In-Context Learning! How can we implement efficient inference in the many-shot in-context learning setting? We propose LARA and B-LARA for efficient LLM inference by dividing input …
0 replies · 1 repost · 9 likes
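A sketch of the divide-and-reweight idea: run each group of demonstrations separately and combine the per-group next-token logits with weights. In LARA the weights are optimized; the values here are hand-picked toys:

```python
# Combine per-group logits into one next-token distribution via a weighted
# sum followed by softmax. Logits and weights are invented for illustration.
import numpy as np

def combine_group_logits(group_logits: np.ndarray, weights: np.ndarray):
    weights = weights / weights.sum()            # normalize group weights
    mixed = (weights[:, None] * group_logits).sum(axis=0)
    return np.exp(mixed) / np.exp(mixed).sum()   # softmax

# 3 groups of in-context examples, vocabulary of 5 tokens.
group_logits = np.array([
    [2.0, 0.5, 0.1, 0.0, -1.0],   # group 1's next-token logits
    [1.8, 0.7, 0.2, 0.1, -0.8],   # group 2 agrees
    [0.2, 2.2, 0.1, 0.0, -0.9],   # group 3 disagrees
])
weights = np.array([1.0, 1.0, 0.5])  # down-weight the less helpful group

print(combine_group_logits(group_logits, weights).round(3))
```

Because only output logits are combined, no access to model parameters is needed, which matches the checklist in the tweet above.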