Gaotang Li
@GaotangLi
Followers
143
Following
141
Media
11
Statuses
77
Ph.D. @UofIllinois | Undergrad @UMich. Language Models.
Joined November 2024
Negative Log-Likelihood (NLL) has long been the go-to objective for classification and SFT, but is it universally optimal? We explore when alternative objectives outperform NLL and when they don't, based on two key factors: the objective's prior-leaningness and the model's capability.
3
22
125
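The contrast this thread draws between NLL and a prior-leaning objective such as -p can be sketched numerically. A minimal NumPy illustration, assuming per-token probabilities p from the base model (the function names are mine, not the paper's):

```python
import numpy as np

def nll(p):
    """Standard negative log-likelihood, -log(p).
    The loss (and its gradient) blows up as p -> 0, so tokens the model
    finds improbable dominate training."""
    return -np.log(p)

def neg_p(p):
    """A prior-leaning objective, -p.
    Bounded loss with a constant-magnitude gradient, so low-probability
    (prior-contradicting) tokens exert no outsized pull."""
    return -p

p = np.array([0.01, 0.5, 0.99])  # low / medium / high base-model confidence
print(nll(p))    # the p=0.01 token contributes ~4.6, dwarfing the others
print(neg_p(p))  # every token contributes comparably
```

Under NLL the rare token dominates the batch gradient; under -p it does not, which is the sense in which -p "leans on" the model's priors.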
New Stanford + Princeton + Illinois Univ paper shows language model agents can collaborate through hidden vectors instead of text, giving better answers with less compute. In benchmarks they reach around 4x faster inference with roughly 70% to 80% fewer output tokens than strong
20
61
346
Multi-agent systems are powerful but expensive. However, the cost isn't in the reasoning itself. It's in the communication. Agents exchange full text messages, consuming tokens for every coordination step. When agents need to collaborate on complex problems, this overhead adds up.
17
139
666
Want to build scaling laws for RL but not sure how to scale? Or what scales? Or whether RL even scales predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs
9
103
552
Releasing RAG over Tables: Hierarchical Memory Index, Multi-Stage Retrieval, and Benchmarking. We push RAG to multi-tables! Code: https://t.co/kppOT6uXBC Paper: https://t.co/KclevgAf6z
1
4
8
Vibe coding with an LLM, but the final vibe is off? We analyze why models fail the "vibe check" and what truly matters to users. Key insight: human preference ≈ functional correctness + instruction following. Check out our paper: https://t.co/s5gGME5O9I
2
17
70
Introducing #TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning! We introduce the first PRM that explicitly leverages tools during its reasoning process for robust step-verification. Paper: https://t.co/PnGbmvIJzw [1/n]
5
4
5
Thanks for posting our work!
Fine-tuning should not always use log-loss; the right loss depends on model strength. On math tasks with strong priors, dropping low-confidence tokens boosts accuracy by up to 16%. The big takeaway is a clear rule for choosing losses by capability: it shows when to trust the model's own priors.
0
0
3
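The "drop low-confidence tokens" idea above amounts to a masked loss. A NumPy sketch, assuming per-token base-model probabilities and the 0.2 threshold mentioned later in the thread (the function name is mine):

```python
import numpy as np

def thresholded_nll(p, tau=0.2):
    """NLL with low-confidence tokens dropped: -log(p) * 1{p >= tau}.
    Tokens the base model assigns probability below tau contribute zero
    loss (and zero gradient), so its strong priors at those positions
    are left untouched by fine-tuning."""
    return np.where(p >= tau, -np.log(p), 0.0)

p = np.array([0.05, 0.3, 0.9])
print(thresholded_nll(p))  # the p=0.05 token is masked to 0
```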
Why can't Transformers learn multiplication? Even with billions of params, models struggle with multi-digit multiplication. In our new work, we reverse-engineer two models, a standard fine-tuned (SFT) model and an implicit chain-of-thought (ICoT) model, to see why. Read on! 1/n
12
109
683
A big shout-out to our amazing collaborators! @Ruizhong_Qiu @xiusi_chen @hengjinlp @hanghangtong (13/n)
0
0
1
Key takeaway: Match your objective to model capability. The right objective depends on where your task lies on the model-capability continuum: • Model-Strong (extensive priors) → prior-leaning objectives (-p, (1-p¹⁰)/10, thresholded objectives) • Model-Weak (no priors) → standard NLL
arxiv.org
Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training...
0
0
2
Beyond empirics, we provide theoretical analysis via a gradient-flow characterization. We establish sufficient conditions for when objectives reverse their relative effectiveness at the Model-Weak and Model-Strong ends. (11/n)
1
0
1
As a case study, we analyze the parametric family -(1-p^α)/α, varying α from 0 to 10 (a higher α means more prior-leaning), measuring both test performance and likelihood estimation on the training set after fine-tuning. (10/n)
1
0
0
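The family in the tweet above interpolates between NLL and flatter, prior-leaning losses. A small numeric check, assuming the objective is to be maximized (as α → 0 it recovers log p, i.e., standard likelihood):

```python
import numpy as np

def family(p, alpha):
    """Parametric objective -(1 - p**alpha)/alpha.
    alpha -> 0 recovers log(p) (the NLL direction); alpha = 1 gives
    p - 1, whose gradient matches -p; large alpha is nearly flat for
    low p, so prior-contradicting tokens barely register."""
    return -(1.0 - p**alpha) / alpha

p = 0.3
print(family(p, 1e-8))  # ~ log(0.3) ≈ -1.204
print(family(p, 1.0))   # p - 1 = -0.7
print(family(p, 10.0))  # ≈ -0.1, almost independent of p at low p
```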
Model-Strong ablation: We threshold training by quantiles of base-model predicted probabilities, comparing -log(p), -p, log(1-p) (emphasis shifts low-p → high-p). Key finding: Low-probability tokens consistently harm all objectives! (9/n)
1
0
0
Model-Intermediate region (moderate p̄): No clear winner. When models have partial priors, neither prior-leaning nor prior-averse objectives dominate. No universal solution here. (8/n)
1
0
0
Model-Weak end (low p̄): The tables turn: NLL dominates. On novel figlet tasks where models start near-random, e.g., Qwen2.5-7B: NLL 82.48 vs -p 10.15. Prior-leaning objectives fail catastrophically. Standard NLL is optimal when "learning from scratch". (7/n)
1
0
1
The gains are substantial and consistent. E.g., Qwen2.5-Math-7B: Thresholded NLL gets 38.16% avg (vs 22.67% for NLL), a +70% relative gain. Prior-leaning objectives consistently outperform across the standard math benchmarks. (6/n)
1
0
1
Model-Strong end (high p̄): Prior-leaning objectives win. We test -p and -log(p)·1{p≥0.2} (thresholding out low-p tokens), both more prior-leaning than standard NLL. Results across LLaMA, DeepSeek, and Qwen models. (5/n)
1
0
2
How do we quantify model capability? We measure the mean predicted probability of all tokens in the training data before fine-tuning; this reveals how much the base model already knows. The differences are dramatic. (4/n)
1
0
2
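The capability probe described above is just a mean over base-model token probabilities. A toy sketch with made-up numbers (the probabilities and task names are illustrative, not from the paper):

```python
import numpy as np

def capability_score(token_probs):
    """Mean probability the base model assigns to the training tokens
    before fine-tuning: high means the model already 'knows' the task
    (Model-Strong), near-random means it must learn from scratch
    (Model-Weak)."""
    return float(np.mean(token_probs))

# Hypothetical per-token probabilities for two tasks:
math_probs   = np.array([0.8, 0.9, 0.6, 0.7])      # strong priors
figlet_probs = np.array([0.02, 0.01, 0.05, 0.03])  # near-random
print(capability_score(math_probs))    # ≈ 0.75 -> Model-Strong end
print(capability_score(figlet_probs))  # ≈ 0.03 -> Model-Weak end
```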
Now, what is "model capability"? How much task-relevant knowledge the base model already encodes before fine-tuning: • Model-Strong (MS): Strong priors (e.g., 25% of pretraining tokens are math reasoning, according to the Llama-3 report) • Model-Intermediate (MI): Partial priors • Model-Weak (MW): No priors
1
0
3