Gaotang Li Profile
Gaotang Li

@GaotangLi

Followers 143 · Following 141 · Media 11 · Statuses 77

Ph.D. @UofIllinois | Undergrad @UMich. Language Models.

Joined November 2024
@GaotangLi
Gaotang Li
2 months
Negative Log-Likelihood (NLL) has long been the go-to objective for classification and SFT, but is it universally optimal? We explore when alternative objectives outperform NLL and when they don't, based on two key factors: the objective's prior-leaningness and the model's capability.
@rohanpaul_ai
Rohan Paul
1 day
New Stanford + Princeton + Illinois Univ paper shows language model agents can collaborate through hidden vectors instead of text, giving better answers with less compute. In benchmarks they reach around 4x faster inference with roughly 70% to 80% fewer output tokens than strong
@dair_ai
DAIR.AI
4 days
Multi-agent systems are powerful but expensive. However, the cost isn't in the reasoning itself. It's in the communication. Agents exchange full text messages, consuming tokens for every coordination step. When agents need to collaborate on complex problems, this overhead adds up.
@Devvrit_Khatri
Devvrit
1 month
Wish to build scaling laws for RL but not sure how to scale? Or what scales? Or whether RL would even scale predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs
@Jiaru_Zou
Jiaru "Rubin" Zou @ NeurIPS
2 months
🚨Releasing RAG over Tables: Hierarchical Memory Index, Multi-Stage Retrieval, and Benchmarking. 🚀 We push RAG to Multi-tables! 🌐Code: https://t.co/kppOT6uXBC 📄Paper: https://t.co/KclevgAf6z
@MingZhong_
Ming Zhong
2 months
Vibe coding with an LLM, but the final vibe is off? 🤔 We analyze why models fail the "vibe check" and what truly matters to users. Key insight: human preference 🧑‍💻 ≈ functional correctness ✅ + instruction following 🎯 Check out our paper: https://t.co/s5gGME5O9I
@Jiaru_Zou
Jiaru "Rubin" Zou @ NeurIPS
2 months
🚀 Introducing #TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning! ⚙️ We introduce the first PRM that explicitly leverages tools during its reasoning process for robust step-verification. 📄 Paper: https://t.co/PnGbmvIJzw [1/n]
@GaotangLi
Gaotang Li
2 months
Thanks for posting our work!
@rohanpaul_ai
Rohan Paul
2 months
Fine-tuning should not always use log-loss; the right loss depends on model strength. On math tasks with strong priors, dropping low-confidence tokens boosts accuracy by up to 16%. The big takeaway is a clear rule for choosing losses by capability. It shows when to trust the model's priors.
@Elenal3ai
Xiaoyan Baiโœˆ๏ธNeurips2025
2 months
🚨Why can't Transformers learn multiplication? 🧮 Even with billions of params, models struggle with multi-digit multiplication. In our new work, we reverse-engineer two models: a standard fine-tuned (SFT) model and an implicit chain-of-thought (ICoT) model to see why. Read on! 1/n 🧵
@yuz9yuz
Yu Zhang @ NeurIPS 25 (didn't review for ICLR 26)
2 months
"Can submission authors rely on online discussions of review scores to estimate their percentile?" A recent study led by my student Hangxiao Zhu @FlyPig23, in collaboration with Prof. Yian Yin @yian_yin from Cornell, gives a clear ๐๐„๐†๐€๐“๐ˆ๐•๐„ answer! (1/n)
@GaotangLi
Gaotang Li
2 months
A big shout-out to our amazing collaborators! ❤ @Ruizhong_Qiu @xiusi_chen @hengjinlp @hanghangtong (13/n)
@GaotangLi
Gaotang Li
2 months
Key takeaway: Match your objective to model capability. The right objective depends on where your task lies on the model-capability continuum: • Model-Strong (extensive priors) → prior-leaning objectives (-p, (1-p¹⁰)/10, thresholded objectives) • Model-Weak (no priors) → prior-averse objectives (standard NLL)
arxiv.org
Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training...
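The objectives named in this thread can be written down as simple per-token losses. A minimal plain-Python sketch (the -p, (1-p^α)/α, and thresholded forms are from the thread; the 0.2 default threshold here mirrors the 1{p≥0.2} mentioned later, and the convention of minimizing (1-p^α)/α rather than maximizing its negation is an assumption):

```python
import math

def nll(p):
    # Standard negative log-likelihood -log(p): prior-averse.
    return -math.log(p)

def neg_p(p):
    # Linear objective -p: prior-leaning.
    return -p

def power_family(p, alpha=10.0):
    # (1 - p**alpha) / alpha: tends to -log(p) as alpha -> 0,
    # and grows more prior-leaning as alpha increases.
    return (1 - p**alpha) / alpha

def thresholded_nll(p, tau=0.2):
    # Thresholded NLL -log(p) * 1{p >= tau}: drops low-confidence tokens.
    return -math.log(p) if p >= tau else 0.0

# A confident token vs. a token the base model finds unlikely:
for p in (0.9, 0.05):
    print(p, nll(p), neg_p(p), power_family(p), thresholded_nll(p))
```

Note how the prior-leaning losses assign far less weight to the p = 0.05 token than NLL does, which is the mechanism the thread's Model-Strong results rely on.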
@GaotangLi
Gaotang Li
2 months
Beyond empirics, we provide theoretical analysis via a gradient-flow characterization. We establish sufficient conditions for when objectives reverse their relative effectiveness at the Model-Weak and Model-Strong ends. (11/n)
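A back-of-envelope way to see why such a reversal can happen (an illustration of the intuition only, not the paper's actual gradient-flow conditions) is to compare per-token gradients with respect to the target-token probability p:

```latex
\frac{d}{dp}\bigl(-\log p\bigr) = -\frac{1}{p}, \qquad
\frac{d}{dp}\bigl(-p\bigr) = -1, \qquad
\frac{d}{dp}\,\frac{1-p^{\alpha}}{\alpha} = -p^{\alpha-1}.
```

As p → 0 the NLL gradient magnitude 1/p diverges, so its training signal concentrates on tokens the base model finds unlikely; for α > 1, p^{α-1} → 0 there, so prior-leaning objectives instead concentrate on tokens the model already rates as likely. Which concentration helps depends on whether those low-p tokens are noise (Model-Strong) or exactly the content to be learned (Model-Weak).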
@GaotangLi
Gaotang Li
2 months
As a case study, we analyze the parametric family -(1-p^α)/α, varying α from 0 to 10 (a higher α means more prior-leaning), measuring both test performance and likelihood estimation on the training set after fine-tuning. (10/n)
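The α sweep in (10/n) can be checked numerically. A small sketch (plain Python; the toy probabilities are made up, and the sign convention of minimizing (1-p^α)/α is an assumption about the thread's notation):

```python
import math

def family_loss(p, alpha):
    # (1 - p**alpha) / alpha; as alpha -> 0 this tends to -log(p), i.e. NLL.
    return (1 - p**alpha) / alpha

p_low, p_high = 0.05, 0.9  # a prior-poor vs. a prior-rich token (toy values)
for alpha in (1e-6, 1.0, 5.0, 10.0):
    ratio = family_loss(p_low, alpha) / family_loss(p_high, alpha)
    print(f"alpha={alpha:g}: loss(low-p)/loss(high-p) = {ratio:.1f}")
# The ratio shrinks as alpha grows: a higher alpha pays relatively less
# for low-probability tokens, i.e. the objective is more prior-leaning.
```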
@GaotangLi
Gaotang Li
2 months
Model-Strong ablation: We threshold training by quantiles of base-model predicted probabilities, comparing -log(p), -p, log(1-p) (emphasis shifts from low-p to high-p). Key finding: low-probability tokens consistently harm all objectives! (9/n)
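The quantile-threshold ablation in (9/n) can be sketched as follows. Only the three objectives and the idea of masking by base-model probability quantiles come from the tweet; the toy probabilities, the nearest-rank quantile, and the 25% cut are assumptions for illustration:

```python
import math

def quantile(values, q):
    # Simple nearest-rank empirical quantile, to avoid dependencies.
    s = sorted(values)
    return s[min(int(q * len(s)), len(s) - 1)]

# Toy base-model token probabilities on the training data (made up).
base_probs = [0.01, 0.03, 0.10, 0.35, 0.60, 0.80, 0.92, 0.97]

def masked_mean_loss(probs, loss_fn, drop_below_q):
    # Keep only tokens at or above the drop_below_q probability quantile,
    # then average the chosen objective over the kept tokens.
    cut = quantile(probs, drop_below_q)
    kept = [p for p in probs if p >= cut]
    return sum(loss_fn(p) for p in kept) / len(kept)

objectives = {
    "-log(p)":  lambda p: -math.log(p),
    "-p":       lambda p: -p,
    "log(1-p)": lambda p: math.log(1 - p),  # most high-p-leaning of the three
}
for name, fn in objectives.items():
    full = masked_mean_loss(base_probs, fn, 0.0)
    trimmed = masked_mean_loss(base_probs, fn, 0.25)
    print(f"{name}: full={full:.3f}  low-p-trimmed={trimmed:.3f}")
```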
@GaotangLi
Gaotang Li
2 months
Model-Intermediate region (moderate p̄): No clear winner. When models have partial priors, neither prior-leaning nor prior-averse objectives dominate. No universal solution here. (8/n)
@GaotangLi
Gaotang Li
2 months
Model-Weak end (low p̄): The tables turn: NLL dominates. On novel figlet tasks where models start near-random, e.g., Qwen2.5-7B: NLL 82.48 vs -p 10.15. Prior-leaning objectives fail catastrophically. Standard NLL is optimal when "learning from scratch". (7/n)
@GaotangLi
Gaotang Li
2 months
The gains are substantial and consistent. E.g., Qwen2.5-Math-7B: thresholded NLL gets 38.16% avg (vs 22.67% for NLL), a +70% relative gain. Prior-leaning objectives consistently outperform across the standard math benchmarks. (6/n)
@GaotangLi
Gaotang Li
2 months
Model-Strong end (high p̄): Prior-leaning objectives win. We test -p and -log(p)·1{p≥0.2} (thresholding out low-p tokens), both more prior-leaning than standard NLL. Results across LLaMA, DeepSeek, Qwen models ↓ (5/n)
@GaotangLi
Gaotang Li
2 months
How do we quantify model capability? We measure the mean predicted probability of all tokens in the training data before fine-tuning; this reveals how much the base model already knows. The differences are dramatic ↓ (4/n)
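The capability probe in (4/n), the mean probability p̄ that the base model assigns to training tokens, could be sketched like this. The logits below are toy values; in practice they would come from the frozen base model's forward pass, which is an assumption about the setup:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def mean_target_prob(per_token_logits, target_ids):
    # p-bar: average probability the base model assigns to each ground-truth
    # training token, measured before any fine-tuning.
    probs = [softmax(l)[t] for l, t in zip(per_token_logits, target_ids)]
    return sum(probs) / len(probs)

# Toy example: 3 tokens over a 4-word vocabulary (logits are made up).
logits = [[2.0, 0.1, -1.0, 0.3],
          [0.0, 3.0, 0.5, -0.5],
          [1.0, 1.0, 1.0, 1.0]]
targets = [0, 1, 2]
p_bar = mean_target_prob(logits, targets)
print(f"p-bar = {p_bar:.3f}")  # high p-bar => Model-Strong regime
```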
@GaotangLi
Gaotang Li
2 months
Now, what is "model capability"? How much task-relevant knowledge the base model already encodes before fine-tuning: • Model-Strong (MS): strong priors (e.g., 25% of pretraining tokens are math reasoning according to the Llama-3 report) • Model-Intermediate (MI): partial priors • Model-Weak (MW): little to no prior knowledge