Gaotang Li
@GaotangLi
Followers
143
Following
141
Media
11
Statuses
77
Ph.D. @UofIllinois | Undergrad @UMich. Language Models.
Joined November 2024
Negative Log-Likelihood (NLL) has long been the go-to objective for classification and SFT, but is it universally optimal? We explore when alternative objectives outperform NLL and when they don't, based on two key factors: the objective's prior-leaningness and the model's capability.
3
22
125
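The contrast this thread draws between NLL and a prior-leaning objective such as -p can be sketched numerically. A minimal NumPy illustration, assuming per-token probabilities p from the base model (the function names are mine, not the paper's):

```python
import numpy as np

def nll(p):
    """Standard negative log-likelihood, -log(p).
    The loss (and its gradient) blows up as p -> 0, so tokens the model
    finds improbable dominate training."""
    return -np.log(p)

def neg_p(p):
    """A prior-leaning objective, -p.
    Bounded loss with a constant-magnitude gradient, so low-probability
    (prior-contradicting) tokens exert no outsized pull."""
    return -p

p = np.array([0.01, 0.5, 0.99])  # low / medium / high base-model confidence
print(nll(p))    # the p=0.01 token contributes ~4.6, dwarfing the others
print(neg_p(p))  # every token contributes comparably
```

Under NLL the rare token dominates the batch gradient; under -p it does not, which is the sense in which -p "leans on" the model's priors.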
New Stanford + Princeton + Illinois Univ paper shows language model agents can collaborate through hidden vectors instead of text, giving better answers with less compute. In benchmarks they reach around 4x faster inference with roughly 70% to 80% fewer output tokens than strong
20
61
346
Multi-agent systems are powerful but expensive. However, the cost isn't in the reasoning itself. It's in the communication. Agents exchange full text messages, consuming tokens for every coordination step. When agents need to collaborate on complex problems, this overhead adds up.
17
139
666
Want to build scaling laws for RL but not sure how to scale? Or what scales? Or whether RL even scales predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs
9
103
552
Releasing RAG over Tables: Hierarchical Memory Index, Multi-Stage Retrieval, and Benchmarking. We push RAG to multi-tables! Code: https://t.co/kppOT6uXBC Paper: https://t.co/KclevgAf6z
1
4
8
Vibe coding with an LLM, but the final vibe is off? We analyze why models fail the "vibe check" and what truly matters to users. Key insight: human preference ≈ functional correctness + instruction following. Check out our paper: https://t.co/s5gGME5O9I
2
17
70
Introducing #TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning! We introduce the first PRM that explicitly leverages tools during its reasoning process for robust step-verification. Paper: https://t.co/PnGbmvIJzw [1/n]
5
4
5
Thanks for posting our work!
Fine-tuning should not always use log-loss; the right loss depends on model strength. On math tasks with strong priors, dropping low-confidence tokens boosts accuracy by up to 16%. The big takeaway is a clear rule for choosing losses by capability: it shows when to trust the model's own priors.
0
0
3
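The "drop low-confidence tokens" idea above amounts to a masked loss. A NumPy sketch, assuming per-token base-model probabilities and the 0.2 threshold mentioned later in the thread (the function name is mine):

```python
import numpy as np

def thresholded_nll(p, tau=0.2):
    """NLL with low-confidence tokens dropped: -log(p) * 1{p >= tau}.
    Tokens the base model assigns probability below tau contribute zero
    loss (and zero gradient), so its strong priors at those positions
    are left untouched by fine-tuning."""
    return np.where(p >= tau, -np.log(p), 0.0)

p = np.array([0.05, 0.3, 0.9])
print(thresholded_nll(p))  # the p=0.05 token is masked to 0
```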
Why can't Transformers learn multiplication? Even with billions of params, models struggle with multi-digit multiplication. In our new work, we reverse-engineer two models, a standard fine-tuned (SFT) model and an implicit chain-of-thought (ICoT) model, to see why. Read on! 1/n
12
109
683
A big shout-out to our amazing collaborators! @Ruizhong_Qiu @xiusi_chen @hengjinlp @hanghangtong (13/n)
0
0
1
Key takeaway: Match your objective to model capability. The right objective depends on where your task lies on the model-capability continuum: • Model-Strong (extensive priors) → prior-leaning objectives (-p, (1-p¹⁰)/10, thresholded objectives) • Model-Weak (no priors) → standard NLL
arxiv.org
Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training...
0
0
2
Beyond empirics, we provide theoretical analysis via a gradient-flow characterization. We establish sufficient conditions for when objectives reverse their relative effectiveness at the Model-Weak and Model-Strong ends. (11/n)
1
0
1
As a case study, we analyze the parametric family -(1-p^α)/α, varying α from 0 to 10 (a higher α means more prior-leaning), measuring both test performance and likelihood estimation on the training set after fine-tuning. (10/n)
1
0
0
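The family in the tweet above interpolates between NLL and flatter, prior-leaning losses. A small numeric check, assuming the objective is to be maximized (as α → 0 it recovers log p, i.e., standard likelihood):

```python
import numpy as np

def family(p, alpha):
    """Parametric objective -(1 - p**alpha)/alpha.
    alpha -> 0 recovers log(p) (the NLL direction); alpha = 1 gives
    p - 1, whose gradient matches -p; large alpha is nearly flat for
    low p, so prior-contradicting tokens barely register."""
    return -(1.0 - p**alpha) / alpha

p = 0.3
print(family(p, 1e-8))  # ~ log(0.3) ≈ -1.204
print(family(p, 1.0))   # p - 1 = -0.7
print(family(p, 10.0))  # ≈ -0.1, almost independent of p at low p
```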
Model-Strong ablation: We threshold training by quantiles of base-model predicted probabilities, comparing -log(p), -p, log(1-p) (emphasis shifts low-p → high-p). Key finding: Low-probability tokens consistently harm all objectives! (9/n)
1
0
0
Model-Intermediate region (moderate p̄): No clear winner. When models have partial priors, neither prior-leaning nor prior-averse objectives dominate. No universal solution here. (8/n)
1
0
0
Model-Weak end (low p̄): The tables turn: NLL dominates. On novel figlet tasks where models start near-random, e.g., Qwen2.5-7B: NLL 82.48 vs -p 10.15. Prior-leaning objectives fail catastrophically. Standard NLL is optimal when "learning from scratch". (7/n)
1
0
1
The gains are substantial and consistent. E.g., Qwen2.5-Math-7B: Thresholded NLL gets 38.16% avg (vs 22.67% for NLL), a +70% relative gain. Prior-leaning objectives consistently outperform across the standard math benchmarks. (6/n)
1
0
1
Model-Strong end (high p̄): Prior-leaning objectives win. We test -p and -log(p)·1{p≥0.2} (thresholding out low-p tokens), both more prior-leaning than standard NLL. Results across LLaMA, DeepSeek, and Qwen models. (5/n)
1
0
2
How do we quantify model capability? We measure the mean predicted probability of all tokens in the training data before fine-tuning; this reveals how much the base model already knows. The differences are dramatic. (4/n)
1
0
2
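The capability probe described above is just a mean over base-model token probabilities. A toy sketch with made-up numbers (the probabilities and task names are illustrative, not from the paper):

```python
import numpy as np

def capability_score(token_probs):
    """Mean probability the base model assigns to the training tokens
    before fine-tuning: high means the model already 'knows' the task
    (Model-Strong), near-random means it must learn from scratch
    (Model-Weak)."""
    return float(np.mean(token_probs))

# Hypothetical per-token probabilities for two tasks:
math_probs   = np.array([0.8, 0.9, 0.6, 0.7])      # strong priors
figlet_probs = np.array([0.02, 0.01, 0.05, 0.03])  # near-random
print(capability_score(math_probs))    # ≈ 0.75 -> Model-Strong end
print(capability_score(figlet_probs))  # ≈ 0.03 -> Model-Weak end
```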
Now, what is "model capability"? How much task-relevant knowledge the base model already encodes before fine-tuning: • Model-Strong (MS): Strong priors (e.g., 25% of pretraining tokens are math reasoning, according to the Llama-3 report) • Model-Intermediate (MI): Partial priors • Model-Weak (MW): No priors
1
0
3