Tiancheng Hu
@tiancheng_hu
Followers
1K
Following
464
Media
45
Statuses
261
PhD student @CambridgeLTL @Cambridge_Uni. @Apple Scholar, @Gates_Cambridge Scholar. Previously @MSP_UTD @UT_Dallas @ETH_en @EPFL_en. Interested in NLP and CSS
Joined July 2021
Very cool work!
AI that is “forced to be good” vs. “genuinely good”: should we care about the difference? (Yes!) We’re releasing the first open implementation of character training. We shape the persona of AI assistants more robustly than alternatives like prompting or activation steering.
Great fun working on this with @bminixhofer and @nigelhcollier at @CambridgeLTL @Cambridge_Uni. Special thanks to Paul Martin, and @arcee_ai's Mergekit library. Could be of interest to @xingyudang @_christinabaek
@AdtRaghunathan @daphneipp @NickATomlin @realJessyLin
TL;DR: The alignment-calibration trade-off is real, but you don't have to be stuck with the endpoints. Model merging provides a simple, powerful dial to find the perfect balance of capability and reliability for YOUR application. Paper here: https://t.co/kXDeKHCj4u (8/8)
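Not from the paper, just a sketch of what that "dial" could look like in practice: sweep the base↔instruct interpolation weight and pick the merge that best trades off accuracy against calibration on a held-out set. Here `merge_models`, `evaluate_accuracy`, and `evaluate_ece` are hypothetical placeholders for your own merging and evaluation code.

```python
# Minimal sketch (not the paper's code): treat the interpolation weight alpha as a dial
# (0.0 = pure base model, 1.0 = pure instruct model) and pick the value that best
# balances accuracy and calibration on a held-out dev set.

def pick_merge_weight(base_model, instruct_model, dev_set, alphas=None):
    alphas = alphas or [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    results = []
    for alpha in alphas:
        merged = merge_models(base_model, instruct_model, alpha)  # hypothetical helper
        acc = evaluate_accuracy(merged, dev_set)                  # hypothetical helper
        ece = evaluate_ece(merged, dev_set)                       # hypothetical helper
        results.append((alpha, acc, ece))
    # One simple selection rule: maximize accuracy minus a calibration penalty.
    best_alpha, _, _ = max(results, key=lambda r: r[1] - r[2])
    return best_alpha, results
```

The selection rule is just one choice; any application-specific weighting of accuracy vs. calibration works the same way.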
Better calibration has benefits beyond accuracy scores. It helps reduce "mode collapse" in generation tasks, leading to more diverse generations (and higher utility too), as measured on NoveltyBench. It improves model performance on group-level simulation tasks too! (7/8)
And it gets better with scale! 📈 The benefits of merging, both the accuracy boost and the stability of the "sweet spot", become even more pronounced in larger, more capable models. This echoes prior work showing that merging bigger models is more effective and stable. (6/8)
The Pareto-superior frontier is a general phenomenon: across model families (Gemma, Qwen), model sizes, and datasets, we consistently find a better-balanced merged model. We show Qwen 2.5 results on BBH and MMLU-Pro below. (5/8)
It's NOT a zero-sum game between base and instruct. We find a "sweet spot" merge that is Pareto-superior: it has HIGHER accuracy than both parents while substantially restoring the calibration lost during alignment. (4/8)
Our solution is simple and computationally cheap: model merging. By interpolating between the well-calibrated base model and its capable but overconfident instruct counterpart, we create a continuous spectrum to navigate this trade-off. No retraining needed. (3/8)
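For intuition, a minimal Python sketch of this kind of linear weight interpolation, assuming Hugging Face checkpoints that share an architecture (the Qwen model names below are just examples, not necessarily the paper's setup):

```python
# Minimal sketch: linearly interpolate every parameter between a base model
# and its instruct-tuned counterpart, then save the merged checkpoint.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", torch_dtype=torch.bfloat16)
inst = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16)

alpha = 0.5  # 0.0 = pure base, 1.0 = pure instruct
merged_state = {}
with torch.no_grad():
    inst_state = inst.state_dict()
    for name, base_param in base.state_dict().items():
        merged_state[name] = (1 - alpha) * base_param + alpha * inst_state[name]

base.load_state_dict(merged_state)   # reuse the base model object as the container
base.save_pretrained("qwen2.5-7b-merged-alpha0.5")
```

Varying alpha traces out the continuous base↔instruct spectrum described above, with no gradient updates involved.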
Let's start by redefining the problem. We argue the "alignment tax" MUST include the severe loss of model calibration. Instruction tuning doesn't just nudge performance; it wrecks calibration, causing a huge spike in overconfidence. (2/8)
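To make "calibration" concrete, here is a minimal sketch of expected calibration error (ECE), a standard way to quantify overconfidence; the confidences and labels below are made up for illustration.

```python
# Minimal sketch: expected calibration error (ECE) over equal-width confidence bins.
# `confidences` are the model's probabilities for its predicted answers,
# `correct` marks whether each prediction was right; the example values are illustrative.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of samples
    return ece

# An overconfident model: high confidence, mediocre accuracy -> large ECE.
print(expected_calibration_error([0.95, 0.9, 0.99, 0.85], [1, 0, 1, 0]))
```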
Instruction tuning unlocks incredible skills in LLMs, but at a cost: they become dangerously overconfident. You face a choice: a well-calibrated base model or a capable but unreliable instruct model. What if you didn't have to choose? What if you could navigate the trade-off?
The most rigorous and user-centric eval on the topic of political bias - I highly recommend that anyone interested in this topic check out Paul's paper!
There’s plenty of evidence for political bias in LLMs, but very few evals reflect realistic LLM use cases — which is where bias actually matters. IssueBench, our attempt to fix this, is accepted at TACL, and I will be at #EMNLP2025 next week to talk about it! New results 🧵
Huge thanks to my amazing collaborators @joabaum, @lorelupo, @nigelhcollier, @dirk_hovy, and especially @paul_rottger
@CambridgeLTL @Cambridge_Uni Work partially done during my visit to @MilaNLProc @Unibocconi. Highly recommended!
Check out the paper and data for details! Paper: https://t.co/FKdsJS9kug Data: https://t.co/VA49waXnvb Website: https://t.co/EDsCgfqiD9 (9/9)
Overall, by making progress measurable, SimBench provides the foundation to build more faithful LLM simulators. Moving forward, we should work on better training strategies for improving LLM social simulators. These will most likely diverge from advances in chat / coding models.
This brings us back to our earlier question: what makes a good simulator? We find simulation ability correlates most strongly with deep, knowledge-intensive general reasoning (MMLU-Pro, r=0.94), rather than chat (Arena Elo, r=0.71) or competition math (AIME, r=0.48). To simulate…
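Purely illustrative (the numbers below are made up, not the paper's data), a sketch of the kind of correlation analysis behind r values like these:

```python
# Minimal sketch: correlate models' simulation scores with their scores on other
# benchmarks to see which capability tracks simulation ability most closely.
# All values are hypothetical placeholders for per-model benchmark results.
from scipy.stats import pearsonr

sim_scores = [22.1, 28.4, 31.0, 35.7, 40.8]   # hypothetical simulation-benchmark scores
mmlu_pro   = [38.0, 47.5, 52.0, 60.2, 68.9]   # hypothetical MMLU-Pro scores
arena_elo  = [1050, 1120, 1180, 1210, 1230]   # hypothetical Arena Elo ratings

r_mmlu, _ = pearsonr(sim_scores, mmlu_pro)
r_elo, _ = pearsonr(sim_scores, arena_elo)
print(f"MMLU-Pro vs simulation: r = {r_mmlu:.2f}")
print(f"Arena Elo vs simulation: r = {r_elo:.2f}")
```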
Why does this happen? We dug deeper and found two opposing forces:
✅ a helpful direct effect (+6.46 score): models get much better at following instructions
❌ a harmful indirect effect (-1.74 score): models become less diverse
The challenge: how do we get the good without the bad?
There’s also an alignment-simulation tradeoff: instruction-tuning (the process that makes LLMs helpful and safe) improves their ability to predict consensus opinions. BUT, it actively harms their ability to predict diverse, pluralistic opinions where humans disagree. This echoes…
We found a clear log-linear scaling trend. Across the model families we could test, bigger models are consistently better simulators. Performance reliably increases with model size. This suggests that future, larger models hold the potential to become highly accurate simulators.
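A hedged sketch of fitting such a log-linear trend (score against the log of parameter count) with numpy; the model sizes and scores below are made up for illustration, not the paper's results.

```python
# Minimal sketch: fit score ≈ a * log10(parameters) + b across model sizes.
import numpy as np

params_b = np.array([1, 7, 14, 32, 72])               # model sizes in billions (hypothetical)
scores   = np.array([18.0, 27.5, 31.0, 35.2, 39.4])    # hypothetical simulation scores

a, b = np.polyfit(np.log10(params_b * 1e9), scores, deg=1)
print(f"score ≈ {a:.2f} * log10(params) + {b:.2f}")

# Cautious extrapolation to a hypothetical 400B-parameter model:
print(a * np.log10(400e9) + b)
```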
The best model we tested at the time of release, Claude 3.7 Sonnet, scores just 40.8 out of 100. A lot of room for improvement for LLM social simulators! Interestingly, more test-time compute doesn’t help. This suggests that simulation requires a different type of reasoning than math / coding.