Tiancheng Hu

@tiancheng_hu

1K Followers · 464 Following · 45 Media · 261 Statuses

PhD student @CambridgeLTL @Cambridge_Uni. @Apple Scholar, @Gates_Cambridge Scholar. Previously @MSP_UTD @UT_Dallas @ETH_en @EPFL_en. Interested in NLP and CSS

Joined July 2021
@tiancheng_hu
Tiancheng Hu
7 hours
Very cool work!
@_maiush
Sharan
19 hours
AI that is “forced to be good” v “genuinely good”. Should we care about the difference? (yes!)
We’re releasing the first open implementation of character training. We shape the persona of AI assistants in a more robust way than alternatives like prompting or activation steering.
@tiancheng_hu
Tiancheng Hu
6 days
Great fun working on this with @bminixhofer and @nigelhcollier at @CambridgeLTL @Cambridge_Uni. Special thanks to Paul Martin and @arcee_ai's Mergekit library. Could be of interest to @xingyudang @_christinabaek @AdtRaghunathan @daphneipp @NickATomlin @realJessyLin
@tiancheng_hu
Tiancheng Hu
6 days
TL;DR: The alignment-calibration trade-off is real, but you don't have to be stuck with the endpoints. Model merging provides a simple, powerful dial to find the perfect balance of capability and reliability for YOUR application. Paper here: https://t.co/kXDeKHCj4u (8/8)
arxiv.org (link card): The "alignment tax" of post-training is typically framed as a drop in task accuracy. We show it also involves a severe loss of calibration, making models overconfident, less reliable, and model...
@tiancheng_hu
Tiancheng Hu
6 days
Better calibration has benefits beyond accuracy scores. It helps reduce "mode collapse" in generation tasks, leading to more diverse generations (and higher utility too), as measured on NoveltyBench. It improves model performance on group-level simulation tasks too! (7/8)
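As a rough illustration of measuring mode collapse, one generic option is a distinct-n ratio over repeated samples for the same prompt. This is an assumed stand-in for the idea, not NoveltyBench's actual scoring:

```python
# Distinct-n: fraction of unique n-grams across repeated generations.
# A collapsed model repeats itself, so its ratio is low.
def distinct_n(generations, n=2):
    total, distinct = 0, set()
    for text in generations:
        tokens = text.split()
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        distinct.update(ngrams)
    return len(distinct) / max(total, 1)

print(distinct_n(["the cat sat", "the cat sat", "the cat sat"]))   # low diversity
print(distinct_n(["the cat sat", "a dog ran", "birds fly south"])) # high diversity
```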
@tiancheng_hu
Tiancheng Hu
6 days
And it gets better with scale! 📈 The benefits of merging, both the accuracy boost and the stability of the "sweet spot", become even more pronounced in larger, more capable models. This echoes prior work showing that merging is more effective and stable for bigger models. (6/8)
@tiancheng_hu
Tiancheng Hu
6 days
The Pareto-superior frontier is a general phenomenon: across model families (Gemma, Qwen), sizes, and datasets, we consistently find a better-balanced model. We show Qwen 2.5 results on BBH and MMLU-Pro below. (5/8)
@tiancheng_hu
Tiancheng Hu
6 days
It's NOT a zero-sum game between base and instruct. We find a "sweet spot" merge that is Pareto-superior: it has HIGHER accuracy than both parents while substantially restoring the calibration lost during alignment. (4/8)
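As a sketch of how such a sweet spot can be found: sweep the merge weight and keep any merge that beats both parents on accuracy while improving calibration. The accuracy/ECE curves below are toy stand-ins, not the paper's measurements; in practice each alpha-merge would be built and evaluated for real:

```python
# Toy sweet-spot search. The two curves are made up for illustration:
# accuracy peaks between the parents, calibration error grows toward instruct.
import numpy as np

def toy_accuracy(alpha):
    return 0.60 + 0.20 * alpha - 0.14 * alpha ** 2  # illustrative only

def toy_ece(alpha):
    return 0.05 + 0.15 * alpha                      # illustrative only

acc_base, ece_base = toy_accuracy(0.0), toy_ece(0.0)  # well-calibrated base
acc_inst, ece_inst = toy_accuracy(1.0), toy_ece(1.0)  # capable, overconfident instruct

for alpha in np.linspace(0.1, 0.9, 9):
    acc, ece = toy_accuracy(alpha), toy_ece(alpha)
    # Pareto-superior: more accurate than BOTH parents AND better
    # calibrated than the instruct parent.
    if acc > max(acc_base, acc_inst) and ece < ece_inst:
        print(f"sweet spot candidate: alpha={alpha:.1f} acc={acc:.3f} ece={ece:.3f}")
```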
@tiancheng_hu
Tiancheng Hu
6 days
Our solution is simple and computationally cheap: model merging. By interpolating between the well-calibrated base model and its capable but overconfident instruct counterpart, we create a continuous spectrum to navigate this trade-off. No retraining needed. (3/8)
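For the mechanics, here is a minimal sketch of that interpolation in plain PyTorch. The checkpoint names and alpha value are illustrative assumptions, not the paper's exact setup (the thread credits @arcee_ai's Mergekit for the actual merging):

```python
# Minimal sketch: linear interpolation between base and instruct weights.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
instruct = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
merged = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

alpha = 0.5  # the "dial": 0.0 = pure base, 1.0 = pure instruct

base_sd, inst_sd = base.state_dict(), instruct.state_dict()
with torch.no_grad():
    for name, param in merged.state_dict().items():
        # Every parameter becomes (1 - alpha) * base + alpha * instruct.
        param.copy_((1 - alpha) * base_sd[name] + alpha * inst_sd[name])

merged.save_pretrained("qwen2.5-0.5b-merged-alpha0.5")
```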
@tiancheng_hu
Tiancheng Hu
6 days
Let's start by redefining the problem. We argue the "alignment tax" MUST include the severe loss of model calibration. Instruction tuning doesn't just nudge performance; it wrecks calibration, causing a huge spike in overconfidence. (2/8)
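For concreteness, overconfidence of this kind is commonly quantified with expected calibration error (ECE): the gap between average confidence and accuracy, weighted over confidence bins. A minimal sketch with illustrative binning and inputs (the paper's exact metric may differ):

```python
# Expected calibration error: 0 for a perfectly calibrated model, large
# when stated confidence and actual accuracy diverge.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: model's probability for its prediction, per example;
    correct: 1 if that prediction was right, else 0."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # |avg confidence - accuracy| in the bin, weighted by bin mass.
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Overconfident: ~97% stated confidence but only 50% accuracy -> ECE ~ 0.48.
print(expected_calibration_error([0.99, 0.98, 0.97, 0.96], [1, 0, 1, 0]))
```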
@tiancheng_hu
Tiancheng Hu
6 days
Instruction tuning unlocks incredible skills in LLMs, but at a cost: they become dangerously overconfident. You face a choice: a well-calibrated base model or a capable but unreliable instruct model. What if you didn't have to choose? What if you could navigate the trade-off?
@tiancheng_hu
Tiancheng Hu
6 days
The most rigorous and user-centric eval on the topic of political bias - I highly recommend that anyone interested in this topic check out Paul's paper!
@paul_rottger
Paul Röttger @ EMNLP
7 days
There’s plenty of evidence for political bias in LLMs, but very few evals reflect realistic LLM use cases — which is where bias actually matters. IssueBench, our attempt to fix this, is accepted at TACL, and I will be at #EMNLP2025 next week to talk about it! New results 🧵
@tiancheng_hu
Tiancheng Hu
8 days
Huge thanks to my amazing collaborators @joabaum, @lorelupo, @nigelhcollier, @dirk_hovy, and especially @paul_rottger @CambridgeLTL @Cambridge_Uni. Work partially done during my visit to @MilaNLProc @Unibocconi. Highly recommended!
@tiancheng_hu
Tiancheng Hu
8 days
Check out the paper and data for details!
Paper: https://t.co/FKdsJS9kug
Data: https://t.co/VA49waXnvb
Website: https://t.co/EDsCgfqiD9
(9/9)
@tiancheng_hu
Tiancheng Hu
8 days
Overall, by making progress measurable, SimBench provides the foundation to build more faithful LLM simulators. Moving forward, we should work on better training strategies for improving LLM social simulators. These will most likely diverge from advances in chat / coding models.
@tiancheng_hu
Tiancheng Hu
8 days
This brings us back to our earlier question: What makes a good simulator? We find simulation ability correlates most strongly with deep, knowledge-intensive general reasoning (MMLU-Pro, r=0.94), rather than chat (Arena Elo, r=0.71) or competition math (AIME, r=0.48). To simulate …
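Those r values are plain Pearson correlations across models; a minimal sketch with hypothetical per-model scores:

```python
# Pearson r between a benchmark and simulation ability, one point per model.
# All numbers below are hypothetical, for illustration only.
from scipy.stats import pearsonr

simbench = [22.0, 28.5, 33.1, 36.4, 40.8]  # hypothetical simulation scores
mmlu_pro = [35.0, 48.0, 55.0, 61.0, 68.0]  # hypothetical MMLU-Pro scores

r, p = pearsonr(mmlu_pro, simbench)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```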
@tiancheng_hu
Tiancheng Hu
8 days
Why does this happen? We dug deeper and found two opposing forces:
✅ a helpful direct effect (+6.46 score): models get much better at following instructions
❌ a harmful indirect effect (-1.74 score): models become less diverse
The challenge: how do we get the good without the …
@tiancheng_hu
Tiancheng Hu
8 days
There’s also an alignment-simulation tradeoff: Instruction-tuning (the process that makes LLMs helpful and safe) improves their ability to predict consensus opinions. BUT, it actively harms their ability to predict diverse, pluralistic opinions where humans disagree. This echoes …
@tiancheng_hu
Tiancheng Hu
8 days
We found a clear log-linear scaling trend. Across the model families we could test, bigger models are consistently better simulators. Performance reliably increases with model size. This suggests that future, larger models hold the potential to become highly accurate simulators.
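A log-linear trend means score rises linearly in log(model size); a minimal fitting sketch with hypothetical sizes and scores:

```python
# Fit score ~ a + b * log10(params). Data points are hypothetical,
# for illustration only.
import numpy as np

params = np.array([1e9, 7e9, 27e9, 70e9])    # model sizes (hypothetical)
scores = np.array([18.0, 26.0, 31.5, 35.0])  # simulation scores (hypothetical)

b, a = np.polyfit(np.log10(params), scores, deg=1)  # slope, intercept
print(f"score ≈ {a:.1f} + {b:.1f} * log10(params)")
# Extrapolate (cautiously!) to a larger model:
print(f"predicted score at 400B params: {a + b * np.log10(4e11):.1f}")
```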
@tiancheng_hu
Tiancheng Hu
8 days
The best model we tested at release, Claude 3.7 Sonnet, scores just 40.8 out of 100. A lot of room for improvement for LLM social simulators! Interestingly, more test-time compute doesn’t help. This suggests that simulation requires a different type of reasoning than math / …