Zhepei Wei
@weizhepei
Followers
235
Following
2K
Media
31
Statuses
109
Ph.D. Student @CS_UVA | Research Intern @AIatMeta. Previously @AmazonScience. Research interest: ML/NLP/LLM.
Charlottesville, VA
Joined January 2016
🤔Ever wondered why your post-training methods (SFT/RL) make LLMs reluctant to say “I don't know”? 🤩Introducing TruthRL — a truthfulness-driven RL method that significantly reduces hallucinations while maintaining accuracy and enabling proper abstention! 📃 https://t.co/OXPYb09nz7 🧵[1/n]
2
14
65
🤯Using fully self-synthetic training data (i.e., prompts, responses, preferences) to align LLMs is wild! Check out @shangjian8460’s recent work!
🚀 Excited to share our new preprint: "Aligning Large Language Models via Fully Self-Synthetic Data". A long-term, scalable, and data-efficient self-improving framework for preference optimization from scratch!🤗 📄 https://t.co/JenQ7ojP3R 💻 https://t.co/YyOx3DOI6F
0
1
2
Outcome-only rewards often lead to deficient search behaviors; we find that decoupling the RL training for search and answering leads to smarter search agents! Great work led by @blancokdb!
Most RL methods train search agents on final answers, assuming good search will follow. But do they? 🤔 Our research finds this assumption is flawed❗️ 🤩Meet DeSA (Decoupling Search and Answering) – a 2-stage framework improving both search quality and answer accuracy! 🧵[1/n]
0
0
4
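To make the two-stage decoupling described above more concrete, here is a rough sketch of what such a training loop could look like: stage 1 rewards search quality, stage 2 rewards answer correctness. All names here (search_reward, answer_reward, rl_step) are illustrative stand-ins, not the DeSA implementation.

```python
# Hypothetical sketch of a decoupled two-stage training loop: stage 1 optimizes
# the agent's search behavior with a search-quality reward, stage 2 optimizes
# final answers with an outcome reward. Rewards and update rule are toy stand-ins.

def search_reward(retrieved_docs, gold_docs):
    """Stage-1 signal: reward retrieval quality (here, recall of gold evidence)."""
    hits = len(set(retrieved_docs) & set(gold_docs))
    return hits / max(len(gold_docs), 1)

def answer_reward(prediction, gold_answer):
    """Stage-2 signal: reward final-answer correctness only."""
    return 1.0 if prediction.strip().lower() == gold_answer.strip().lower() else 0.0

def rl_step(policy, reward):
    """Placeholder for one policy update (e.g., a GRPO/PPO step):
    here it just tracks a running mean reward."""
    policy["updates"] += 1
    policy["avg_reward"] += (reward - policy["avg_reward"]) / policy["updates"]

policy = {"updates": 0, "avg_reward": 0.0}

# Stage 1: update on the search-quality reward.
for query, gold_docs in [("who wrote Hamlet", ["doc_shakespeare"])]:
    retrieved = ["doc_shakespeare", "doc_macbeth"]   # rollout of the searcher
    rl_step(policy, search_reward(retrieved, gold_docs))

# Stage 2: switch the reward to final-answer correctness only.
for query, gold_answer in [("who wrote Hamlet", "William Shakespeare")]:
    prediction = "William Shakespeare"               # rollout of the answerer
    rl_step(policy, answer_reward(prediction, gold_answer))
```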
🚀 Introducing #TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning! ⚙️We introduce the first PRM that explicitly leverages tools during its reasoning process for robust step verification. 📄 Paper: https://t.co/PnGbmvIJzw [1/n]
5
4
5
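To make "tool-grounded step verification" concrete, here is a toy sketch of the general idea: a process reward model scores each reasoning step, but first executes the step's claim against the table with a tool. The table, claims, and prm_score function are all made up for illustration and are not TaTToo's actual design.

```python
# Illustrative sketch (not the TaTToo implementation) of tool-grounded step
# verification: each reasoning step's numeric claim is executed against the
# table before the (stand-in) PRM assigns it a score.

table = {"city": ["Paris", "Rome"], "population_m": [2.1, 2.8]}

def tool_check(claim_expr):
    """Ground a step by executing a simple aggregation over the table."""
    col, op = claim_expr                      # e.g., ("population_m", "max")
    values = table[col]
    return max(values) if op == "max" else sum(values)

def prm_score(step, grounded_value):
    """Stand-in for the learned PRM: reward steps whose stated value matches the tool output."""
    return 1.0 if abs(step["stated_value"] - grounded_value) < 1e-6 else 0.0

steps = [
    {"text": "The largest population is 2.8M", "stated_value": 2.8, "claim": ("population_m", "max")},
    {"text": "So the total is 6.0M", "stated_value": 6.0, "claim": ("population_m", "sum")},
]

for step in steps:
    value = tool_check(step["claim"])
    print(step["text"], "->", prm_score(step, value))   # 1.0 for the first step, 0.0 for the second
```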
Presenting our #COLM2025 paper 🥳 LLMs can achieve expert-level proficiency on academic/professional benchmarks - can they be direct alternatives to human expert annotators?🤔 A: LLMs still lag behind human experts! Reasoning offers limited help; parallel agents are promising! 🧵
3
16
89
New @AIatMeta paper trains LLMs to be more truthful by rewarding correct answers and allowing them to say “I do not know” when unsure. Reports 28.9% fewer hallucinations and 21.1% higher truthfulness compared with standard training. The big deal is a simple reward rule that separates “I do not know” from wrong answers.
11
19
144
TruthRL tackles a core LLM challenge: balancing accuracy with honest abstention. Instead of guessing, models learn to say "I don't know" when appropriate, leading to more trustworthy AI. Learn more: huggingface.co
0
1
7
Please read our paper for more interesting findings (e.g., robustness to LLM judges, scalability from 3B to 32B models, etc.) Huge thanks to my amazing collaborators: Rulin Shao @RulinShao, Yu Meng @yumeng0818, Scott Yih @scottyih, Xin Luna Dong, and many others! 🙌 🧵[10/n]
0
0
1
⏩ Beyond outcome reward – incorporating reasoning rewards: outcome-only rewards implicitly improve reasoning ability, while explicitly optimizing reasoning quality requires non-trivial design to balance multiple objectives 🧵[9/n]
1
0
2
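As a toy illustration of why the balancing act above is non-trivial (this is an assumption-laden sketch, not TruthRL's design): a weighted blend of outcome and reasoning-quality rewards can end up preferring a polished but wrong chain over a terse correct one if the weight is mis-set.

```python
# Not TruthRL's design: a toy illustration of the multi-objective trade-off when
# adding an explicit reasoning reward. `reasoning_quality` is a hypothetical
# auxiliary score (e.g., from a step verifier); `alpha` controls how much it
# competes with the outcome reward.

def combined_reward(outcome_reward, reasoning_quality, alpha=0.2):
    """Blend outcome and reasoning signals; alpha too high can reward
    well-written but wrong chains, alpha too low makes the term irrelevant."""
    return (1 - alpha) * outcome_reward + alpha * reasoning_quality

# With alpha set too aggressively, a wrong answer with a polished-looking chain
# outscores a terse correct one:
print(combined_reward(outcome_reward=0.0, reasoning_quality=1.0, alpha=0.6))  # 0.6
print(combined_reward(outcome_reward=1.0, reasoning_quality=0.2, alpha=0.6))  # 0.52
```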
Online RL vs. Offline and Semi-Online RL: 📵Purely offline RL via DPO leads to limited gains 🔄Semi-online training through iterative DPO provides some remedy, but the performance is inconsistent 🌐TruthRL with online RL (GRPO) consistently achieves the best results 🧵[8/n]
1
0
2
📊After training with TruthRL, LLMs become more confident in giving correct answers and abstaining, while the hallucination rate is significantly lower 🧵[7/n]
1
0
2
🔎Ablation study: Binary reward design excels in accuracy but is limited in truthfulness, while ternary reward achieves the best truthfulness score with strong accuracy 🧵[6/n]
1
0
2
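For concreteness, a minimal sketch of the two reward designs being ablated above, with +1/0/-1 as illustrative values (the exact constants are my assumption, not quoted from the paper):

```python
def binary_reward(outcome: str) -> float:
    # Correct vs. not-correct: abstaining is punished as hard as hallucinating,
    # so the model is pushed to always guess.
    return 1.0 if outcome == "correct" else -1.0

def ternary_reward(outcome: str) -> float:
    # Adds a neutral class for abstention, so "I don't know" is strictly better
    # than a fabricated answer but strictly worse than a correct one.
    return {"correct": 1.0, "abstain": 0.0, "hallucination": -1.0}[outcome]

for outcome in ("correct", "abstain", "hallucination"):
    print(outcome, binary_reward(outcome), ternary_reward(outcome))
```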
TruthRL improves LLMs in recognizing their knowledge boundaries: ✅TruthRL enables LLMs to abstain primarily when they genuinely lack knowledge ✅TruthRL is robust to hallucination-baiting questions where candidate answers are provided in the input 🧵[5/n]
1
0
1
🤯Vanilla SFT/RL increases both accuracy and hallucination, ultimately compromising truthfulness 🏆TruthRL achieves the lowest hallucination and highest truthfulness, while maintaining high accuracy. Evaluation metric: Truthfulness (T) = Accuracy (A) - Hallucination (H) 🧵[4/n]
1
0
2
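A quick made-up numeric illustration of the truthfulness metric above (T = A - H): a model that abstains when unsure can score higher truthfulness than a model with better raw accuracy that fabricates the rest. The rates below are invented for illustration only.

```python
# Two hypothetical models evaluated on the same questions; each question ends in
# a correct answer, a hallucination, or an abstention (rates sum to 1).

def truthfulness(accuracy, hallucination):
    return accuracy - hallucination

always_guess = {"accuracy": 0.70, "hallucination": 0.30, "abstain": 0.00}
knows_limits = {"accuracy": 0.60, "hallucination": 0.10, "abstain": 0.30}

for name, m in [("always_guess", always_guess), ("knows_limits", knows_limits)]:
    print(name, truthfulness(m["accuracy"], m["hallucination"]))
# always_guess -> 0.40, knows_limits -> 0.50: fewer correct answers, but more truthful.
```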
💡A simple yet effective ternary RL objective: rewards correctness, penalizes hallucination, treats abstention as neutral 🤩Naturally favors abstention over hallucination under GRPO advantage 🚀Up to 28.9% less hallucination and 21.1% more truthfulness across 4 benchmarks 🧵[3/n]
1
0
1
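A minimal sketch of why the neutral abstention reward above is reinforced under group-relative (GRPO-style) advantages when most rollouts hallucinate; the normalization below is the standard group mean/std baseline, and the +1/0/-1 rewards are illustrative, not constants quoted from the paper.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantage: reward minus group mean, over group std."""
    mean, std = statistics.mean(rewards), statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# A group of 8 rollouts for one hard question: 1 correct, 2 abstentions, 5 hallucinations.
rewards = [1.0, 0.0, 0.0, -1.0, -1.0, -1.0, -1.0, -1.0]
for r, a in zip(rewards, group_advantages(rewards)):
    print(f"reward={r:+.0f}  advantage={a:+.2f}")
# Abstentions (reward 0) sit above the group mean (-0.5), so they get positive
# advantage relative to hallucinations, while the correct answer is reinforced most.
```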
🎯 Factual accuracy alone does not necessarily guarantee truthfulness — A model that answers fewer questions correctly while reliably abstaining when uncertain is FAR MORE trustworthy than a higher-accuracy model that frequently fabricates plausible but incorrect answers 🧵[2/n]
1
0
2
OpenAI released a new paper, "Why language models hallucinate." Simple answer: LLMs hallucinate because training and evaluation reward guessing instead of admitting uncertainty. The paper puts this on a statistical footing with simple, test-like incentives that reward confident guessing.
97
340
2K
🔮 Introducing Prophet Arena — the AI benchmark for general predictive intelligence. That is, can AI truly predict the future by connecting today’s dots? 👉 What makes it special? - It can’t be hacked. Most benchmarks saturate over time, but here models face live, unseen events.
90
144
1K
Thrilled to share this exciting work, R-Zero, from my student @ChengsongH31219, where an LLM learns to reason from zero human-curated data! The framework includes co-evolution of a "Challenger" to propose difficult tasks and a "Solver" to solve them. Check out more details in the thread below!
🚀🚀Excited to share our paper R-Zero: Self-Evolving Reasoning LLM from Zero Data! How do you train an LLM without data? R-Zero teaches Large Language Models to reason starting with nothing but a base model. No data required!!! Paper: https://t.co/z4tCJFTXUG Code:
1
4
23
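A toy dynamics sketch of the Challenger/Solver co-evolution idea above (not the R-Zero code; the "aim for roughly 50% solve rate" difficulty signal is my guess at an illustrative driver, not the paper's objective):

```python
import random

random.seed(0)
solver_skill = 0.2           # stand-in for the Solver's current ability
challenger_difficulty = 0.1  # stand-in for how hard the Challenger's tasks are

for round_idx in range(5):
    # Challenger proposes a batch of tasks; harder tasks are solved less often.
    p_solve = min(max(solver_skill - challenger_difficulty + 0.5, 0.0), 1.0)
    solve_rate = sum(random.random() < p_solve for _ in range(100)) / 100
    # Solver update: training on its self-generated tasks nudges skill upward.
    solver_skill += 0.05 * solve_rate
    # Challenger update: push difficulty toward the Solver's frontier (~50% solve rate).
    challenger_difficulty += 0.05 * (solve_rate - 0.5)
    print(f"round {round_idx}: solve_rate={solve_rate:.2f} "
          f"solver_skill={solver_skill:.2f} difficulty={challenger_difficulty:.2f}")
```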