Zhepei Wei

@weizhepei

Followers
235
Following
2K
Media
31
Statuses
109

Ph.D. Student @CS_UVA | Research Intern @AIatMeta. Previously @AmazonScience. Research interests: ML/NLP/LLM.

Charlottesville, VA
Joined January 2016
@weizhepei
Zhepei Wei
1 month
🤔Ever wondered why your post-training methods (SFT/RL) make LLMs reluctant to say “I don't know”? 🤩Introducing TruthRL — a truthfulness-driven RL method that significantly reduces hallucinations while maintaining accuracy and proper abstention! 📃 https://t.co/OXPYb09nz7 🧵[1/n]
2
14
65
@weizhepei
Zhepei Wei
1 month
🤯Using fully self-synthetic training data (i.e., prompts, responses, preferences) to align LLMs is wild! Check out @shangjian8460’s recent work!
@shangjian8460
Shangjian Yin
1 month
🚀 Excited to share our new preprint: "Aligning Large Language Models via Fully Self-Synthetic Data". A long-term, scalable, and data-efficient self-improving framework for preference optimization from scratch!🤗 📄 https://t.co/JenQ7ojP3R 💻 https://t.co/YyOx3DOI6F
0
1
2
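As a rough mental model of "fully self-synthetic" alignment, the same policy can be asked to write the prompt, both candidate responses, and the preference label. The sketch below is only my reading of the abstract; generate (a plain text-in/text-out call to the current policy) and the prompt wordings are hypothetical, not the paper's actual pipeline.

    from typing import Callable, List, Tuple

    def self_synthetic_round(
        generate: Callable[[str], str],   # hypothetical text-in/text-out call to the current policy
        num_prompts: int = 8,
    ) -> List[Tuple[str, str, str]]:
        """Build one batch of (prompt, chosen, rejected) triples with no human data.

        The model invents the prompt, samples two candidate responses, and then
        judges which one it prefers; a DPO-style update on the returned triples
        would complete the round (omitted here).
        """
        data = []
        for _ in range(num_prompts):
            prompt = generate("Invent one new user instruction:")   # self-generated prompt
            resp_a = generate(prompt)                                # candidate response A
            resp_b = generate(prompt)                                # candidate response B
            verdict = generate(
                f"Instruction: {prompt}\nA: {resp_a}\nB: {resp_b}\n"
                "Answer with the single letter of the better response."
            )                                                        # self-generated preference label
            chosen, rejected = (resp_a, resp_b) if verdict.strip().startswith("A") else (resp_b, resp_a)
            data.append((prompt, chosen, rejected))
        return data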
@weizhepei
Zhepei Wei
1 month
Outcome-only rewards often lead to deficient search behaviors. We find that decoupling the RL training for search and answering leads to smarter search agents! Great work led by @blancokdb!
@blancokdb
Yiding Wang @EMNLP2025
1 month
Most RL methods train search agents on final answers, assuming good search will follow. But do they? 🤔 Our research finds this assumption is flawed❗️ 🤩Meet DeSA (Decoupling Search and Answering) – a 2-stage framework improving both search quality and answer accuracy! 🧵[1/n]
0
0
4
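The tweet only names the two stages, so treat this as a toy picture of what "decoupling search and answering" could mean, not DeSA's actual training code: stage 1 rewards search quality alone, stage 2 rewards the final answer alone. The trajectory fields below are made up.

    def decoupled_reward(trajectory: dict, stage: int) -> float:
        """Toy two-stage reward assignment (my reading of the decoupling idea)."""
        if stage == 1:
            # Stage 1: score only the search behavior, e.g. the fraction of
            # retrieved documents that actually contain the gold evidence.
            docs = trajectory["retrieved_docs"]
            return sum(doc["contains_gold"] for doc in docs) / max(len(docs), 1)
        # Stage 2: outcome-only reward on the final answer, with search
        # behavior already shaped by stage 1.
        return 1.0 if trajectory["answer"] == trajectory["gold_answer"] else 0.0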
@Jiaru_Zou
Jiaru "Rubin" Zou
1 month
🚀 Introducing #TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning! ⚙️We introduce the first PRM that explicitly leverages tools during its reasoning process for robust step-verification. 📄 Paper: https://t.co/PnGbmvIJzw [1/n]
5
4
5
@ym_tseng
Yu-Min Tseng
1 month
Presenting our #COLM2025 paper 🥳 LLMs can achieve expert-level proficiency in academic/professional benchmarks - can they be direct alternatives to human expert annotators?🤔 A: LLMs still lag behind human experts! Reasoning has limited help; parallel agents are promising! 🧵
3
16
89
@rohanpaul_ai
Rohan Paul
1 month
New @AIatMeta paper trains LLMs to be more truthful by rewarding correct answers and letting them say "I do not know" when unsure. Reports 28.9% fewer hallucinations and 21.1% higher truthfulness compared with standard training. The big deal is a simple reward rule that separates "I do not know" from wrong answers.
11
19
144
@_akhaliq
AK
1 month
TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
2
34
178
@HuggingPapers
DailyPapers
1 month
TruthRL tackles a core LLM challenge: balancing accuracy with honest abstention. Instead of guessing, models learn to say "I don't know" when appropriate, leading to more trustworthy AI. Learn more:
huggingface.co
0
1
7
@weizhepei
Zhepei Wei
1 month
Please read our paper for more interesting findings (e.g., robustness to LLM judges, scalability from 3B to 32B models, etc.) Huge thanks to my amazing collaborators: Rulin Shao @RulinShao, Yu Meng @yumeng0818, Scott Yih @scottyih, Xin Luna Dong, and many others! 🙌 🧵[10/n]
0
0
1
@weizhepei
Zhepei Wei
1 month
⏩ Beyond outcome reward – incorporating reasoning rewards Outcome-only rewards implicitly improve reasoning ability, while explicitly optimizing reasoning quality requires non-trivial design to balance multiple objectives 🧵[9/n]
1
0
2
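Concretely, the "non-trivial design" in [9/n] above is about how to weight an explicit reasoning-quality term against the outcome term. A toy weighted sum (the names and the 0.2 weight are mine, not the paper's):

    def combined_reward(outcome_correct: bool, reasoning_score: float, lam: float = 0.2) -> float:
        """Toy outcome + reasoning reward (illustrative only).

        reasoning_score: e.g. a judge model's 0-1 rating of the reasoning trace.
        lam: the trade-off weight is the hard part; too large and the policy can
        game the judge, too small and the reasoning term does nothing.
        """
        outcome_reward = 1.0 if outcome_correct else -1.0
        return outcome_reward + lam * reasoning_score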
@weizhepei
Zhepei Wei
1 month
Online RL vs. Offline and Semi-Online RL: 📵Purely offline RL via DPO leads to limited gains 🔄Semi-online training through iterative DPO provides some remedy, but the performance is inconsistent 🌐TruthRL with online RL (GRPO) consistently achieves the best results 🧵[8/n]
1
0
2
@weizhepei
Zhepei Wei
1 month
📊After training with TruthRL, LLMs become more confident in giving correct answers and abstaining, while the hallucination rate is significantly lower 🧵[7/n]
1
0
2
@weizhepei
Zhepei Wei
1 month
🔎Ablation study: Binary reward design excels in accuracy but is limited in truthfulness, while ternary reward achieves the best truthfulness score with strong accuracy 🧵[6/n]
1
0
2
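For reference, a plausible reading of the binary-vs-ternary ablation in [6/n] above; the exact scoring is my paraphrase, and treating abstention like a wrong answer under the binary reward is an assumption.

    def binary_reward(outcome: str) -> float:
        # Binary: only correctness counts; abstention is not a separate outcome,
        # so it scores like a wrong answer (assumption), pushing the model to always guess.
        return 1.0 if outcome == "correct" else -1.0

    def ternary_reward(outcome: str) -> float:
        # Ternary: correct +1, hallucination -1, abstention neutral.
        return {"correct": 1.0, "hallucination": -1.0, "abstain": 0.0}[outcome]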
@weizhepei
Zhepei Wei
1 month
TruthRL improves LLMs' ability to recognize their knowledge boundaries: ✅TruthRL enables LLMs to abstain mostly in cases where they genuinely lack the relevant knowledge ✅TruthRL is robust to hallucination-baiting questions where candidate answers are provided in the input 🧵[5/n]
1
0
1
@weizhepei
Zhepei Wei
1 month
🤯Vanilla SFT/RL increases both accuracy and hallucination, ultimately compromising truthfulness 🏆TruthRL achieves the lowest hallucination and highest truthfulness, while maintaining high accuracy Evaluation metric: Truthfulness (T) = Accuracy (A) - Hallucination (H) 🧵[4/n]
1
0
2
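A quick worked example of that metric with made-up numbers: a model that answers 70% of questions correctly and hallucinates on the other 30% gets T = 0.70 - 0.30 = 0.40, while one that answers 55% correctly, abstains on 40%, and hallucinates on only 5% gets T = 0.55 - 0.05 = 0.50, so the lower-accuracy but better-calibrated model is the more truthful one.

    # Truthfulness = Accuracy - Hallucination, with made-up numbers.
    def truthfulness(accuracy: float, hallucination: float) -> float:
        return accuracy - hallucination

    print(truthfulness(0.70, 0.30))  # always-answers model -> 0.40
    print(truthfulness(0.55, 0.05))  # abstains on the 40% it doesn't know -> 0.50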
@weizhepei
Zhepei Wei
1 month
💡A simple yet effective ternary RL objective: rewards correctness, penalizes hallucination, treats abstention as neutral 🤩Naturally favors abstention over hallucination under GRPO advantage 🚀Up to 28.9% less hallucination and 21.1% more truthfulness across 4 benchmarks 🧵[3/n]
1
0
1
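A minimal sketch of why a neutral abstention naturally wins out over hallucination under a GRPO-style group-normalized advantage; this is my own illustration, and the exact normalization and constants in the paper may differ.

    from statistics import mean, pstdev

    # Ternary reward from the thread: correct +1, abstain 0, hallucination -1.
    REWARD = {"correct": 1.0, "abstain": 0.0, "hallucination": -1.0}

    def grpo_advantages(outcomes: list) -> list:
        """Group-normalized advantages (r - mean) / std over one sampled group."""
        rewards = [REWARD[o] for o in outcomes]
        mu, sigma = mean(rewards), pstdev(rewards) or 1.0
        return [(r - mu) / sigma for r in rewards]

    # On a hard question where most samples hallucinate, the lone abstention gets a
    # strongly positive advantage (about +1.73 here), so the policy is pushed toward
    # "I don't know" rather than toward another confident wrong answer.
    print(grpo_advantages(["hallucination", "hallucination", "hallucination", "abstain"]))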
@weizhepei
Zhepei Wei
1 month
🎯 Factual accuracy alone does not necessarily guarantee truthfulness — A model that answers fewer questions correctly while reliably abstaining when uncertain is FAR MORE trustworthy than a higher-accuracy model that frequently fabricates plausible but incorrect answers 🧵[2/n]
1
0
2
@rohanpaul_ai
Rohan Paul
2 months
OpenAI released a new paper, "Why language models hallucinate." Simple answer: LLMs hallucinate because training and evaluation reward guessing instead of admitting uncertainty. The paper puts this on a statistical footing with simple, test-like incentives that reward confident guessing.
97
340
2K
@ProphetArena
Prophet Arena
3 months
🔮 Introducing Prophet Arena — the AI benchmark for general predictive intelligence. That is, can AI truly predict the future by connecting today’s dots? 👉 What makes it special? - It can’t be hacked. Most benchmarks saturate over time, but here models face live, unseen…
90
144
1K
@jiaxinhuang0229
Jiaxin Huang
3 months
Thrilled to share this exciting work, R-Zero, from my student @ChengsongH31219, where the LLM learns to reason from zero human-curated data! The framework includes co-evolution of a "Challenger" that proposes difficult tasks and a "Solver" that solves them. Check out more details in the…
@ChengsongH31219
ChengSong Huang
3 months
🚀🚀Excited to share our paper R-Zero: Self-Evolving Reasoning LLM from Zero Data! How to train an LLM without data? R-Zero teaches Large Language Models to reason starting with nothing but a base model. No data required!!! Paper: https://t.co/z4tCJFTXUG Code:
1
4
23
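As a mental model of the Challenger/Solver co-evolution (a sketch based only on the tweet's description; the callables, the 50%-difficulty shaping, and all numbers are hypothetical):

    from typing import Callable, List, Tuple

    def r_zero_style_round(
        propose_task: Callable[[], str],    # Challenger: invents a reasoning task
        try_task: Callable[[str], bool],    # Solver: one attempt, True if its answer checks out
        num_tasks: int = 16,
        attempts_per_task: int = 8,
    ) -> List[Tuple[str, float, float]]:
        """One illustrative co-evolution round with no human-curated data.

        Returns (task, solver_success_rate, challenger_reward) triples; the actual
        RL updates for both roles are omitted.
        """
        rows = []
        for _ in range(num_tasks):
            task = propose_task()
            success = sum(try_task(task) for _ in range(attempts_per_task)) / attempts_per_task
            # Toy shaping: the Challenger earns the most for tasks the Solver solves
            # about half the time (hard but learnable); trivial or impossible tasks pay 0.
            reward = 1.0 - 2.0 * abs(success - 0.5)
            rows.append((task, success, reward))
        return rows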