Zhepei Wei
@weizhepei
Followers
235
Following
2K
Media
31
Statuses
109
Ph.D. Student @CS_UVA | Research Intern @AIatMeta. Previously @AmazonScience. Research interest: ML/NLP/LLM.
Charlottesville, VA
Joined January 2016
🤔Ever wondered why your post-training methods (SFT/RL) make LLMs reluctant to say “I don't know”? 🤩Introducing TruthRL — a truthfulness-driven RL method that significantly reduces hallucinations while maintaining accuracy and enabling proper abstention! 📃 https://t.co/OXPYb09nz7 🧵[1/n]
2
14
65
🤯Using fully self-synthetic training data (i.e., prompts, responses, preferences) to align LLMs is wild! Check out @shangjian8460’s recent work!
🚀 Excited to share our new preprint: "Aligning Large Language Models via Fully Self-Synthetic Data". A long-term, scalable, and data-efficient self-improving framework for preference optimization from scratch!🤗 📄 https://t.co/JenQ7ojP3R 💻 https://t.co/YyOx3DOI6F
0
1
2
Outcome-only rewards often lead to deficient search behaviors; we find that decoupling the RL training for search and answering leads to smarter search agents! Great work led by @blancokdb!
Most RL methods train search agents on final answers, assuming good search will follow. But do they? 🤔 Our research finds this assumption is flawed❗️ 🤩Meet DeSA (Decoupling Search and Answering) – a 2-stage framework improving both search quality and answer accuracy! 🧵[1/n]
0
0
4
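To make the two-stage decoupling described above more concrete, here is a rough sketch of what such a training loop could look like: stage 1 rewards search quality, stage 2 rewards answer correctness. All names here (search_reward, answer_reward, rl_step) are illustrative stand-ins, not the DeSA implementation.

```python
# Hypothetical sketch of a decoupled two-stage training loop: stage 1 optimizes
# the agent's search behavior with a search-quality reward, stage 2 optimizes
# final answers with an outcome reward. Rewards and update rule are toy stand-ins.

def search_reward(retrieved_docs, gold_docs):
    """Stage-1 signal: reward retrieval quality (here, recall of gold evidence)."""
    hits = len(set(retrieved_docs) & set(gold_docs))
    return hits / max(len(gold_docs), 1)

def answer_reward(prediction, gold_answer):
    """Stage-2 signal: reward final-answer correctness only."""
    return 1.0 if prediction.strip().lower() == gold_answer.strip().lower() else 0.0

def rl_step(policy, reward):
    """Placeholder for one policy update (e.g., a GRPO/PPO step):
    here it just tracks a running mean reward."""
    policy["updates"] += 1
    policy["avg_reward"] += (reward - policy["avg_reward"]) / policy["updates"]

policy = {"updates": 0, "avg_reward": 0.0}

# Stage 1: update on the search-quality reward.
for query, gold_docs in [("who wrote Hamlet", ["doc_shakespeare"])]:
    retrieved = ["doc_shakespeare", "doc_macbeth"]   # rollout of the searcher
    rl_step(policy, search_reward(retrieved, gold_docs))

# Stage 2: switch the reward to final-answer correctness only.
for query, gold_answer in [("who wrote Hamlet", "William Shakespeare")]:
    prediction = "William Shakespeare"               # rollout of the answerer
    rl_step(policy, answer_reward(prediction, gold_answer))
```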
🚀 Introducing #TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning! ⚙️We introduce the first PRM that explicitly leverages tools during its reasoning process for robust step verification. 📄 Paper: https://t.co/PnGbmvIJzw [1/n]
5
4
5
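To make "tool-grounded step verification" concrete, here is a toy sketch of the general idea: a process reward model scores each reasoning step, but first executes the step's claim against the table with a tool. The table, claims, and prm_score function are all made up for illustration and are not TaTToo's actual design.

```python
# Illustrative sketch (not the TaTToo implementation) of tool-grounded step
# verification: each reasoning step's numeric claim is executed against the
# table before the (stand-in) PRM assigns it a score.

table = {"city": ["Paris", "Rome"], "population_m": [2.1, 2.8]}

def tool_check(claim_expr):
    """Ground a step by executing a simple aggregation over the table."""
    col, op = claim_expr                      # e.g., ("population_m", "max")
    values = table[col]
    return max(values) if op == "max" else sum(values)

def prm_score(step, grounded_value):
    """Stand-in for the learned PRM: reward steps whose stated value matches the tool output."""
    return 1.0 if abs(step["stated_value"] - grounded_value) < 1e-6 else 0.0

steps = [
    {"text": "The largest population is 2.8M", "stated_value": 2.8, "claim": ("population_m", "max")},
    {"text": "So the total is 6.0M", "stated_value": 6.0, "claim": ("population_m", "sum")},
]

for step in steps:
    value = tool_check(step["claim"])
    print(step["text"], "->", prm_score(step, value))   # 1.0 for the first step, 0.0 for the second
```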
Presenting our #COLM2025 paper 🥳 LLMs can achieve expert-level proficiency on academic/professional benchmarks - can they be direct alternatives to human expert annotators?🤔 A: LLMs still lag behind human experts! Reasoning offers limited help; parallel agents are promising! 🧵
3
16
89
New @AIatMeta paper trains LLMs to be more truthful by rewarding correct answers and allowing them to say “I do not know” when unsure. Reports 28.9% fewer hallucinations and 21.1% higher truthfulness compared with standard training. The big deal is a simple reward rule that separates “I do not know” from wrong answers.
11
19
144
TruthRL tackles a core LLM challenge: balancing accuracy with honest abstention. Instead of guessing, models learn to say "I don't know" when appropriate, leading to more trustworthy AI. Learn more: huggingface.co
0
1
7
Please read our paper for more interesting findings (e.g., robustness to LLM judges, scalability from 3B to 32B models, etc.) Huge thanks to my amazing collaborators: Rulin Shao @RulinShao, Yu Meng @yumeng0818, Scott Yih @scottyih, Xin Luna Dong, and many others! 🙌 🧵[10/n]
0
0
1
⏩ Beyond outcome reward – incorporating reasoning rewards: outcome-only rewards implicitly improve reasoning ability, while explicitly optimizing reasoning quality requires non-trivial design to balance multiple objectives 🧵[9/n]
1
0
2
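As a toy illustration of why the balancing act above is non-trivial (this is an assumption-laden sketch, not TruthRL's design): a weighted blend of outcome and reasoning-quality rewards can end up preferring a polished but wrong chain over a terse correct one if the weight is mis-set.

```python
# Not TruthRL's design: a toy illustration of the multi-objective trade-off when
# adding an explicit reasoning reward. `reasoning_quality` is a hypothetical
# auxiliary score (e.g., from a step verifier); `alpha` controls how much it
# competes with the outcome reward.

def combined_reward(outcome_reward, reasoning_quality, alpha=0.2):
    """Blend outcome and reasoning signals; alpha too high can reward
    well-written but wrong chains, alpha too low makes the term irrelevant."""
    return (1 - alpha) * outcome_reward + alpha * reasoning_quality

# With alpha set too aggressively, a wrong answer with a polished-looking chain
# outscores a terse correct one:
print(combined_reward(outcome_reward=0.0, reasoning_quality=1.0, alpha=0.6))  # 0.6
print(combined_reward(outcome_reward=1.0, reasoning_quality=0.2, alpha=0.6))  # 0.52
```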
Online RL vs. Offline and Semi-Online RL: 📵Purely offline RL via DPO leads to limited gains 🔄Semi-online training through iterative DPO provides some remedy, but the performance is inconsistent 🌐TruthRL with online RL (GRPO) consistently achieves the best results 🧵[8/n]
1
0
2
📊After training with TruthRL, LLMs become more confident in giving correct answers and abstaining, while the hallucination rate is significantly lower 🧵[7/n]
1
0
2
🔎Ablation study: Binary reward design excels in accuracy but is limited in truthfulness, while ternary reward achieves the best truthfulness score with strong accuracy 🧵[6/n]
1
0
2
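For concreteness, a minimal sketch of the two reward designs being ablated above, with +1/0/-1 as illustrative values (the exact constants are my assumption, not quoted from the paper):

```python
def binary_reward(outcome: str) -> float:
    # Correct vs. not-correct: abstaining is punished as hard as hallucinating,
    # so the model is pushed to always guess.
    return 1.0 if outcome == "correct" else -1.0

def ternary_reward(outcome: str) -> float:
    # Adds a neutral class for abstention, so "I don't know" is strictly better
    # than a fabricated answer but strictly worse than a correct one.
    return {"correct": 1.0, "abstain": 0.0, "hallucination": -1.0}[outcome]

for outcome in ("correct", "abstain", "hallucination"):
    print(outcome, binary_reward(outcome), ternary_reward(outcome))
```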
TruthRL improves LLMs in recognizing their knowledge boundaries: ✅TruthRL enables LLMs to abstain primarily when they genuinely lack knowledge ✅TruthRL is robust to hallucination-baiting questions where candidate answers are provided in the input 🧵[5/n]
1
0
1
🤯Vanilla SFT/RL increases both accuracy and hallucination, ultimately compromising truthfulness 🏆TruthRL achieves the lowest hallucination and highest truthfulness, while maintaining high accuracy. Evaluation metric: Truthfulness (T) = Accuracy (A) - Hallucination (H) 🧵[4/n]
1
0
2
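A quick made-up numeric illustration of the truthfulness metric above (T = A - H): a model that abstains when unsure can score higher truthfulness than a model with better raw accuracy that fabricates the rest. The rates below are invented for illustration only.

```python
# Two hypothetical models evaluated on the same questions; each question ends in
# a correct answer, a hallucination, or an abstention (rates sum to 1).

def truthfulness(accuracy, hallucination):
    return accuracy - hallucination

always_guess = {"accuracy": 0.70, "hallucination": 0.30, "abstain": 0.00}
knows_limits = {"accuracy": 0.60, "hallucination": 0.10, "abstain": 0.30}

for name, m in [("always_guess", always_guess), ("knows_limits", knows_limits)]:
    print(name, truthfulness(m["accuracy"], m["hallucination"]))
# always_guess -> 0.40, knows_limits -> 0.50: fewer correct answers, but more truthful.
```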
💡A simple yet effective ternary RL objective: rewards correctness, penalizes hallucination, treats abstention as neutral 🤩Naturally favors abstention over hallucination under GRPO advantage 🚀Up to 28.9% less hallucination and 21.1% more truthfulness across 4 benchmarks 🧵[3/n]
1
0
1
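A minimal sketch of why the neutral abstention reward above is reinforced under group-relative (GRPO-style) advantages when most rollouts hallucinate; the normalization below is the standard group mean/std baseline, and the +1/0/-1 rewards are illustrative, not constants quoted from the paper.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantage: reward minus group mean, over group std."""
    mean, std = statistics.mean(rewards), statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# A group of 8 rollouts for one hard question: 1 correct, 2 abstentions, 5 hallucinations.
rewards = [1.0, 0.0, 0.0, -1.0, -1.0, -1.0, -1.0, -1.0]
for r, a in zip(rewards, group_advantages(rewards)):
    print(f"reward={r:+.0f}  advantage={a:+.2f}")
# Abstentions (reward 0) sit above the group mean (-0.5), so they get positive
# advantage relative to hallucinations, while the correct answer is reinforced most.
```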
🎯 Factual accuracy alone does not necessarily guarantee truthfulness — A model that answers fewer questions correctly while reliably abstaining when uncertain is FAR MORE trustworthy than a higher-accuracy model that frequently fabricates plausible but incorrect answers 🧵[2/n]
1
0
2
OpenAI released a new paper, "Why language models hallucinate." Simple answer: LLMs hallucinate because training and evaluation reward guessing instead of admitting uncertainty. The paper puts this on a statistical footing with simple, test-like incentives that reward confident guessing.
97
340
2K
🔮 Introducing Prophet Arena — the AI benchmark for general predictive intelligence. That is, can AI truly predict the future by connecting today’s dots? 👉 What makes it special? - It can’t be hacked. Most benchmarks saturate over time, but here models face live, unseen events.
90
144
1K
Thrilled to share this exciting work, R-Zero, from my student @ChengsongH31219, where an LLM learns to reason from zero human-curated data! The framework includes co-evolution of a "Challenger" to propose difficult tasks and a "Solver" to solve them. Check out more details in the thread below!
🚀🚀Excited to share our paper R-Zero: Self-Evolving Reasoning LLM from Zero Data! How do you train an LLM without data? R-Zero teaches Large Language Models to reason starting with nothing but a base model. No data required!!! Paper: https://t.co/z4tCJFTXUG Code:
1
4
23
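A toy dynamics sketch of the Challenger/Solver co-evolution idea above (not the R-Zero code; the "aim for roughly 50% solve rate" difficulty signal is my guess at an illustrative driver, not the paper's objective):

```python
import random

random.seed(0)
solver_skill = 0.2           # stand-in for the Solver's current ability
challenger_difficulty = 0.1  # stand-in for how hard the Challenger's tasks are

for round_idx in range(5):
    # Challenger proposes a batch of tasks; harder tasks are solved less often.
    p_solve = min(max(solver_skill - challenger_difficulty + 0.5, 0.0), 1.0)
    solve_rate = sum(random.random() < p_solve for _ in range(100)) / 100
    # Solver update: training on its self-generated tasks nudges skill upward.
    solver_skill += 0.05 * solve_rate
    # Challenger update: push difficulty toward the Solver's frontier (~50% solve rate).
    challenger_difficulty += 0.05 * (solve_rate - 0.5)
    print(f"round {round_idx}: solve_rate={solve_rate:.2f} "
          f"solver_skill={solver_skill:.2f} difficulty={challenger_difficulty:.2f}")
```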