Tong Chen @ NeurIPS
@tomchen0
Followers
840
Following
213
Media
27
Statuses
167
OpenAI's blog (https://t.co/Mu05PFfPXg) points out that today's language models hallucinate because training and evaluation reward guessing instead of admitting uncertainty. This raises a natural question: can we reduce hallucination without hurting utility? 🤔 On-policy RL with …
27
123
674
Super happy to receive the Best Paper Award at #NeurIPS2025 for our Artificial Hivemind paper!! (Really enjoyed giving the oral talk at NeurIPS as well!)
⚠️Different models. Same thoughts.⚠️ Today's AI models converge into an Artificial Hivemind, a striking case of mode collapse that persists even across heterogeneous ensembles. Our #neurips2025 D&B oral paper (✨top …%✨) dives deep into …
37
67
784
I'll be at #NeurIPS2025 until 12/7! I work on post-training and reward signals (Spurious Rewards), currently curious about bridging the gap between how humans and LLMs learn. Looking forward to connecting with new and old friends; also exploring summer 2026 internships. DMs open!
3
7
55
I will be at #NeurIPS2025 12.3–12.7. Looking forward to meeting old and new friends! Recently working on hallucination (Binary RAR) and verbatim memorization (ParaPO), issues that scaling up pretraining cannot simply fix. Also interested in making models learn more like …
1
5
36
An 8B model can outperform AlphaEvolve on open optimization problems by scaling inference compute or with test-time RL! ⭐ Circle packing: AlphaEvolve (Gemini-2.0-Flash/Pro): 2.63586276 | Ours (DeepSeek-R1-0528-Qwen3-8B): 2.63598308 👇 in 🧵 [1/n]
6
50
190
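For context on the circle-packing number quoted above: this benchmark is typically scored as the total radius of n non-overlapping circles packed inside a unit square (n = 26 in the AlphaEvolve setup, to my understanding). The actual evaluation harness isn't shown in the thread, so the checker below is only a minimal sketch; the function name, tolerance, and example layout are illustrative.

```python
import math

def circle_packing_score(circles, tol=1e-9):
    """Score a circle-packing candidate: sum of radii if every circle lies inside
    the unit square and no two circles overlap, else None. `circles` is a list of
    (x, y, r) triples; the tolerance is illustrative."""
    # Containment: each circle must fit inside [0, 1] x [0, 1].
    for x, y, r in circles:
        if r <= 0 or x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return None
    # Non-overlap: center distance must be at least the sum of radii.
    for i in range(len(circles)):
        for j in range(i + 1, len(circles)):
            xi, yi, ri = circles[i]
            xj, yj, rj = circles[j]
            if math.hypot(xi - xj, yi - yj) < ri + rj - tol:
                return None
    return sum(r for _, _, r in circles)

# Example: four equal circles in the corners of the unit square.
corners = [(0.25, 0.25, 0.25), (0.75, 0.25, 0.25),
           (0.25, 0.75, 0.25), (0.75, 0.75, 0.25)]
print(circle_packing_score(corners))  # 1.0
```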
PhD applicants: join Akari's first cohort of students! Akari's research ranges from careful benchmarking to solid methodology. She always gives sharp feedback while being thoughtful and supportive. She stayed driven throughout her PhD and now brings that same energy to her new …
1/ Hiring PhD students at CMU SCS (LTI/MLD) for Fall 2026 (deadline 12/10). I work on open, reliable LMs: augmented LMs & agents (RAG, tool use, deep research), safety (hallucinations, copyright), and AI for science, code & multilinguality. Also open to bold new ideas! FAQ in 🧵
2
3
86
Exciting DR Tulu updates! DR Tulu-8B (new RL ckpt) sits on the performance–cost frontier, beating Tongyi DR-30B and matching OpenAI DR / Gemini 3 Pro+Search at a fraction of the cost. Now on arXiv. 🖥️ You can run an interactive CLI demo with open code, almost for free. 1/🧵
Today we're releasing Deep Research Tulu (DR Tulu): the first fully open, end-to-end recipe for long-form deep research, plus an 8B agent you can use right away. Train agents that plan, search, synthesize, & cite across sources, making expert research more accessible. 🧭
4
29
151
Olmo 3 is here! Fully open data and fully open training recipes again. Huge congrats to the whole team!
Announcing Olmo 3, a leading fully open LM suite built for reasoning, chat, & tool use, and an open model flow: not just the final weights, but the entire training journey. Best fully open 32B reasoning model & best 32B base model. 🧵
0
0
21
🔥 Thrilled to introduce DR Tulu-8B, an open long-form Deep Research model that matches OpenAI DR. 💪 Yes, just 8B! The secret? We present Reinforcement Learning with Evolving Rubrics (RLER) for long-form, non-verifiable DR tasks! Our rubrics:
- co-evolve with the policy model
- …
8
115
542
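The tweet only says that rubrics co-evolve with the policy, so the snippet below is a guessed, toy sketch of what a rubric-based reward with a co-evolution step could look like, not the RLER implementation; the keyword "judge" and the pruning rule are stand-ins for a learned judge and the paper's actual rubric-update procedure.

```python
# Toy, hypothetical sketch of a rubric-based reward that evolves between RL steps.
# Everything here is illustrative; the real rubric format and judge are not shown.

def judge_score(report: str, rubric_item: str) -> float:
    """Stand-in judge: 1.0 if the rubric keyword appears in the report."""
    return 1.0 if rubric_item.lower() in report.lower() else 0.0

def rubric_reward(report: str, rubrics: list[str]) -> float:
    """Average satisfaction over the current rubric items (0..1)."""
    return sum(judge_score(report, item) for item in rubrics) / max(len(rubrics), 1)

def evolve_rubrics(rubrics: list[str], rollouts: list[str]) -> list[str]:
    """Toy co-evolution: drop items every rollout already satisfies, so the reward
    keeps discriminating as the policy improves."""
    return [item for item in rubrics
            if not all(judge_score(r, item) == 1.0 for r in rollouts)]

rubrics = ["cites sources", "limitations", "methodology"]
rollouts = ["Report covering methodology and limitations ...",
            "Report covering methodology only ..."]
print([rubric_reward(r, rubrics) for r in rollouts])  # roughly [0.67, 0.33]
print(evolve_rubrics(rubrics, rollouts))              # "methodology" gets dropped
```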
New blog post out! We share our latest research efforts to build more effective, human-centered AI collaboration. Months ago, I was genuinely surprised by how quickly AI agents were improving, and with that came a deep fear of being replaced, of humans slowly losing agency as …
xuhuiz.com
Exploring what makes AI agents truly effective for users, beyond benchmark performance.
3
26
122
Paper: https://t.co/7pxLXwN7Jb 💻 GitHub: https://t.co/nw4ztyuxqx Huge thanks to the amazing collaborators who made this work possible. @AkariAsai @LukeZettlemoyer @HannaHajishirzi @faeze_brh [7/n]
github.com
Official repo for "Binary Retrieval-augmented Reward Mitigates Hallucinations" - chentong0/rl-binary-rar
1
4
15
💡 Finding 2: Binary reward works better than continuous reward. Continuous rewards such as VeriScore are vulnerable to reward hacking because the model can raise the score by adding correct but irrelevant claims or by adapting to formats favored by the claim extractor and …
1
1
11
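To make the reward-hacking concern above concrete, here is a toy contrast (not VeriScore or the paper's verifier) between a continuous, claim-count-style reward and a binary contradiction check: padding an answer with true but irrelevant claims inflates the former and leaves the latter unchanged. The claim sets and scoring functions are invented for illustration.

```python
# Toy illustration of reward hacking. Claims are pre-extracted strings, and the
# SUPPORTED / CONTRADICTED sets stand in for a retrieval-based verifier's output.

SUPPORTED = {"paris is the capital of france", "water boils at 100 c at sea level"}
CONTRADICTED = set()  # nothing in either answer conflicts with the evidence

def continuous_reward(claims):
    """Claim-count-style reward (toy): every extra supported claim raises the
    score, even if it is irrelevant to the question."""
    return sum(c in SUPPORTED for c in claims)

def binary_reward(claims):
    """Binary RAR-style reward (toy): 1 unless some claim is contradicted."""
    return 0 if any(c in CONTRADICTED for c in claims) else 1

on_topic = ["paris is the capital of france"]
padded = on_topic + ["water boils at 100 c at sea level"]  # true but irrelevant

print(continuous_reward(on_topic), continuous_reward(padded))  # 1 2  <- hackable
print(binary_reward(on_topic), binary_reward(padded))          # 1 1  <- unchanged
```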
In short-form QA, Binary RAR sharply cuts incorrect answers while keeping accuracy unchanged when the model is allowed to express uncertainty. Standard binary rewards for verifiable tasks give 1 only for correct answers and 0 for both incorrect and abstaining answers, which trains the model to guess rather than abstain.
1
1
10
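Assuming (from the reward definition later in this thread) that Binary RAR rewards any answer containing no evidence-contradicted claim, the incentive shift for short-form QA can be sketched as a simple reward table; the outcome labels below are illustrative, not the paper's evaluation code.

```python
# Toy reward table for short-form QA with an abstention option.
# Outcomes: "correct", "incorrect", or "abstain" (e.g. "I don't know").

def standard_binary_reward(outcome: str) -> int:
    """Conventional verifiable-task reward: only correctness pays, so abstaining
    earns no more than guessing wrong, which encourages guessing."""
    return 1 if outcome == "correct" else 0

def binary_rar_reward(outcome: str) -> int:
    """RAR-style reward as described in this thread: 0 only when the answer
    contradicts retrieved evidence, so an honest abstention is not penalized."""
    return 0 if outcome == "incorrect" else 1

for outcome in ("correct", "incorrect", "abstain"):
    print(outcome, standard_binary_reward(outcome), binary_rar_reward(outcome))
# correct 1 1 | incorrect 0 0 | abstain 0 1  <- guessing no longer beats abstaining
```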
💡 Finding 1: Utility stays unchanged. A model can reach zero hallucination by always saying “I do not know,” but then it is not useful. Our approach avoids this. In long-form generation, Binary RAR keeps the number of correct claims unchanged and reduces incorrect claims. [4/n]
1
1
11
With Binary RAR, we cut hallucinations in both long-form generation (61.9→37.5) and short-form question answering (60.6→27.6), while core skills remain unchanged. Interestingly, after RL finetuning with Binary RAR, the model keeps the same accuracy when forced to answer on …
1
1
15
Binary Retrieval-Augmented Reward (Binary RAR) gives a reward of one only when the verifier finds no contradiction between the model output and retrieved evidence, and zero otherwise. No partial credit, and little room for reward hacking. With this stepwise signal, the KL penalty in RL …
1
1
17
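A minimal sketch of the zero-or-one reward described above, assuming the verifier has already split the output into claims and judged each against retrieved evidence; the contradiction set here is a placeholder for that retrieval-plus-verification step, not the paper's implementation.

```python
# Minimal sketch of a binary retrieval-augmented reward: 1 only if no claim in
# the output is contradicted by retrieved evidence, 0 otherwise. No partial credit.

def binary_rar(output_claims: list[str], contradicted: set[str]) -> int:
    """`contradicted` holds the claims a retrieval-based verifier judged to
    conflict with evidence; a single hit zeroes the reward."""
    return 0 if any(claim in contradicted for claim in output_claims) else 1

claims_good = ["the eiffel tower is in paris", "it was completed in 1889"]
claims_bad = ["the eiffel tower is in london"]

contradicted = {"the eiffel tower is in london"}  # verifier output for this example
print(binary_rar(claims_good, contradicted))  # 1: nothing conflicts with evidence
print(binary_rar(claims_bad, contradicted))   # 0: one contradicted claim is enough
```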
RL is bounded by finite data? Introducing RLVE: RL with Adaptive Verifiable Environments. We scale RL with data procedurally generated from 400 environments that dynamically adapt to the trained model. 💡 Find supervision signals right at the LM capability frontier and scale them. 👇 in 🧵
12
115
475
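The thread doesn't show how the 400 environments adapt, so the class below is only a guessed sketch of one adaptive, verifiable environment: procedurally generated arithmetic whose difficulty moves with the policy's recent pass rate, keeping problems near the capability frontier. The environment, thresholds, and window size are all invented for illustration.

```python
import random

class AdaptiveAdditionEnv:
    """Toy adaptive verifiable environment (RLVE-flavored, hypothetical):
    problem size grows when the policy's recent pass rate is high and shrinks
    when it is low, so rewards stay informative."""

    def __init__(self, digits=2):
        self.digits = digits   # current difficulty
        self.recent = []       # rolling pass/fail history

    def sample_problem(self):
        lo, hi = 10 ** (self.digits - 1), 10 ** self.digits - 1
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        return f"{a} + {b} = ?", a + b   # prompt and verifiable gold answer

    def verify_and_adapt(self, model_answer: int, gold: int) -> int:
        reward = int(model_answer == gold)
        self.recent = (self.recent + [reward])[-20:]
        rate = sum(self.recent) / len(self.recent)
        if len(self.recent) == 20 and rate > 0.8:
            self.digits += 1          # too easy: push the frontier out
            self.recent = []
        elif len(self.recent) == 20 and rate < 0.2 and self.digits > 1:
            self.digits -= 1          # too hard: pull back toward solvable
            self.recent = []
        return reward

env = AdaptiveAdditionEnv()
prompt, gold = env.sample_problem()
print(prompt, "->", env.verify_and_adapt(gold, gold))  # reward 1 for a correct answer
```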
Our infini-gram mini paper received the Best Paper Award at #EMNLP2025!! Really proud 🥹
Wanna peek inside Internet-scale LLM training data w/o spending 💰💰💰? Introducing infini-gram mini, an exact-match search engine with 14x lower storage requirements than the OG infini-gram. We make 45.6 TB of text searchable. Read on to find our Web Interface, API, and more. (1/n) ⬇️
20
19
351
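infini-gram mini's actual index isn't shown in the tweet; purely as a toy illustration of the kind of query it answers (exact-match counts over a corpus), here is a naive suffix-array version. This captures only the flavor of the operation, not the storage-efficient structure the paper builds, and it needs Python 3.10+ for `bisect` with `key`.

```python
import bisect

# Toy exact-match counting with a suffix array; illustrative only, and far less
# efficient than a purpose-built index over terabytes of text.

def build_suffix_array(text: str) -> list[int]:
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_exact(text: str, sa: list[int], query: str) -> int:
    # Binary-search the contiguous range of suffixes that start with `query`.
    lo = bisect.bisect_left(sa, query, key=lambda i: text[i:i + len(query)])
    hi = bisect.bisect_right(sa, query, key=lambda i: text[i:i + len(query)])
    return hi - lo

corpus = "the cat sat on the mat. the cat slept."
sa = build_suffix_array(corpus)
print(count_exact(corpus, sa, "the cat"))  # 2
```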
Model collaboration talk tour continues~ Compositional intelligence. Collaborative development. Decentralized AI. By the Many. The methods. The vision. The hot takes. The comedy. If you are around one of these places, let's chat!
1
8
41
How to build agentic search systems for long-horizon tasks? Check out our new paper!
- Simple design principles are efficient and effective
- Error analysis and fine-grained analysis of search systems
A 🧵 on SLIM, our long-horizon agentic search framework
1
14
42