Xi Ye
@xiye_nlp
Followers: 3K · Following: 335 · Media: 45 · Statuses: 225
I study NLP. Postdoc fellow @PrincetonPLI. Incoming assistant professor @UAlberta (Jan 2026). CS PhD @UTAustin.
Joined March 2020
Check out our new work on making reasoning models think broadly! 🤔 We find a minimalist, surprisingly effective recipe to THINK for CHAT: RLVR + a strong reward model, trained on real-world prompts. This project was fun and surprised me in a few ways 👇 📌 We can run RL
Language models that think, chat better. We used longCoT (w/ reward model) for RLHF instead of math, and it just works. Llama-3.1-8B-Instruct + 14K ex beats GPT-4o (!) on chat & creative writing, & even Claude-3.7-Sonnet (thinking) on AlpacaEval2 and WildBench! Read on. 🧵 1/8
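The recipe above is simple enough to sketch. Below is a minimal, hypothetical illustration of the training signal, not the paper's exact setup: online RL on real chat prompts where the reward comes from a reward model scoring the final answer that follows a long chain of thought, with GRPO-style group normalization (the <think> tag convention and reward_model.score interface are assumptions).

def chat_reward(prompt, generation, reward_model):
    """Score only the visible answer; the text before </think> is the long CoT.
    Assumes <think>...</think> wrapping and a reward_model.score(prompt, answer)
    interface -- both are illustrative assumptions."""
    answer = generation.split("</think>")[-1].strip()
    return reward_model.score(prompt, answer)

def group_advantages(rewards):
    """GRPO-style normalization over a group of G sampled responses per prompt."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std or 1.0) for r in rewards]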
We will present QRHead (@WuweiZhang0723) at #EMNLP2025! Without any training, it boosts Llama-3.1-8B’s performance by >10% 📈 on long-context reasoning tasks (CLIPPER, LongMemEval) and outperforms specialized re-rankers on BEIR. Check out our (virtual) poster tonight!
🤔 Recent mech interp work showed that retrieval heads can explain some long-context behavior. But can we use this insight for retrieval?
📣 Introducing QRHeads (query-focused retrieval heads) that enhance retrieval
Main contributions:
🔍 Better head detection: we find a
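The core idea lends itself to a short illustration. Below is a minimal sketch of query-focused retrieval-head scoring, assuming a handful of pre-detected (layer, head) pairs and a simple attention-mass score; the head indices and scoring details are illustrative, not the ones from the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"
QR_HEADS = [(13, 7), (17, 24), (21, 3)]      # hypothetical (layer, head) pairs

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, attn_implementation="eager")

def qr_score(query, passage):
    """Attention mass flowing from query tokens onto passage tokens,
    summed over the selected heads; higher means more relevant."""
    enc = tok(passage + "\n\n" + query, return_tensors="pt")
    n_passage = len(tok(passage + "\n\n")["input_ids"])    # approximate boundary
    with torch.no_grad():
        out = model(**enc, output_attentions=True)
    score = 0.0
    for layer, head in QR_HEADS:
        attn = out.attentions[layer][0, head]               # (seq, seq)
        score += attn[n_passage:, :n_passage].sum().item()  # query rows -> passage cols
    return score

# Re-rank candidate passages by qr_score(query, p), descending.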
How to build agentic search systems for long-horizon tasks? Check out our new paper!
- Simple design principles are efficient and effective
- Error analysis and fine-grained analysis for search systems
A 🧵 on SLIM, our long-horizon agentic search framework
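For readers new to the setup, here is a generic sketch of the kind of long-horizon search loop such systems run. It illustrates the pattern only, not SLIM's actual design; llm and search_api are hypothetical stand-ins.

def agentic_search(task, llm, search_api, max_steps=20):
    """Bounded search-act loop: the model either issues a search query or
    commits to a final answer, keeping a compact scratchpad of findings."""
    notes = []
    for _ in range(max_steps):
        action = llm(
            f"Task: {task}\nNotes so far: {notes}\n"
            "Reply with either 'SEARCH: <query>' or 'ANSWER: <final answer>'."
        )
        if action.startswith("ANSWER:"):
            return action[len("ANSWER:"):].strip()
        query = action[len("SEARCH:"):].strip()
        results = search_api(query, k=5)                    # a few snippets
        notes.append(llm(f"Summarize for the task '{task}': {results}"))
    return llm(f"Task: {task}\nNotes: {notes}\nGive your best final answer.")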
We will have a pre-EMNLP workshop about LLMs next Monday at the @nyushanghai campus! Speakers are working on diverse and fantastic problems; really looking forward to it! We also provide a Zoom link for those who cannot join in person :) (see poster)
Claude Skills shows performance benefits from leveraging LLM skill catalogs at inference time. Our previous work (linked under thread 5/5) showed the same 6 months ago! 🌟Our new work, STAT, shows that leveraging skills during training can greatly help too‼️, e.g., Qwen can
I am going to present two papers at #COLM2025 tomorrow from 4:30-6:30pm, as none of our lead authors can attend due to visa issues. Haven't done poster presentations for years 🤣🤣... so I will do my best!
#76: LongProc
#80: Goedel-Prover v1
Our Goedel-Prover V1 will be presented at COLM 2025 in Montreal this Wednesday afternoon! I won’t be there in person, but my amazing and renowned colleague @danqi_chen will be around to help with the poster — feel free to stop by!
🚨 Modeling Abstention via Selective Help-seeking
LLMs learn to use search tools to answer questions they would otherwise hallucinate on. But can this also teach them what they know vs. what they don't? @momergul_ introduces MASH, which trains LLMs for search and gets abstention for free!
Tried Tinker as a beta user with @AdithyaNLP. As an academic, I find it an amazing platform that makes RL training at >10B scale easily accessible. RLing >10B models on a typical academic setup (a single node with a few GPUs) is a hassle, but with Tinker I can focus more on the
Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models!
Is online alignment the only way to go, despite being slow and computationally expensive? Inspired by prospect theory, we provide a human-centric explanation for why online alignment (e.g., GRPO) outperforms offline alignment (e.g., DPO, KTO) and empirically show how to close the
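For context, these are the standard objectives behind the two families being contrasted (textbook forms of GRPO and DPO, not necessarily the notation used in this paper): GRPO scores fresh on-policy samples with a group-normalized reward, while DPO reweights a fixed, offline preference dataset.

% GRPO (online): group-normalized advantage over G samples drawn from the current policy
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}

% DPO (offline): loss on a fixed preference pair (y_w preferred to y_l)
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right)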
Language models that think, chat better. We used longCoT (w/ reward model) for RLHF instead of math, and it just works. Llama-3.1-8B-Instruct + 14K ex beats GPT-4o (!) on chat & creative writing, & even Claude-3.7-Sonnet (thinking) on AlpacaEval2 and WildBench! Read on. 🧵 1/8
👀 Have you asked an LLM to provide a more detailed answer after inspecting its initial output? Users often provide such implicit feedback during interaction. ✨ We study implicit user feedback found in LMSYS and WildChat. #EMNLP2025
📢I'm joining NYU (Courant CS + Center for Data Science) starting this fall! I’m excited to connect with new NYU colleagues and keep working on LLM reasoning, reliability, coding, creativity, and more! I’m also looking to build connections in the NYC area more broadly. Please
Can data owners & LM developers collaborate to build a strong shared model while each retaining data control? Introducing FlexOlmo 💪, a mixture-of-experts LM enabling:
• Flexible training on your local data without sharing it
• Flexible inference to opt in/out your data
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
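A minimal sketch of the opt-in/opt-out inference idea: if each expert is tied to a data owner, opted-out experts can simply be masked out of routing so their parameters never contribute. This is an illustration of the concept, not FlexOlmo's actual implementation; names and shapes are made up.

import numpy as np

def moe_forward(x, router_w, experts, opted_in, top_k=2):
    """x: (d,) token state; router_w: (n_experts, d); experts: list of callables;
    opted_in: boolean mask -- False entries are excluded from routing."""
    logits = np.where(opted_in, router_w @ x, -np.inf)   # mask opted-out experts
    k = min(top_k, int(opted_in.sum()))
    top = np.argsort(logits)[-k:]                        # highest-scoring active experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                             # softmax over chosen experts
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Example: 4 owner experts, owner 2 has opted out of this deployment.
d, n = 16, 4
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(n)]
out = moe_forward(rng.normal(size=d), rng.normal(size=(n, d)),
                  experts, opted_in=np.array([True, True, False, True]))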
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
“Our team of research software engineers have played a key role in the cutting-edge research in AI at Princeton that is being picked up by industry and also garnering awards and recognition at leading conferences.” Meet the AI Lab RSEs: https://t.co/2OUOkwBzhE
There are many KV cache-reduction methods, but a fair comparison is challenging. We propose a new unified metric called “critical KV footprint”. We compare existing methods and propose a new one, PruLong, which “prunes” certain attention heads to only look at local tokens. 1/7
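As a rough back-of-the-envelope illustration of why pruning heads to local attention shrinks the cache: full-attention heads keep every past token's KV entries, while pruned heads keep only a sliding window. The head counts and window size below are made up, and the paper's "critical KV footprint" metric is more careful than this simple count.

def kv_entries(seq_len, n_heads, n_local_heads, window):
    full = (n_heads - n_local_heads) * seq_len    # full heads cache every past token
    local = n_local_heads * min(window, seq_len)  # pruned heads cache only a window
    return full + local

baseline = kv_entries(seq_len=128_000, n_heads=32, n_local_heads=0, window=4096)
pruned = kv_entries(seq_len=128_000, n_heads=32, n_local_heads=24, window=4096)
print(f"KV entries per layer: {baseline:,} -> {pruned:,} "
      f"({pruned / baseline:.1%} of baseline)")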
There’s been hot debate about (The Illusion of) The Illusion of Thinking. My take: it’s not that models can’t reason; they just aren’t perfect at long-form generation yet. We eval reasoning models on the LongProc benchmark (requiring generating 8K-token CoTs, see thread). Reasoning
🤔Now most LLMs have >= 128K context sizes, but are they good at generating long outputs, such as writing 8K token chain-of-thought for a planning problem? 🔔Introducing LongProc (Long Procedural Generation), a new benchmark with 6 diverse tasks that challenge LLMs to synthesize
LLMs trained to memorize new facts can’t use those facts well.🤔 We apply a hypernetwork to ✏️edit✏️ the gradients for fact propagation, improving accuracy by 2x on a challenging subset of RippleEdit!💡 Our approach, PropMEND, extends MEND with a new objective for propagation.
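A toy sketch of the general MEND-style mechanism this builds on: a small hypernetwork rewrites the raw gradient of an edit example before it is applied as a weight update. PropMEND's propagation objective and architecture are described in the paper; everything below is illustrative.

import torch
import torch.nn as nn

d_in, d_out = 64, 64
layer = nn.Linear(d_in, d_out)                 # the weight we want to edit

# Hypernetwork: maps a flattened raw gradient to an edited gradient.
hypernet = nn.Sequential(
    nn.Linear(d_in * d_out, 256), nn.ReLU(), nn.Linear(256, d_in * d_out)
)

def edited_update(x, target, lr=1e-2):
    """Compute the raw gradient on one edit example, pass it through the
    hypernetwork, and apply the edited gradient to the layer's weight."""
    loss = nn.functional.mse_loss(layer(x), target)
    raw_grad, = torch.autograd.grad(loss, layer.weight)
    new_grad = hypernet(raw_grad.flatten()).view_as(raw_grad)
    with torch.no_grad():
        layer.weight -= lr * new_grad
    # During meta-training, the hypernetwork itself would be optimized so the
    # edited model also answers questions that *propagate* from the new fact.

edited_update(torch.randn(8, d_in), torch.randn(8, d_out))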
Using QRRetriever takes just a few lines of code.
Shout out to @WuweiZhang0723 for leading the effort. Joint work with @fangcong_y10593, @HowardYen1, and @danqi_chen. Check out the paper and code:
📄 https://t.co/Fcb5gUMXi4
💻 https://t.co/XaLg7LUCa7
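A rough sketch of what such usage could look like; the module name, class name, and arguments below are hypothetical assumptions, and the released code linked above is authoritative.

from qr_retriever import QRRetriever  # hypothetical package/module name

retriever = QRRetriever(model_name="meta-llama/Llama-3.1-8B-Instruct")
docs = ["Passage about topic A ...", "Passage about topic B ...", "..."]
scores = retriever.score(query="Which passage discusses topic B?", documents=docs)
top_docs = [d for _, d in sorted(zip(scores, docs), reverse=True)[:2]]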