Xi Ye
@xiye_nlp
Followers: 3K · Following: 335 · Media: 45 · Statuses: 225
I study NLP. Postdoc fellow @PrincetonPLI. Incoming assistant professor @UAlberta (Jan 2026). CS PhD @UTAustin.
Joined March 2020
Check out our new work on making reasoning models think broadly! 🤔 We find a minimalist, surprisingly effective recipe to THINK for CHAT: RLVR + a strong reward model, trained on real-world prompts. This project was fun and surprised me in a few ways 👇 📌 We can run RL
Language models that think, chat better. We used longCoT (w/ reward model) for RLHF instead of math, and it just works. Llama-3.1-8B-Instruct + 14K ex beats GPT-4o (!) on chat & creative writing, & even Claude-3.7-Sonnet (thinking) on AlpacaEval2 and WildBench! Read on. 🧵 1/8
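The recipe above is simple enough to sketch. Below is a minimal, hypothetical illustration of the training signal, not the paper's exact setup: online RL on real chat prompts where the reward comes from a reward model scoring the final answer that follows a long chain of thought, with GRPO-style group normalization (the <think> tag convention and reward_model.score interface are assumptions).

def chat_reward(prompt, generation, reward_model):
    """Score only the visible answer; the text before </think> is the long CoT.
    Assumes <think>...</think> wrapping and a reward_model.score(prompt, answer)
    interface -- both are illustrative assumptions."""
    answer = generation.split("</think>")[-1].strip()
    return reward_model.score(prompt, answer)

def group_advantages(rewards):
    """GRPO-style normalization over a group of G sampled responses per prompt."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std or 1.0) for r in rewards]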
We will present QRHead (@WuweiZhang0723) at #EMNLP2025! Without any training, it boosts Llama-3.1-8B’s performance by >10% 📈 on long-context reasoning tasks (CLIPPER, LongMemEval) and outperforms specialized re-rankers on BEIR. Check out our (virtual) poster tonight!
🤔 Recent mech interp work showed that retrieval heads can explain some long-context behavior. But can we use this insight for retrieval?
📣 Introducing QRHeads (query-focused retrieval heads) that enhance retrieval
Main contributions:
🔍 Better head detection: we find a
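The core idea lends itself to a short illustration. Below is a minimal sketch of query-focused retrieval-head scoring, assuming a handful of pre-detected (layer, head) pairs and a simple attention-mass score; the head indices and scoring details are illustrative, not the ones from the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"
QR_HEADS = [(13, 7), (17, 24), (21, 3)]      # hypothetical (layer, head) pairs

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, attn_implementation="eager")

def qr_score(query, passage):
    """Attention mass flowing from query tokens onto passage tokens,
    summed over the selected heads; higher means more relevant."""
    enc = tok(passage + "\n\n" + query, return_tensors="pt")
    n_passage = len(tok(passage + "\n\n")["input_ids"])    # approximate boundary
    with torch.no_grad():
        out = model(**enc, output_attentions=True)
    score = 0.0
    for layer, head in QR_HEADS:
        attn = out.attentions[layer][0, head]               # (seq, seq)
        score += attn[n_passage:, :n_passage].sum().item()  # query rows -> passage cols
    return score

# Re-rank candidate passages by qr_score(query, p), descending.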
How to build agentic search systems for long-horizon tasks? Check out our new paper!
- Simple design principles are efficient and effective
- Error analysis and fine-grained analysis for search systems
A 🧵 on SLIM, our long-horizon agentic search framework
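For readers new to the setup, here is a generic sketch of the kind of long-horizon search loop such systems run. It illustrates the pattern only, not SLIM's actual design; llm and search_api are hypothetical stand-ins.

def agentic_search(task, llm, search_api, max_steps=20):
    """Bounded search-act loop: the model either issues a search query or
    commits to a final answer, keeping a compact scratchpad of findings."""
    notes = []
    for _ in range(max_steps):
        action = llm(
            f"Task: {task}\nNotes so far: {notes}\n"
            "Reply with either 'SEARCH: <query>' or 'ANSWER: <final answer>'."
        )
        if action.startswith("ANSWER:"):
            return action[len("ANSWER:"):].strip()
        query = action[len("SEARCH:"):].strip()
        results = search_api(query, k=5)                    # a few snippets
        notes.append(llm(f"Summarize for the task '{task}': {results}"))
    return llm(f"Task: {task}\nNotes: {notes}\nGive your best final answer.")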
We will have a pre-EMNLP workshop about LLMs next Monday at the @nyushanghai campus! Speakers are working on diverse and fantastic problems; really looking forward to it! We also provide a Zoom link for those who cannot join in person :) (see poster)
Claude Skills shows performance benefits from leveraging LLM skill catalogs at inference time. Our previous work (linked under thread 5/5) showed the same 6 months ago! 🌟Our new work, STAT, shows that leveraging skills during training can greatly help too‼️, e.g., Qwen can
I am going to present two papers at #COLM2025 tomorrow from 4:30-6:30pm, as none of our lead authors can attend due to visa issues. Haven't done poster presentations for years 🤣🤣... so I will do my best!
#76: LongProc
#80: Goedel-Prover v1
Our Goedel-Prover V1 will be presented at COLM 2025 in Montreal this Wednesday afternoon! I won’t be there in person, but my amazing and renowned colleague @danqi_chen will be around to help with the poster — feel free to stop by!
🚨 Modeling Abstention via Selective Help-seeking
LLMs learn to use search tools to answer questions they would otherwise hallucinate on. But can this also teach them what they know vs. what they don't? @momergul_ introduces MASH, which trains LLMs for search and gets abstention for free!
Tried Tinker as a beta user with @AdithyaNLP. As an academic, I find it an amazing platform that makes RL training at >10B scale easily accessible. RLing >10B models on a typical academic setup (a single node with a few GPUs) is a hassle, but with Tinker I can focus more on the
Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models!
Is online alignment the only way to go, despite being slow and computationally expensive? Inspired by prospect theory, we provide a human-centric explanation for why online alignment (e.g., GRPO) outperforms offline alignment (e.g., DPO, KTO) and empirically show how to close the
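For context, these are the standard objectives behind the two families being contrasted (textbook forms of GRPO and DPO, not necessarily the notation used in this paper): GRPO scores fresh on-policy samples with a group-normalized reward, while DPO reweights a fixed, offline preference dataset.

% GRPO (online): group-normalized advantage over G samples drawn from the current policy
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}

% DPO (offline): loss on a fixed preference pair (y_w preferred to y_l)
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right)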
Language models that think, chat better. We used longCoT (w/ reward model) for RLHF instead of math, and it just works. Llama-3.1-8B-Instruct + 14K ex beats GPT-4o (!) on chat & creative writing, & even Claude-3.7-Sonnet (thinking) on AlpacaEval2 and WildBench! Read on. 🧵 1/8
👀 Have you asked an LLM to provide a more detailed answer after inspecting its initial output? Users often provide such implicit feedback during interaction. ✨ We study implicit user feedback found in LMSYS and WildChat. #EMNLP2025
📢I'm joining NYU (Courant CS + Center for Data Science) starting this fall! I’m excited to connect with new NYU colleagues and keep working on LLM reasoning, reliability, coding, creativity, and more! I’m also looking to build connections in the NYC area more broadly. Please
Can data owners & LM developers collaborate to build a strong shared model while each retaining data control? Introducing FlexOlmo 💪, a mixture-of-experts LM enabling:
• Flexible training on your local data without sharing it
• Flexible inference to opt in/out your data
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
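A minimal sketch of the opt-in/opt-out inference idea: if each expert is tied to a data owner, opted-out experts can simply be masked out of routing so their parameters never contribute. This is an illustration of the concept, not FlexOlmo's actual implementation; names and shapes are made up.

import numpy as np

def moe_forward(x, router_w, experts, opted_in, top_k=2):
    """x: (d,) token state; router_w: (n_experts, d); experts: list of callables;
    opted_in: boolean mask -- False entries are excluded from routing."""
    logits = np.where(opted_in, router_w @ x, -np.inf)   # mask opted-out experts
    k = min(top_k, int(opted_in.sum()))
    top = np.argsort(logits)[-k:]                        # highest-scoring active experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                             # softmax over chosen experts
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Example: 4 owner experts, owner 2 has opted out of this deployment.
d, n = 16, 4
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(n)]
out = moe_forward(rng.normal(size=d), rng.normal(size=(n, d)),
                  experts, opted_in=np.array([True, True, False, True]))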
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
“Our team of research software engineers have played a key role in the cutting-edge research in AI at Princeton that is being picked up by industry and also garnering awards and recognition at leading conferences.” Meet the AI Lab RSEs: https://t.co/2OUOkwBzhE
There are many KV cache-reduction methods, but a fair comparison is challenging. We propose a new unified metric called “critical KV footprint”. We compare existing methods and propose a new one, PruLong, which “prunes” certain attention heads to only look at local tokens. 1/7
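As a rough back-of-the-envelope illustration of why pruning heads to local attention shrinks the cache: full-attention heads keep every past token's KV entries, while pruned heads keep only a sliding window. The head counts and window size below are made up, and the paper's "critical KV footprint" metric is more careful than this simple count.

def kv_entries(seq_len, n_heads, n_local_heads, window):
    full = (n_heads - n_local_heads) * seq_len    # full heads cache every past token
    local = n_local_heads * min(window, seq_len)  # pruned heads cache only a window
    return full + local

baseline = kv_entries(seq_len=128_000, n_heads=32, n_local_heads=0, window=4096)
pruned = kv_entries(seq_len=128_000, n_heads=32, n_local_heads=24, window=4096)
print(f"KV entries per layer: {baseline:,} -> {pruned:,} "
      f"({pruned / baseline:.1%} of baseline)")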
There’s been hot debate about (The Illusion of) The Illusion of Thinking. My take: it’s not that models can’t reason; they just aren’t perfect at long-form generation yet. We eval reasoning models on the LongProc benchmark (requiring generating 8K-token CoTs, see thread). Reasoning
🤔Now most LLMs have >= 128K context sizes, but are they good at generating long outputs, such as writing 8K token chain-of-thought for a planning problem? 🔔Introducing LongProc (Long Procedural Generation), a new benchmark with 6 diverse tasks that challenge LLMs to synthesize
LLMs trained to memorize new facts can’t use those facts well.🤔 We apply a hypernetwork to ✏️edit✏️ the gradients for fact propagation, improving accuracy by 2x on a challenging subset of RippleEdit!💡 Our approach, PropMEND, extends MEND with a new objective for propagation.
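A toy sketch of the general MEND-style mechanism this builds on: a small hypernetwork rewrites the raw gradient of an edit example before it is applied as a weight update. PropMEND's propagation objective and architecture are described in the paper; everything below is illustrative.

import torch
import torch.nn as nn

d_in, d_out = 64, 64
layer = nn.Linear(d_in, d_out)                 # the weight we want to edit

# Hypernetwork: maps a flattened raw gradient to an edited gradient.
hypernet = nn.Sequential(
    nn.Linear(d_in * d_out, 256), nn.ReLU(), nn.Linear(256, d_in * d_out)
)

def edited_update(x, target, lr=1e-2):
    """Compute the raw gradient on one edit example, pass it through the
    hypernetwork, and apply the edited gradient to the layer's weight."""
    loss = nn.functional.mse_loss(layer(x), target)
    raw_grad, = torch.autograd.grad(loss, layer.weight)
    new_grad = hypernet(raw_grad.flatten()).view_as(raw_grad)
    with torch.no_grad():
        layer.weight -= lr * new_grad
    # During meta-training, the hypernetwork itself would be optimized so the
    # edited model also answers questions that *propagate* from the new fact.

edited_update(torch.randn(8, d_in), torch.randn(8, d_out))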
Using QRRetriever takes just a few lines of code.
Shout out to @WuweiZhang0723 for leading the effort. Joint work with @fangcong_y10593, @HowardYen1, and @danqi_chen. Check out the paper and code:
📄 https://t.co/Fcb5gUMXi4
💻 https://t.co/XaLg7LUCa7
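A rough sketch of what such usage could look like; the module name, class name, and arguments below are hypothetical assumptions, and the released code linked above is authoritative.

from qr_retriever import QRRetriever  # hypothetical package/module name

retriever = QRRetriever(model_name="meta-llama/Llama-3.1-8B-Instruct")
docs = ["Passage about topic A ...", "Passage about topic B ...", "..."]
scores = retriever.score(query="Which passage discusses topic B?", documents=docs)
top_docs = [d for _, d in sorted(zip(scores, docs), reverse=True)[:2]]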