Yuhan Liu Profile
Yuhan Liu

@YuhanLiu_nlp

Followers: 465 · Following: 1K · Media: 9 · Statuses: 149

CS PhD student @NYU_Courant advised by @eunsolc; previously an intern @tsvetshop

NYC
Joined March 2022
@YuhanLiu_nlp
Yuhan Liu
15 days
👀Have you asked an LLM to provide a more detailed answer after inspecting its initial output? Users often provide such implicit feedback during interaction. ✨We study the implicit user feedback found in LMSYS and WildChat. #EMNLP2025
[image]
2
21
75
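A minimal sketch of what mining such implicit feedback from chat logs could look like, assuming a simple cue-phrase heuristic; the cue lists and helper names below are illustrative, not the paper's actual taxonomy or classifier:

```python
# Toy heuristic for mining implicit feedback from a chat log.
# Cue phrases and labels are illustrative, not the paper's taxonomy.
from typing import Dict, List

NEGATIVE_CUES = ["more detail", "that's not", "not what i asked", "try again"]
POSITIVE_CUES = ["thanks", "perfect", "that works"]

def label_feedback(turn: str) -> str:
    """Heuristically label a user turn as implicit feedback."""
    t = turn.lower()
    if any(cue in t for cue in NEGATIVE_CUES):
        return "negative"
    if any(cue in t for cue in POSITIVE_CUES):
        return "positive"
    return "none"

def mine_feedback(conversation: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Collect (response, feedback, label) triples from consecutive turns."""
    triples = []
    for prev, cur in zip(conversation, conversation[1:]):
        if prev["role"] == "assistant" and cur["role"] == "user":
            label = label_feedback(cur["content"])
            if label != "none":
                triples.append({"response": prev["content"],
                                "feedback": cur["content"],
                                "label": label})
    return triples

chat = [
    {"role": "user", "content": "Summarize this paper."},
    {"role": "assistant", "content": "It studies X."},
    {"role": "user", "content": "Can you give more detail?"},
]
print(mine_feedback(chat))  # -> one 'negative' implicit-feedback triple
```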
@jaseweston
Jason Weston
10 days
🌀Diversity Aware RL (DARLING)🌀 📝: https://t.co/MH0tui34Cb
- Jointly optimizes for quality & diversity using a learned partition function
- Outperforms standard RL in quality AND diversity metrics, e.g. higher pass@1/p@k
- Works for both non-verifiable & verifiable tasks
🧵1/5
[image]
4
81
406
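As a rough illustration of the quality-plus-diversity idea in the tweet above, here is a toy sketch that boosts a quality score by a diversity bonus computed against the other samples; the bag-of-words distance and quality_fn stand in for DARLING's learned partition function and reward model:

```python
# Toy diversity-aware reward: quality times a diversity bonus against the
# other sampled responses. Bag-of-words cosine distance stands in for
# DARLING's learned partition function; quality_fn for a reward model.
from collections import Counter
import math

def bow_distance(a: str, b: str) -> float:
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def diversity_aware_rewards(responses, quality_fn, alpha=0.5):
    """Boost each sample's quality reward by its average distance to peers."""
    rewards = []
    for i, r in enumerate(responses):
        others = [o for j, o in enumerate(responses) if j != i]
        diversity = sum(bow_distance(r, o) for o in others) / max(len(others), 1)
        rewards.append(quality_fn(r) * (1.0 + alpha * diversity))
    return rewards

samples = ["the cat sat", "the cat sat down", "a dog ran far away"]
print(diversity_aware_rewards(samples, quality_fn=lambda r: len(r.split()) / 10))
```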
@eunsolc
Eunsol Choi
15 days
We looked into user interaction logs, searching for learning signals to improve LLMs. We find user data noisy and nontrivial to leverage, yet we learned a lot about user behaviors! See our #EMNLP25 paper.
[quoted tweet: @YuhanLiu_nlp's #EMNLP2025 announcement, shown above]
3
8
65
@YuhanLiu_nlp
Yuhan Liu
15 days
Q: Can we use it to provide targeted improvements to the model? A: It depends on the evaluation domain! Finetuning with feedback semantics shows improvement on MTBench, but not on WildBench. We hypothesize that this is because WildBench has more clearly specified instructions.
[image]
1
1
4
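One plausible (hypothetical) reading of "finetuning with feedback semantics" is to fold the user's feedback back into the instruction and train on the improved response; the field names and construction below are illustrative, not the paper's exact recipe:

```python
# Hypothetical construction of a finetuning pair from mined feedback:
# fold the feedback into the instruction and target the improved response.
def to_sft_example(prompt: str, improved_response: str, feedback: str) -> dict:
    revised_prompt = f"{prompt}\n(Known user preference: {feedback})"
    return {"instruction": revised_prompt, "target": improved_response}

ex = to_sft_example(
    prompt="Explain transformers.",
    improved_response="Transformers are ... (the longer answer the user accepted)",
    feedback="Please give a more detailed answer.",
)
print(ex["instruction"])
```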
@YuhanLiu_nlp
Yuhan Liu
15 days
Q: When does such user feedback occur? A: User feedback can result from low-quality initial requests. We find several factors that correlate with later feedback, such as the quality of user instructions.
[image]
1
0
2
@shangbinfeng
Shangbin Feng
24 days
Two caveats with self-alignment:
⚠️ A single model struggles to reliably judge its own generation.
⚠️ A single model struggles to reliably generate diverse responses to learn from.
👉 Introducing Sparta Alignment, where multiple LMs collectively align through ⚔️ combat.
[image]
2
10
34
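A toy sketch of the combat idea, assuming pairwise duels scored by a judge and tracked with Elo-style ratings; judge_fn and the rating update are illustrative stand-ins, not the paper's algorithm:

```python
# Toy "combat" round: every pair of models duels on a prompt, a judge picks
# winners, and Elo-style ratings accumulate. Illustrative only.
import itertools

def elo_update(ra: float, rb: float, a_wins: bool, k: float = 32.0):
    ea = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))
    sa = 1.0 if a_wins else 0.0
    return ra + k * (sa - ea), rb + k * ((1.0 - sa) - (1.0 - ea))

def combat_round(models: dict, prompt: str, generate_fn, judge_fn):
    """One round of pairwise duels; returns updated ratings and outputs."""
    ratings = {name: 1000.0 for name in models}
    outputs = {name: generate_fn(m, prompt) for name, m in models.items()}
    for a, b in itertools.combinations(models, 2):
        a_wins = judge_fn(prompt, outputs[a], outputs[b])
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_wins)
    return ratings, outputs

models = {"lm_a": "A", "lm_b": "B", "lm_c": "C"}
gen = lambda m, p: f"model {m}'s answer to {p!r}" * (2 if m == "C" else 1)
judge = lambda p, x, y: len(x) > len(y)  # toy judge: prefer the longer answer
print(combat_round(models, "Explain RL.", gen, judge)[0])  # lm_c rates highest
```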
@shangbinfeng
Shangbin Feng
24 days
🙅 We don’t need LLMs as solvers of graph problems like shortest path. (We already have Dijkstra.)
🤔 What we need: learn from synthetic graphs and generalize to real-world problems with graph implications (e.g. multi-hop QA).
💡 We unlock this generalization with alignment and RL.
[image]
2
14
74
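The appeal of synthetic graph tasks is that rewards are verifiable against a classical solver. A minimal sketch, assuming shortest-path instances checked with Dijkstra (purely illustrative; the paper's actual task mix may differ):

```python
# Toy generator of verifiable graph problems: random weighted graphs with
# gold shortest-path answers from Dijkstra, so an RL reward can be exact-match.
import heapq
import random

def random_graph(n=6, p=0.4):
    edges = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if random.random() < p:
                w = random.randint(1, 9)
                edges[i].append((j, w))
                edges[j].append((i, w))
    return edges

def dijkstra(edges, src, dst):
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in edges[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return None  # destination unreachable

def make_example():
    g = random_graph()
    return {"prompt": f"Graph (weighted adjacency): {g}. Shortest 0->5 distance?",
            "gold": dijkstra(g, 0, 5)}  # reward = exact match against gold

print(make_example())
```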
@anuj_diwan
Anuj Diwan
23 days
ParaSpeechCaps has been accepted to the EMNLP 2025 Main Conference!
@anuj_diwan
Anuj Diwan
6 months
Introducing ParaSpeechCaps, our large-scale style captions dataset that enables rich, expressive control for text-to-speech models! Beyond basic pitch or speed controls, our models can generate speech that sounds "guttural", "scared", "whispered" and more; 59 style tags in total.
[image]
0
3
40
@shangbinfeng
Shangbin Feng
24 days
🔎 Can you guess these words? [answer @ end of thread]
🤔 Will VLMs decipher human creativity behind these Minecraft builds and work out the answers?
👉 Introducing GuessBench, evaluating VLM creativity understanding with data sourced from human players and real-world gameplay.
[two images]
2
7
12
@shangbinfeng
Shangbin Feng
24 days
👀 How to find more difficult/novel/salient evaluation data?
✨ Let the data generators find it for you! Introducing Data Swarms, in which multiple data generator LMs collaboratively search the weight space to optimize quantitative desiderata of evaluation.
[image]
2
17
114
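As a rough sketch of "searching in the weight space", here is a toy particle-swarm loop over generator weight vectors, with a stand-in fitness for "how hard the generated eval data is for a target model"; none of this is the paper's actual procedure:

```python
# Toy particle-swarm search over generator "weights". fitness() stands in for
# "difficulty of the eval data this generator produces for a target model".
import random

def fitness(weights):
    return -sum((w - 0.7) ** 2 for w in weights)  # toy objective, peak at 0.7

def data_swarm(num_particles=8, dim=4, steps=50):
    particles = [[random.random() for _ in range(dim)] for _ in range(num_particles)]
    velocity = [[0.0] * dim for _ in range(num_particles)]
    best = list(max(particles, key=fitness))
    for _ in range(steps):
        for p, v in zip(particles, velocity):
            for d in range(dim):
                # Drift each particle toward the best generator found so far.
                v[d] = 0.7 * v[d] + 0.3 * random.random() * (best[d] - p[d])
                p[d] += v[d]
        cand = max(particles, key=fitness)
        if fitness(cand) > fitness(best):
            best = list(cand)
    return best, fitness(best)

print(data_swarm())  # weights near 0.7, fitness near 0
```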
@gregd_nlp
Greg Durrett
1 month
📢I'm joining NYU (Courant CS + Center for Data Science) starting this fall! I’m excited to connect with new NYU colleagues and keep working on LLM reasoning, reliability, coding, creativity, and more! I’m also looking to build connections in the NYC area more broadly. Please
[two images]
94
45
765
@jaseweston
Jason Weston
2 months
🌿Introducing MetaCLIP 2 🌿
📝: https://t.co/RyytqxRAw3 · code, model: https://t.co/P0POS9E2EC
After four years of advancements in English-centric CLIP development, MetaCLIP 2 is now taking the next step: scaling CLIP to worldwide data. The effort addresses long-standing
[image]
14
68
331
@oshaikh13
Omar Shaikh
2 months
BREAKING NEWS! Most people aren’t prompting models with IMO problems :) They’re prompting with tasks that need more context, like “plz make talk slides.” In an ACL oral, I’ll cover challenges in human-LM grounding (in 60K+ real interactions) & introduce a benchmark: RIFTS. 🧵
[image]
5
52
273
@brunchavecmoi
Fangyuan Xu
2 months
At #ACL2025 this week! Please come check out our poster from 11:00 a.m. to 12:30 p.m. today at Hall 4/5.
@brunchavecmoi
Fangyuan Xu
6 months
Can we generate long text from a compressed KV cache? We find existing KV cache compression methods (e.g., SnapKV) degrade rapidly in this setting. We present RefreshKV, an inference method that ♻️ refreshes the smaller KV cache and better preserves performance.
[image]
1
4
36
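A self-contained toy of the refresh idea: decode mostly against a small pruned KV cache, but periodically attend over the full cache and re-select which entries to keep. Numpy vectors stand in for real KV states; this is a sketch of the intuition, not the paper's implementation:

```python
# Toy RefreshKV-style decoding loop: decode against a small pruned cache,
# but periodically re-attend over the FULL cache and re-prune it.
import numpy as np

rng = np.random.default_rng(0)

def select_topk(keys, values, query, k):
    """Keep the k cache entries the current query attends to most."""
    idx = np.argsort(keys @ query)[-k:]
    return keys[idx], values[idx]

def generate(steps=64, dim=16, prompt_len=256, keep=32, refresh_every=8):
    K = rng.normal(size=(prompt_len, dim))  # full key cache from the "prompt"
    V = rng.normal(size=(prompt_len, dim))  # full value cache
    q = rng.normal(size=dim)                # current query vector
    k_small, v_small = select_topk(K, V, q, keep)
    outputs = []
    for step in range(steps):
        if step > 0 and step % refresh_every == 0:
            # Refresh: one pass over the full cache, then re-prune.
            k_small, v_small = select_topk(K, V, q, keep)
        scores = k_small @ q
        w = np.exp(scores - scores.max())  # numerically stable softmax
        ctx = (w / w.sum()) @ v_small      # toy attention output this step
        q = np.tanh(ctx)                   # toy next-step query
        K = np.vstack([K, q])              # new token still appends
        V = np.vstack([V, ctx])            # to the full cache
        outputs.append(float(ctx.mean()))
    return outputs

print(round(sum(generate()), 3))
```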
@JanePan_
Jane Pan
2 months
I'll be at ACL Vienna 🇦🇹 next week presenting this work! If you're around, come say hi on Monday (7/28) from 18:00–19:30 in Hall 4/5. Would love to chat about code model benchmarks 🧠, simulating user interactions 🤝, and human-centered NLP in general!
@JanePan_
Jane Pan
7 months
When benchmarks talk, do LLMs listen? Our new paper shows that evaluating code LLMs with interactive feedback significantly affects model performance compared to standard static benchmarks! Work w/ @RyanShar01, @jacob_pfau, @atalwalkar, @hhexiy, and @valeriechen_! [1/6]
[image]
1
5
52
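A minimal sketch of interactive evaluation as described above: the model gets simulated feedback on a failing attempt and another try. All function names here are illustrative stand-ins for the benchmark's actual protocol:

```python
# Toy interactive evaluation loop: the model sees simulated feedback on a
# failing attempt and revises. Function names are stand-ins for the protocol.
def interactive_eval(model_fn, feedback_fn, run_tests, problem, max_rounds=3):
    """Return (passed, rounds_used) for one coding problem."""
    prompt = problem
    for round_no in range(1, max_rounds + 1):
        code = model_fn(prompt)
        if run_tests(code):
            return True, round_no
        prompt = f"{problem}\nFeedback: {feedback_fn(code)}\nPlease revise."
    return False, max_rounds

def run_tests(code: str) -> bool:
    ns = {}
    exec(code, ns)  # trusted toy example only; never exec untrusted code
    return ns["add"](2, 3) == 5

attempts = iter(["def add(a, b): return a - b",   # first try fails the test
                 "def add(a, b): return a + b"])  # revision passes
model = lambda prompt: next(attempts)
feedback = lambda code: "add(2, 3) returned the wrong value"
print(interactive_eval(model, feedback, run_tests, "Write add(a, b)."))  # (True, 2)
```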
@tsvetshop
tsvetshop
2 months
A novel, cognitive-science-grounded approach to preference modeling: synthetic counterfactual training + attention-based attribute integration. Empirical validation across 45 communities, with human evaluation confirming the interpretability claims! 🌟
@StellaLisy
Stella Li
2 months
WHY do you prefer something over another? Reward models treat preference as a black box 😶‍🌫️, but human brains 🧠 decompose decisions into hidden attributes. We built the first system to mirror how people really make decisions in our #COLM2025 paper: 🎨PrefPalette✨ Why it matters 👉🏻🧵
[image]
0
4
8
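A toy sketch of attribute-decomposed preference scoring: score each response on named attributes, then combine them with context-dependent weights (the quoted tweet above mentions attention-based integration; the attribute names and scorers below are illustrative stand-ins):

```python
# Toy attribute-decomposed preference scoring: per-attribute scores combined
# with context-dependent weights. Attributes and scorers are illustrative.
ATTRIBUTES = ["helpfulness", "humor", "formality"]

def score_attributes(response: str) -> dict:
    """Stand-in for learned per-attribute scorer models."""
    return {
        "helpfulness": min(len(response.split()) / 50, 1.0),
        "humor": 1.0 if "!" in response else 0.2,
        "formality": 0.8 if response[:1].isupper() else 0.3,
    }

def preference_score(response: str, weights: dict) -> float:
    """Weighted sum over attributes; in the paper the weights come from
    attention-based integration conditioned on context (e.g., community)."""
    scores = score_attributes(response)
    return sum(weights[a] * scores[a] for a in ATTRIBUTES)

# The same two responses rank differently under different community weights.
casual = {"helpfulness": 0.3, "humor": 0.6, "formality": 0.1}
formal = {"helpfulness": 0.5, "humor": 0.0, "formality": 0.5}
r1, r2 = "Sure thing, here you go!", "Dear user, the requested analysis follows."
print(preference_score(r1, casual) > preference_score(r2, casual))  # True
print(preference_score(r1, formal) < preference_score(r2, formal))  # True
```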
@hhexiy
He He
3 months
Talking to ChatGPT isn’t like talking to a collaborator yet. It doesn’t track what you really want to do—only what you just said. Check out work led by @jcyhc_ai and @rico_angel that shows how attackers can exploit this, and a simple fix: just look at more context!
@jcyhc_ai
John (Yueh-Han) Chen
3 months
LLMs won’t tell you how to make fake IDs—but will reveal the layouts/materials of IDs and make realistic photos if asked separately. 💥Such decomposition attacks reach 87% success across QA, text-to-image, and agent settings! 🛡️Our monitoring method defends with 93% success! 🧵
[image]
2
7
27
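A minimal sketch of the monitoring intuition: score the cumulative intent of the whole conversation rather than each turn in isolation, so decomposed sub-requests that look benign alone still trip the threshold. The keyword scorer is a toy stand-in for an LLM judge:

```python
# Toy conversation-level monitor: score cumulative intent across turns so
# that decomposed sub-requests trip the threshold together even when each
# looks benign on its own.
RISKY_TOPICS = {"fake id": 0.5, "id layout": 0.3, "realistic photo": 0.3}

def turn_risk(text: str) -> float:
    t = text.lower()
    return sum(w for topic, w in RISKY_TOPICS.items() if topic in t)

def monitor(conversation, threshold=0.6):
    cumulative = 0.0
    for turn in conversation:
        cumulative += turn_risk(turn)
        if cumulative >= threshold:
            return "blocked"  # joint context crosses the risk threshold
    return "allowed"

turns = ["What materials go into an ID layout?",
         "Now make a realistic photo of a person holding a card."]
print(monitor(turns[:1]))  # allowed: one sub-request stays under threshold
print(monitor(turns))      # blocked: the combined context crosses it
```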
@xiye_nlp
Xi Ye
8 months
🤔Most LLMs now have context sizes of 128K tokens or more, but are they good at generating long outputs, such as writing an 8K-token chain-of-thought for a planning problem? 🔔Introducing LongProc (Long Procedural Generation), a new benchmark with 6 diverse tasks that challenge LLMs to synthesize
[two images]
3
49
220
@SonglinYang4
Songlin Yang
4 months
📢 (1/16) Introducing PaTH 🛣️ — a RoPE-free contextualized position encoding scheme, built for stronger state tracking, better extrapolation, and hardware-efficient training. PaTH outperforms RoPE across short and long language modeling benchmarks https://t.co/nJItUuYKWZ
[link card · arxiv.org] The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for...
9
92
542