Ben Shi Profile
Ben Shi

@BenShi34

Followers: 314 · Following: 504 · Media: 19 · Statuses: 97

Human-centered NLP | SF 🌉 | prev @princeton_NLP @meta

New Jersey, USA
Joined April 2024
@BenShi34
Ben Shi
6 months
As we optimize model reasoning for verifiable objectives, how does this affect humans' ability to understand that reasoning and achieve better collaborative outcomes? In our new preprint, we investigate human-centric model reasoning for knowledge transfer 🧵:
6
42
184
@karthik_r_n
Karthik Narasimhan
2 days
This is not reward hacking. The policy in tau-airline has this by design and one of the tasks even makes use of it. We've actually observed some other models try this strategy at times before, but decided to keep the task and policy as is since upgrading flights is not something…
@alexalbert__
Alex Albert
3 days
We had to remove the τ2-bench airline eval from our benchmarks table because Opus 4.5 broke it by being too clever. The benchmark simulates an airline customer service agent. In one test case, a distressed customer calls in wanting to change their flight, but they have a basic…
3
2
133
@OfirPress
Ofir Press
14 days
SWE-bench verified results make it seem like we are close to achieving human parity on coding, but users of coding agents know that that's not where we are yet. The solution is to build benchmarks that challenge LMs on even tougher tasks. SWE-fficiency, SciCode & AlgoTune make…
@18jeffreyma
Jeff Ma
15 days
We measure agents using Speedup Ratio (SR), encouraging long-term benchmark progress!

SR = (agent speedup) / (expert speedup)
SR = 1× → expert parity
SR > 1× → above-expert optimization!

But agents fall short, struggling to match expert performance or maintain correctness!
2
11
83
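A runnable restatement of the Speedup Ratio defined above; the 3×/4× numbers are made up purely for illustration:

```python
def speedup_ratio(agent_speedup: float, expert_speedup: float) -> float:
    """Speedup Ratio (SR) = (agent speedup) / (expert speedup).

    SR = 1.0 -> expert parity; SR > 1.0 -> above-expert optimization.
    """
    return agent_speedup / expert_speedup

# Illustrative only: an agent that makes a workload 3x faster, measured
# against an expert who achieved 4x, lands below parity.
print(speedup_ratio(3.0, 4.0))  # 0.75
```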
@jyangballin
John Yang ✈️ NeurIPS
22 days
New eval! Code duels for LMs ⚔️

Current evals test LMs on *tasks*: "fix this bug," "write a test"

But we code to achieve *goals*: maximize revenue, cut costs, win users

Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals
28
92
374
@HowardYen1
Howard Yen
1 month
How to build agentic search systems for long-horizon tasks? Check out our new paper!

- Simple design principles are efficient and effective
- Error analysis and fine-grained analysis for search systems

A 🧵 on SLIM, our long-horizon agentic search framework
1
14
41
@sayashk
Sayash Kapoor
1 month
📣New paper: Rigorous AI agent evaluation is much harder than it seems. For the last year, we have been working on infrastructure for fair agent evaluations on challenging benchmarks. Today, we release a paper that condenses our insights from 20,000+ agent rollouts on 9…
20
100
425
@a1zhang
Alex L Zhang
1 month
What if scaling the context windows of frontier LLMs is much easier than it sounds? We’re excited to share our work on Recursive Language Models (RLMs). A new inference strategy where LLMs can decompose and recursively interact with input prompts of seemingly unbounded length…
126
356
3K
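The tweet doesn't spell out the mechanics, but the recursive idea can be sketched roughly as below. `llm` is a hypothetical stub for a real model call, the fixed-size chunking is naive, and this is an assumption-laden illustration, not the paper's actual method:

```python
def llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return f"<answer to {len(prompt)}-char prompt>"

def rlm(prompt: str, budget: int = 4000) -> str:
    # Base case: the whole prompt fits in one model call.
    if len(prompt) <= budget:
        return llm(prompt)
    # Recursive case: chunk the oversized prompt (reserving room for the
    # instruction), condense each chunk, then recurse on the joined notes.
    # Assumes each summary is shorter than its input, so the text shrinks.
    step = budget - 100
    chunks = [prompt[i:i + step] for i in range(0, len(prompt), step)]
    notes = [llm("Summarize:\n" + chunk) for chunk in chunks]
    return rlm("Answer using these notes:\n" + "\n".join(notes), budget)

print(rlm("x" * 10_000))  # decomposes, condenses, then answers
```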
@alexisjross
Alexis Ross
1 month
Can LLMs reason like a student? 👩🏻‍🎓📚✏️ For educational tools like AI tutors, modeling how students make mistakes is crucial. But current LLMs are much worse at simulating student errors ❌ than performing correct ✅ reasoning. We try to fix that with our method MISTAKE 🤭👇
11
55
337
@SierraPlatform
Sierra
2 months
Check out 𝜏-Bench's new leaderboard. High-level metrics are great, but they're more valuable when third parties can inspect how the results were achieved. The leaderboard makes evaluations more transparent, interactive, and community-driven.
3
2
10
@alexisjross
Alexis Ross
2 months
One of my fave papers from COLM by @BenShi34! Really cool eval too (a friends and family Turing test!)
@m2saxon
Michael Saxon
2 months
IMPersona from @BenShi34 et al #COLM2025 They trained LMs on participants' real chat logs. Then they brought in *each participant's friend* to do a personalized Turing test. The chatbot was surprisingly effective, passing as the friend 44% of the time. LMs can impersonate YOU🫵
1
3
33
@BenShi34
Ben Shi
2 months
Thinking a lot about user simulation lately. IMPersona offers a testbed for high-fidelity sims: if an LLM is output-wise indistinguishable from a specific person, it should approximate that person’s policy across environments. Check out IMPersona @COLM_conf Wed AM, poster #1!
@BenShi34
Ben Shi
8 months
Can language models effectively impersonate you to family and friends? We find that they can: 44% of the time, close friends and family misidentify Llama-3.1-8b as human… 🧵👇
0
1
10
@realJessyLin
Jessy Lin
2 months
What does it take to build a human-like user simulator? // To train collaborative agents, we need better user sims. In blog post pt 2, @NickATomlin and I sketch a framework for building user simulators + open questions for research: https://t.co/FD0dRt22lR
jessylin.com
3
11
59
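As a rough illustration of what a minimal user simulator can look like; the class, the persona/goal framing, and the `call_model` stub are all hypothetical, not the design from the blog post:

```python
from dataclasses import dataclass, field

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call.
    return "I still need the flight moved to Friday."

@dataclass
class SimulatedUser:
    """Toy persona-conditioned user simulator for exercising an agent."""
    persona: str
    goal: str
    history: list = field(default_factory=list)

    def reply(self, agent_message: str) -> str:
        self.history.append(("agent", agent_message))
        transcript = "\n".join(f"{who}: {msg}" for who, msg in self.history)
        prompt = (f"You are {self.persona}. Your hidden goal: {self.goal}.\n"
                  f"{transcript}\nuser:")
        message = call_model(prompt)
        self.history.append(("user", message))
        return message

user = SimulatedUser(persona="a hurried business traveler",
                     goal="rebook onto a Friday flight")
print(user.reply("Hi! How can I help you today?"))
```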
@letti_liu
Sijia Liu
2 months
Is online alignment the only path to go despite being slow and computationally expensive? Inspired by prospect theory, we provide a human-centric explanation for why online alignment (e.g. GRPO) outperforms offline alignment (e.g. DPO, KTO) and empirically show how to close the…
4
35
176
@BenShi34
Ben Shi
2 months
Accepted to #NeurIPS2025! Big shoutout to our ~120 participants, who graciously allowed me to pester them daily with reminder emails, bug fixes, and troubleshooting queries 😓
@BenShi34
Ben Shi
6 months
As we optimize model reasoning for verifiable objectives, how does this affect humans' ability to understand that reasoning and achieve better collaborative outcomes? In our new preprint, we investigate human-centric model reasoning for knowledge transfer 🧵:
0
1
14
@BenShi34
Ben Shi
4 months
Thanks for reposting! I can’t believe I just noticed this 😓
@_akhaliq
AK
6 months
When Models Know More Than They Can Explain: Quantifying Knowledge Transfer in Human-AI Collaboration
0
1
10
@BenShi34
Ben Shi
5 months
IMPersona is accepted at #COLM2025 + recommended for oral! Check out our work on imbuing human personality and memories into LLMs, allowing them to evade detection even by close friends and family. @COLM_conf
@BenShi34
Ben Shi
8 months
Can language models effectively impersonate you to family and friends? We find that they can: 44% of the time, close friends and family misidentify Llama-3.1-8b as human… 🧵👇
0
0
6
@ori_press
Ori Press
5 months
Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️
6
63
159
@BenShi34
Ben Shi
6 months
This and lots more insights (like a trajectory visualizer) at https://t.co/0vklaV6pgt. Thanks to Carlos, Diyi, Nick, Shunyu, and Karthik for helping make this happen! I’ve wanted to do this project for an entire year and it’s so rewarding finally seeing it come to fruition :)
kite-live.vercel.app
Research on mechanisms and dynamics of knowledge transfer in human-AI collaborative settings.
0
0
7
@BenShi34
Ben Shi
6 months
As we build more powerful AI, we need to be equally intentional about building AI that can effectively teach and collaborate with humans; otherwise we risk creating a world of powerful but incomprehensible AI assistants.
1
0
5
@BenShi34
Ben Shi
6 months
By clustering embeddings of user queries, feedback, and model responses, we see a nuanced picture of the interaction types that drive effective and ineffective knowledge transfer in collaboration.
1
0
3
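A tiny sketch of that pipeline: embed the interaction texts, cluster them, and read each cluster as an interaction type. TF-IDF stands in here for the embeddings, and the texts and cluster count are illustrative, not the paper's data or setup:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "Can you explain step 2 again?",            # user query
    "Why does that rule apply here?",           # user query
    "That hint didn't help, try a new angle.",  # user feedback
    "Here's a worked example of the rule...",   # model response
]
X = TfidfVectorizer().fit_transform(texts)  # stand-in for LLM embeddings
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # one cluster id per interaction text
```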