
Yu Su (hiring postdoc)
@ysu_nlp
Followers
11K
Following
4K
Media
132
Statuses
2K
cooking something new. prof. @osunlp. sloan fellow. intelligence and agents. author of Mind2Web, SeeAct, MMMU, HippoRAG, BioCLIP, UGround.
Columbus, OH
Joined March 2013
Sharing the slides of my talk at Princeton yesterday: "A holistic and critical look at language agents". LLM-based language agents are exciting, but it's also undeniably quite a chaotic space: are agents the next big thing, or are they just thin wrappers
16
123
516
Excited to receive the NSF CAREER Award! Grateful for all the support and encouragement I've received in the 6 years of faculty life so far, especially for my extremely supportive family and for the amazing students @osunlp I have had the privilege to work with!!
23
11
254
RT @_zifan_wang: 🧵 (1/9) New @scale_AI research paper: "Search-Time Data Contamination" (STC), which occurs in evaluating search-based LLM…
0
19
0
RT @_zifan_wang: 🧵 (6/9) Our findings suggest that traditional capability benchmarks may not adequately assess search-based LLM agents. We r…
arxiv.org
Agentic search such as Deep Research systems - where agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers - represents a major shift in how...
0
5
0
Even though benchmarks are becoming less relevant, I must say this is a very impressive set of results and the cooler way to flex about benchmark numbers.
Introducing GLM-4.5V: a breakthrough in open-source visual reasoning. GLM-4.5V delivers state-of-the-art performance among open-source models in its size class, dominating across 41 benchmarks. Built on the GLM-4.5-Air base model, GLM-4.5V inherits proven techniques from
3
2
14
Excited to partner with the Princeton team on Holistic Agent Leaderboard! Claude continues to be the best choice for agent tasks, but overall we still have a long way to go as a field.
How does GPT-5 compare against Claude Opus 4.1 on agentic tasks? Since their release, we have been evaluating these models on challenging science, web, service, and code tasks. Headline result: While cost-effective, so far GPT-5 never tops agentic leaderboards. More evals 🧵
1
4
20
RT @xiangyue96: 20 months after our multimodal reasoning benchmark MMMU release, both frontier and open models ar…
0
20
0
Congrats @OpenAI on the MMMU improvement! Also, @AnthropicAI should take notes here for the art of making bar charts.
7
1
23
Hmm, looks kind of familiar: LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error - ACL'24.
arxiv.org
Tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily...
Announcing MCP•RL: teach your model how to use any MCP server automatically using reinforcement learning! Just connect any MCP server, and your model will start playing with it and (using RL) "learn from experience" how to use its tools most effectively!
4
3
24
🤔 I thought @windsurf's Claude access was revoked?
Today we're releasing Claude Opus 4.1, an upgrade to Claude Opus 4 on agentic tasks, real-world coding, and reasoning.
0
0
3
RT @nouhadziri: Come join us tomorrow at the 1st LLM agents workshop in ACL (REALM); amazing talks and oral presentations are ahead, with a…
0
18
0
RT @boyuan__zheng: Remember “Son of Anton” from the Silicon Valley show (@SiliconHBO)? The experimental AI that “efficiently” orders 4,000 l…
0
29
0
RT @vardaanpahuja: Excited to share our #ACL2025 Findings paper: Explorer - a scalable pipeline that generates diverse web trajectories v…
0
13
0
Safety is one of the biggest blockers for computer use agents: how can I trust an agent won't accidentally do something consequential without my permission? We collect and release the first large-scale dataset for detecting consequential actions on the web, and train the best
As AI agents start taking real actions online, how do we prevent unintended harm? We teamed up with @OhioState and @UCBerkeley to create WebGuard: the first dataset for evaluating web agent risks and building real-world safety guardrails for online environments. 🧵
0
20
100
RT @vimar_gu: Announcing the @NeurIPSConf 2025 workshop on Imageomics: Discovering Biological Knowledge from Images Using AI! The workshop…
0
15
0