
Tao Yu
@taoyds
Followers: 5K
Following: 1K
Media: 39
Statuses: 450
@XLangNLP lab, asst. prof. @HKUniversity. author of OSWorld, Aguvis, Spider, OpenAgents, Text2Reward, Instructor. prev. postdoc @uwnlp; phd @Yale.
Seattle
Joined March 2016
RT @FaZhou_998: 🐙Octothinker tech report is finally out! We also release the 70B math-focusing mid-training dataset -- MegaMath-Web-Pro-Max….
0
17
0
RT @CaimingXiong: Graphical user interface (GUI) grounding, one of the two key abilities (Grounding & Planning) for Computer-use Agent (e.g….
0
34
0
Try out Claude 4 on Computer Agent Arena!
💠Claude Opus 4 & Claude Sonnet 4. Welcome to the Computer Agent Arena🔥. Congratulations to the @AnthropicAI team on the great release!
0
0
3
RT @XLangNLP: 💠Claude Opus 4 & Claude Sonnet 4.Welcome to the Computer Agent Arena🔥.Congratulations on the @AnthropicAI team for the great….
0
4
0
Big congrats, Wei-Lin!
Thrilled to announce our funding round — led by @a16z, @UofCalifornia, and a strong group of backers to grow @lmarena_ai! We're building the infrastructure for open, reliable AI evaluation, and your feedback drives us forward. Try out our new UI today. More updates coming soon!
0
0
2
RT @workshopcua: We're excited to invite Victor Zhong (@hllo_wrld) as a speaker at the workshop on Computer Use Agents - @icmlconf 2025! 🤖….
0
3
0
RT @ysu_nlp: New AI/LLM Agents Track at #EMNLP2025! In the past few years, it feels a bit odd to submit agent work to *CL venues because….
0
22
0
RT @Diyi_Yang: 🚀 Introducing CAVA: The Comprehensive Assessment for Voice Assistants. A new benchmark for evaluating end-to-end, speech-in-….
0
32
0
🤔Static CUA benchmarks enable fast model dev but lack task variety and risk overfitting. Computer Agent Arena tests crowdsourced real-world tasks. OSWorld: 🥇UI-Tars1.5🥈Operator🥉Claude 3.7. CUA Arena: 🥇Claude 3.7🥈Operator🥉UI-Tars1.5. 🚀Rankings likely to evolve quickly
🏆 Leaderboard Update! 🚀 Claude 3.7 Sonnet from @AnthropicAI ties #1 in Computer Agent Arena, followed by Operator from @OpenAI & UI-TARS-1.5 from @BytedanceTalk, which is significantly different from prior benchmarks! Check the full rankings! 👉
0
11
36
RT @BowenWangNLP: 😀Our initial leaderboard finally came out, here I'd like to share a few interesting findings based on our case study: 1,….
0
5
0
RT @XLangNLP: 🏆 Leaderboard Update! 🚀 Claude 3.7 Sonnet from @AnthropicAI ties #1 in Computer Agent Arena, followed by Operator from @OpenA….
0
23
0
RT @Alibaba_Qwen: Introducing Qwen3! We release and open-weight Qwen3, our latest large language models, including 2 MoE models and 6 den….
0
2K
0
Computer use often involves long contexts, and users frequently tweak or follow up on requests. Though Claude 3.7/Operator aren’t perfect, this example shows their engaging and instruction-following abilities are growing (see the arena example):
🚀 Exciting news! @OpenAI's o3 & o4-mini, the most capable reasoning models, are now live on Computer Agent Arena! Test, vote, and explore their full potential with CUAs at Join the community and dive in!
1
5
15
👉Try UI-TARS-1.5 and other computer use agents (Operator, Claude 3.7) at .
🎉 UI-TARS-1.5 is now live on Computer Agent Arena! Currently the SOTA model across multiple GUI benchmarks, showcasing leading performance in computer use, browser use, and even gameplay. Want to try the most intelligent CUA so far? Go to
0
0
1