Bing Liu Profile
Bing Liu

@vbingliu

Followers: 1K · Following: 47 · Media: 15 · Statuses: 74

Head of Research @Scale_AI, ex-Meta, Llama 3, PhD @CarnegieMellon.

Mountain View, CA
Joined February 2016
@vbingliu
Bing Liu
2 months
🚀 Introducing SWE-Bench Pro — a new benchmark to evaluate LLM coding agents on real, enterprise-grade software engineering tasks. This is the next step beyond SWE-Bench: harder, contamination-resistant, and closer to real-world repos.
55
111
1K
@vbingliu
Bing Liu
11 days
@scale_AI @ai_risks Resources
📊 Leaderboard: https://t.co/Snc2qAVeAQ
📰 Paper: https://t.co/rYvJRPYUO8
📽️ Watch the full video:
1
5
27
@vbingliu
Bing Liu
11 days
Can AI actually automate jobs? @Scale_AI and @ai_risks are launching the Remote Labor Index (RLI), the first benchmark and public leaderboard that test how well AI agents can complete real, paid freelance work in domains like software engineering, design, architecture, data
22
75
461
@vbingliu
Bing Liu
19 days
We’ve also updated the repo + added more requested model runs: https://t.co/165qMVUnYe Big thanks to the community for all the feedback - keep it coming! 🙏 We’re iterating together to make SWE-Bench Pro a reliable benchmark for agentic SWE systems. More to come. ⚡️
0
0
1
@vbingliu
Bing Liu
19 days
Besides the new results, we’re also releasing:
📂 Full trajectories for these runs + previous models → https://t.co/phI9gN1ArE
🎛️ Interactive visualization on Docent → https://t.co/LWQQupJrIV
Transparency & reproducibility continue to be core to this project.
1
0
0
@vbingliu
Bing Liu
19 days
SWE-Bench Pro is designed for realistic, end-to-end SWE tasks: debugging, reasoning, editing, and verifying code. The uncapped results show what happens when models can truly think longer and harder. 🧠⚙️
1
0
0
@vbingliu
Bing Liu
19 days
By default, the SWE-Agent scaffold runs with a cost limit of $2-3 per instance. In this update, we remove that limit and measure the frontier of model capability, closer to how labs would run full-budget evals. This helps separate economic efficiency from raw competence.
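A minimal sketch of what that budget cap means, assuming a generic agent loop rather than the actual SWE-Agent scaffold or SWE-Bench Pro harness; `run_instance`, `agent_step`, and the parameter names are illustrative placeholders:

```python
from typing import Callable, Optional

def run_instance(
    agent_step: Callable[[], tuple[bool, float]],
    cost_limit: Optional[float] = None,
    max_steps: int = 200,
) -> dict:
    """Run one task; agent_step() returns (solved, cost_of_this_step) in dollars.

    With cost_limit set (e.g. 2.0-3.0), the run stops once the budget is spent,
    even if more steps might have solved the task. With cost_limit=None
    ("uncapped"), only the step limit applies, so the score reflects raw
    capability rather than cost efficiency.
    """
    spent = 0.0
    for _ in range(max_steps):
        solved, step_cost = agent_step()
        spent += step_cost
        if solved:
            return {"solved": True, "cost": spent}
        if cost_limit is not None and spent >= cost_limit:
            return {"solved": False, "cost": spent, "reason": "budget exhausted"}
    return {"solved": False, "cost": spent, "reason": "step limit reached"}
```

Calling this with `cost_limit=None` corresponds to the uncapped setting reported in this thread.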
1
0
0
@vbingliu
Bing Liu
19 days
🚀 Updated results are out for SWE-Bench Pro! The top model, Sonnet 4.5, now hits 43%! And in this uncapped setting, Sonnet 4 = 42%, GPT-5 = 36%.
@scale_AI
Scale AI
19 days
We launched SWE-Bench Pro last month to incredible feedback, and we’ve now updated the leaderboard with the latest models and no cost caps. SoTA models now break 40% pass rate. Congrats to @Anthropic for sweeping the top spots! 🥇Claude 4.5 Sonnet 🥈Claude 4 Sonnet 🥉Claude 4.5
1
0
7
@vbingliu
Bing Liu
25 days
Joint work from our amazing team: Xingang Guo (@Xingang20), Utkarsh Tyagi, Advait Gosai (@agxsai), Paula Vergara, Ernesto Gabriel Hernández Montoya (@eghmontoya), Chen Bo Calvin Zhang (@calvincbzhang), Bin Hu, Yunzhong He (@_yunzhong), Rakshith Sharma Srinivasa (@rsharma9201)
0
0
6
@vbingliu
Bing Liu
25 days
🔭 What’s next
We hope VisualToolBench can serve as a testbed for tracking how models blend visual + textual reasoning into a unified multimodal cognition stack.
Leaderboard, paper, and dataset 👇
📊 https://t.co/r1eAwag25P
📄 https://t.co/olgNWlOwJz
📂
1
0
4
@vbingliu
Bing Liu
25 days
In short: Multimodal LLMs can see, but they still struggle to see and interact. VisualToolBench exposes the gap between visual perception and visual reasoning, and how tool use can bridge it.
1
0
2
@vbingliu
Bing Liu
25 days
We evaluated 16 leading MLLMs, and results were eye-opening 👇
1️⃣ GPT-5-think: 18.7% pass rate (best overall)
2️⃣ OpenAI models benefit from active tool use
3️⃣ Gemini-2.5-pro gains little from tools
4️⃣ 70–80% of failures = visual perception errors
1
0
2
@vbingliu
Bing Liu
25 days
Most multimodal benchmarks stop at perception. VisualToolBench moves beyond “look & answer” → toward “see, act, and think.” It treats images as a cognitive workspace where reasoning and manipulation meet.
1
0
2
@vbingliu
Bing Liu
25 days
In the real world, images are messy: rotated, blurry, cluttered. To solve these, models must crop, edit, enhance, and reason dynamically. That’s exactly what VisualToolBench measures.
📊 1,200+ open-ended tasks
🧰 Diverse domains with tool-use
🧮 Detailed rubrics
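As a rough illustration only (not the VisualToolBench API; the tool names and PIL-based implementations are assumptions), the kind of image operations such an agent would need might look like this:

```python
from PIL import Image, ImageEnhance

def crop(img: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Zoom into the (left, upper, right, lower) region the model asks about."""
    return img.crop(box)

def rotate(img: Image.Image, degrees: float) -> Image.Image:
    """Undo an awkward rotation so text or diagrams become readable."""
    return img.rotate(degrees, expand=True)

def enhance_contrast(img: Image.Image, factor: float = 1.5) -> Image.Image:
    """Boost contrast on blurry or washed-out inputs before re-inspecting them."""
    return ImageEnhance.Contrast(img).enhance(factor)

# A "see, act, think" episode interleaves these calls with reasoning:
# the agent inspects the image, requests crop/rotate/enhance as needed,
# then answers only after the relevant detail is visible.
TOOLS = {"crop": crop, "rotate": rotate, "enhance_contrast": enhance_contrast}
```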
1
0
2
@vbingliu
Bing Liu
25 days
🧠 Can your model think with images? Today we’re releasing VisualToolBench, a new benchmark for multimodal reasoning with tool use that tests whether multimodal LLMs can think-with-images, not just think about them.
1
3
26
@vbingliu
Bing Liu
1 month
0
0
1
@vbingliu
Bing Liu
1 month
Our team: MohammadHossein Rezaei (@mhrezaeics), Robert Vacareanu (@robert_nlp), Zihao Wang (@wzihao12), Clinton Wang (@clintonjwang), Yunzhong He (@_yunzhong), Feyza Akyürek (@afeyzaakyurek) Paper:
0
1
4
@vbingliu
Bing Liu
1 month
📊 OnlineRubrics consistently outperforms static rubrics (both human-written & synthetic) and “catch-all” rubrics across benchmarks. It delivers higher win rates (up to +8%) on open-ended tasks and stronger results in expert domains like GPQA & GSM8K.
2
0
3
@vbingliu
Bing Liu
1 month
✅ How OnlineRubrics works:
1. Compare responses from the current model vs. a control model
2. Use an LLM to extract new criteria from their differences
3. Add those criteria to the static rubrics in real time
Rubrics adapt as the model trains.
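A minimal sketch of that loop, assuming a generic `llm` text-completion callable and a list-of-strings rubric format; the prompt wording and function names are illustrative, not the paper's implementation:

```python
from typing import Callable

ELICIT_PROMPT = (
    "Compare response A (current policy) with response B (control model). "
    "List new grading criteria that capture meaningful differences, one per line."
)

def elicit_criteria(llm: Callable[[str], str], current: str, control: str) -> list[str]:
    """Steps 1-2: diff the two responses and ask an LLM for new criteria."""
    raw = llm(f"{ELICIT_PROMPT}\n\nA:\n{current}\n\nB:\n{control}")
    return [line.strip("-• ").strip() for line in raw.splitlines() if line.strip()]

def update_rubric(static_rubric: list[str], new_criteria: list[str]) -> list[str]:
    """Step 3: merge elicited criteria into the static rubric during training."""
    return static_rubric + [c for c in new_criteria if c not in static_rubric]
```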
1
1
4
@vbingliu
Bing Liu
1 month
Rubrics highlight desired behaviors but under-specify undesired ones. This makes static rubrics easy to hack (e.g. self-praising: “the following advice is the most relevant”). And "catch-all" rubrics aren't enough because they miss emergent behaviors as models evolve.
1
0
4
@vbingliu
Bing Liu
1 month
🔄 RLHF → RLVR → Rubrics → OnlineRubrics
👤 Human feedback = noisy & coarse
🧮 Verifiable rewards = too narrow
📋 Static rubrics = rigid, easy to hack, miss emergent behaviors
💡 We introduce OnlineRubrics: elicited rubrics that evolve as models train.
https://t.co/YI6pJ7jfJ1
5
42
266