Bing Liu
@vbingliu
1K Followers · 47 Following · 15 Media · 74 Statuses
Head of Research @Scale_AI, ex-Meta, Llama 3, PhD @CarnegieMellon.
Mountain View, CA
Joined February 2016
🚀 Introducing SWE-Bench Pro — a new benchmark to evaluate LLM coding agents on real, enterprise-grade software engineering tasks. This is the next step beyond SWE-Bench: harder, contamination-resistant, and closer to real-world repos.
@scale_AI @ai_risks Resources
📊 Leaderboard: https://t.co/Snc2qAVeAQ
📰 Paper: https://t.co/rYvJRPYUO8
📽️ Watch the full video:
We’ve also updated the repo + added more requested model runs: https://t.co/165qMVUnYe Big thanks to the community for all the feedback - keep it coming! 🙏 We’re iterating together to make SWE-Bench Pro a reliable benchmark for agentic SWE systems. More to come. ⚡️
Besides the new results, we’re also releasing:
📂 Full trajectories for these runs + previous models → https://t.co/phI9gN1ArE
🎛️ Interactive visualization on Docent → https://t.co/LWQQupJrIV
Transparency & reproducibility continue to be core to this project.
SWE-Bench Pro is designed for realistic, end-to-end SWE tasks: debugging, reasoning, editing, and verifying code. The uncapped results show what happens when models can truly think longer and harder. 🧠⚙️
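For readers unfamiliar with SWE-Bench-style grading, here is a rough sketch of the "verify" step: apply the candidate patch to the task's repository, then re-run the tests that were failing before. The repo layout, patch format, and pytest call below are illustrative placeholders, not the SWE-Bench Pro harness itself.

```python
import subprocess

def verify_patch(repo_dir: str, patch: str, fail_to_pass: list[str]) -> bool:
    """Apply a model-generated patch and re-run the previously failing tests.

    A task counts as solved only if those tests now pass. The git/pytest
    invocations here are generic placeholders for whatever harness is used.
    """
    # Apply the candidate patch to a clean checkout of the task's repository.
    apply = subprocess.run(
        ["git", "apply", "-"], cwd=repo_dir, input=patch, text=True
    )
    if apply.returncode != 0:
        return False  # patch does not even apply cleanly

    # Re-run the fail-to-pass tests that encode the task's requirements.
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *fail_to_pass], cwd=repo_dir
    )
    return result.returncode == 0
```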
By default, the SWE-Agent scaffold runs with a cost limit of $2-3 per instance. In this update, we remove that limit and measure the frontier of model capability, closer to how labs would run full-budget evals. This helps separate economic efficiency from raw competence.
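Conceptually, the cap is just a per-instance budget check inside the agent loop. The sketch below uses a hypothetical scaffold (the `agent.step` / `step.cost_usd` names are made up for illustration, not the SWE-Agent API); it only shows where a cost limit would bite and what "uncapped" changes.

```python
def run_instance(task, agent, cost_limit: float | None = 3.0):
    """Drive one agent rollout, optionally stopping once spend exceeds a budget.

    `task`, `agent`, and the step/cost bookkeeping are hypothetical stand-ins;
    the point is only where the per-instance cap applies.
    """
    total_cost = 0.0
    while not agent.is_done():
        step = agent.step(task)          # one LLM call + tool action
        total_cost += step.cost_usd      # token cost of that call

        # Capped setting: stop early once the per-instance budget is spent.
        # Uncapped setting (cost_limit=None): let the model keep working,
        # which measures raw capability rather than cost efficiency.
        if cost_limit is not None and total_cost > cost_limit:
            return agent.best_patch_so_far(), total_cost

    return agent.final_patch(), total_cost
```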
🚀 Updated results are out for SWE-Bench Pro! The top model, Sonnet 4.5, now hits 43%! And in this uncapped setting, Sonnet 4 = 42%, GPT-5 = 36%.
We launched SWE-Bench Pro last month to incredible feedback, and we’ve now updated the leaderboard with the latest models and no cost caps. SoTA models now break a 40% pass rate. Congrats to @Anthropic for sweeping the top spots!
🥇 Claude 4.5 Sonnet
🥈 Claude 4 Sonnet
🥉 Claude 4.5
Joint work from our amazing team: Xingang Guo (@Xingang20), Utkarsh Tyagi, Advait Gosai (@agxsai), Paula Vergara, Ernesto Gabriel Hernández Montoya (@eghmontoya), Chen Bo Calvin Zhang (@calvincbzhang), Bin Hu, Yunzhong He (@_yunzhong), Rakshith Sharma Srinivasa (@rsharma9201)
🔭 What’s next
We hope VisualToolBench can serve as a testbed for tracking how models blend visual + textual reasoning into a unified multimodal cognition stack.
Leaderboard, paper, and dataset 👇
📊 https://t.co/r1eAwag25P
📄 https://t.co/olgNWlOwJz
📂 huggingface.co
In short: Multimodal LLMs can see, but they still struggle to see and interact. VisualToolBench exposes the gap between visual perception and visual reasoning, and how tool use can bridge it.
We evaluated 16 leading MLLMs, and results were eye-opening 👇
1️⃣ GPT-5-think: 18.7% pass rate (best overall)
2️⃣ OpenAI models benefit from active tool use
3️⃣ Gemini-2.5-pro gains little from tools
4️⃣ 70–80% of failures = visual perception errors
Most multimodal benchmarks stop at perception. VisualToolBench moves beyond “look & answer” → toward “see, act, and think.” It treats images as a cognitive workspace where reasoning and manipulation meet.
In the real world, images are messy: rotated, blurry, cluttered. To handle them, models must crop, edit, enhance, and reason dynamically. That’s exactly what VisualToolBench measures.
📊 1,200+ open-ended tasks
🧰 Diverse domains with tool use
🧮 Detailed rubrics
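As an illustration of what such image tools can look like, here is a minimal Pillow-based sketch of crop / rotate / contrast operations an agent could chain before answering. The function names and signatures are assumptions for illustration, not VisualToolBench's actual tool API.

```python
from PIL import Image, ImageEnhance

# Illustrative image tools an agent might call on a messy input image.
# Names and signatures are hypothetical, not the benchmark's tool schema.

def crop(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Zoom into a region of interest, e.g. a small label in a cluttered photo."""
    return image.crop(box)

def rotate(image: Image.Image, degrees: float) -> Image.Image:
    """Straighten a rotated document or sign before reading it."""
    return image.rotate(degrees, expand=True)

def enhance_contrast(image: Image.Image, factor: float = 2.0) -> Image.Image:
    """Boost contrast so faint or blurry text becomes legible."""
    return ImageEnhance.Contrast(image).enhance(factor)

# A tool-using model would chain these before answering, for example:
# region = rotate(crop(img, (120, 40, 480, 220)), -7)
```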
🧠 Can your model think with images? Today we’re releasing VisualToolBench, a new benchmark for multimodal reasoning with tool use that tests whether multimodal LLMs can think with images, not just think about them.
Our team: MohammadHossein Rezaei (@mhrezaeics), Robert Vacareanu (@robert_nlp), Zihao Wang (@wzihao12), Clinton Wang (@clintonjwang), Yunzhong He (@_yunzhong), Feyza Akyürek (@afeyzaakyurek)
Paper:
📊 OnlineRubrics consistently outperforms static rubrics (both human-written & synthetic) and “catch-all” rubrics across benchmarks. It delivers higher win rates (up to +8%) on open-ended tasks and stronger results in expert domains like GPQA & GSM8K.
✅ How OnlineRubrics works:
1. Compare responses from the current model vs. a control model
2. Use an LLM to extract new criteria from their differences
3. Add those criteria to the static rubrics in real time
Rubrics adapt as the model trains.
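A compressed sketch of that loop. The judge prompt wording, function names, and one-criterion-per-line output format are assumptions for illustration, not the paper's implementation.

```python
def update_rubrics(prompt, current_model, control_model, judge_llm, rubrics):
    """One online-rubric refresh: diff two models' answers, mine new criteria.

    `judge_llm` is any LLM callable returning text; the prompt and the
    line-per-criterion format below are illustrative assumptions.
    """
    current_answer = current_model.generate(prompt)
    control_answer = control_model.generate(prompt)

    # Ask an LLM to turn the differences between the two responses into
    # checkable grading criteria, one per line.
    elicited = judge_llm(
        "Compare these two responses to the same prompt and list new grading "
        "criteria, one per line, that capture meaningful differences.\n\n"
        f"Prompt:\n{prompt}\n\n"
        f"Response A (current policy):\n{current_answer}\n\n"
        f"Response B (control):\n{control_answer}"
    )

    # Append the elicited criteria to the static rubric so grading keeps up
    # with behaviors that emerge as the policy trains.
    new_criteria = [line.strip() for line in elicited.splitlines() if line.strip()]
    return rubrics + new_criteria
```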
Rubrics highlight desired behaviors but under-specify undesired ones. This makes static rubrics easy to hack (e.g. self-praising: “the following advice is the most relevant”). And "catch-all" rubrics aren't enough because they miss emergent behaviors as models evolve.
🔄 RLHF → RLVR → Rubrics → OnlineRubrics
👤 Human feedback = noisy & coarse
🧮 Verifiable rewards = too narrow
📋 Static rubrics = rigid, easy to hack, miss emergent behaviors
💡 We introduce OnlineRubrics: elicited rubrics that evolve as models train. https://t.co/YI6pJ7jfJ1