Bing Liu
@vbingliu
1K Followers · 47 Following · 15 Media · 74 Statuses
Head of Research @Scale_AI, ex-Meta, Llama 3, PhD @CarnegieMellon.
Mountain View, CA
Joined February 2016
🚀 Introducing SWE-Bench Pro — a new benchmark to evaluate LLM coding agents on real, enterprise-grade software engineering tasks. This is the next step beyond SWE-Bench: harder, contamination-resistant, and closer to real-world repos.
@scale_AI @ai_risks Resources
📊 Leaderboard: https://t.co/Snc2qAVeAQ
📰 Paper: https://t.co/rYvJRPYUO8
📽️ Watch the full video:
We’ve also updated the repo + added more requested model runs: https://t.co/165qMVUnYe Big thanks to the community for all the feedback - keep it coming! 🙏 We’re iterating together to make SWE-Bench Pro a reliable benchmark for agentic SWE systems. More to come. ⚡️
Besides the new results, we’re also releasing:
📂 Full trajectories for these runs + previous models → https://t.co/phI9gN1ArE
🎛️ Interactive visualization on Docent → https://t.co/LWQQupJrIV
Transparency & reproducibility continue to be core to this project.
SWE-Bench Pro is designed for realistic, end-to-end SWE tasks: debugging, reasoning, editing, and verifying code. The uncapped results show what happens when models can truly think longer and harder. 🧠⚙️
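For readers unfamiliar with SWE-Bench-style grading, here is a rough sketch of the "verify" step: apply the candidate patch to the task's repository, then re-run the tests that were failing before. The repo layout, patch format, and pytest call below are illustrative placeholders, not the SWE-Bench Pro harness itself.

```python
import subprocess

def verify_patch(repo_dir: str, patch: str, fail_to_pass: list[str]) -> bool:
    """Apply a model-generated patch and re-run the previously failing tests.

    A task counts as solved only if those tests now pass. The git/pytest
    invocations here are generic placeholders for whatever harness is used.
    """
    # Apply the candidate patch to a clean checkout of the task's repository.
    apply = subprocess.run(
        ["git", "apply", "-"], cwd=repo_dir, input=patch, text=True
    )
    if apply.returncode != 0:
        return False  # patch does not even apply cleanly

    # Re-run the fail-to-pass tests that encode the task's requirements.
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *fail_to_pass], cwd=repo_dir
    )
    return result.returncode == 0
```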
By default, the SWE-Agent scaffold runs with a cost limit of $2-3 per instance. In this update, we remove that limit and measure the frontier of model capability, closer to how labs would run full-budget evals. This helps separate economic efficiency from raw competence.
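Conceptually, the cap is just a per-instance budget check inside the agent loop. The sketch below uses a hypothetical scaffold (the `agent.step` / `step.cost_usd` names are made up for illustration, not the SWE-Agent API); it only shows where a cost limit would bite and what "uncapped" changes.

```python
def run_instance(task, agent, cost_limit: float | None = 3.0):
    """Drive one agent rollout, optionally stopping once spend exceeds a budget.

    `task`, `agent`, and the step/cost bookkeeping are hypothetical stand-ins;
    the point is only where the per-instance cap applies.
    """
    total_cost = 0.0
    while not agent.is_done():
        step = agent.step(task)          # one LLM call + tool action
        total_cost += step.cost_usd      # token cost of that call

        # Capped setting: stop early once the per-instance budget is spent.
        # Uncapped setting (cost_limit=None): let the model keep working,
        # which measures raw capability rather than cost efficiency.
        if cost_limit is not None and total_cost > cost_limit:
            return agent.best_patch_so_far(), total_cost

    return agent.final_patch(), total_cost
```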
🚀 Updated results are out for SWE-Bench Pro! The top model, Sonnet 4.5, now hits 43%! And in this uncapped setting, Sonnet 4 = 42%, GPT-5 = 36%.
We launched SWE-Bench Pro last month to incredible feedback, and we’ve now updated the leaderboard with the latest models and no cost caps. SoTA models now break a 40% pass rate. Congrats to @Anthropic for sweeping the top spots!
🥇 Claude 4.5 Sonnet
🥈 Claude 4 Sonnet
🥉 Claude 4.5
Joint work from our amazing team: Xingang Guo (@Xingang20), Utkarsh Tyagi, Advait Gosai (@agxsai), Paula Vergara, Ernesto Gabriel Hernández Montoya (@eghmontoya), Chen Bo Calvin Zhang (@calvincbzhang), Bin Hu, Yunzhong He (@_yunzhong), Rakshith Sharma Srinivasa (@rsharma9201)
🔭 What’s next
We hope VisualToolBench can serve as a testbed for tracking how models blend visual + textual reasoning into a unified multimodal cognition stack.
Leaderboard, paper, and dataset 👇
📊 https://t.co/r1eAwag25P
📄 https://t.co/olgNWlOwJz
📂 huggingface.co
In short: Multimodal LLMs can see, but they still struggle to see and interact. VisualToolBench exposes the gap between visual perception and visual reasoning, and how tool use can bridge it.
We evaluated 16 leading MLLMs, and results were eye-opening 👇
1️⃣ GPT-5-think: 18.7% pass rate (best overall)
2️⃣ OpenAI models benefit from active tool use
3️⃣ Gemini-2.5-pro gains little from tools
4️⃣ 70–80% of failures = visual perception errors
Most multimodal benchmarks stop at perception. VisualToolBench moves beyond “look & answer” → toward “see, act, and think.” It treats images as a cognitive workspace where reasoning and manipulation meet.
In the real world, images are messy: rotated, blurry, cluttered. To handle them, models must crop, edit, enhance, and reason dynamically. That’s exactly what VisualToolBench measures.
📊 1,200+ open-ended tasks
🧰 Diverse domains with tool use
🧮 Detailed rubrics
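As an illustration of what such image tools can look like, here is a minimal Pillow-based sketch of crop / rotate / contrast operations an agent could chain before answering. The function names and signatures are assumptions for illustration, not VisualToolBench's actual tool API.

```python
from PIL import Image, ImageEnhance

# Illustrative image tools an agent might call on a messy input image.
# Names and signatures are hypothetical, not the benchmark's tool schema.

def crop(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Zoom into a region of interest, e.g. a small label in a cluttered photo."""
    return image.crop(box)

def rotate(image: Image.Image, degrees: float) -> Image.Image:
    """Straighten a rotated document or sign before reading it."""
    return image.rotate(degrees, expand=True)

def enhance_contrast(image: Image.Image, factor: float = 2.0) -> Image.Image:
    """Boost contrast so faint or blurry text becomes legible."""
    return ImageEnhance.Contrast(image).enhance(factor)

# A tool-using model would chain these before answering, for example:
# region = rotate(crop(img, (120, 40, 480, 220)), -7)
```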
🧠 Can your model think with images? Today we’re releasing VisualToolBench, a new benchmark for multimodal reasoning with tool use that tests whether multimodal LLMs can think with images, not just think about them.
Our team: MohammadHossein Rezaei (@mhrezaeics), Robert Vacareanu (@robert_nlp), Zihao Wang (@wzihao12), Clinton Wang (@clintonjwang), Yunzhong He (@_yunzhong), Feyza Akyürek (@afeyzaakyurek)
Paper:
📊 OnlineRubrics consistently outperforms static rubrics (both human-written & synthetic) and “catch-all” rubrics across benchmarks. It delivers higher win rates (up to +8%) on open-ended tasks and stronger results in expert domains like GPQA & GSM8K.
✅ How OnlineRubrics works:
1. Compare responses from the current model vs. a control model
2. Use an LLM to extract new criteria from their differences
3. Add those criteria to the static rubrics in real time
Rubrics adapt as the model trains.
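A compressed sketch of that loop. The judge prompt wording, function names, and one-criterion-per-line output format are assumptions for illustration, not the paper's implementation.

```python
def update_rubrics(prompt, current_model, control_model, judge_llm, rubrics):
    """One online-rubric refresh: diff two models' answers, mine new criteria.

    `judge_llm` is any LLM callable returning text; the prompt and the
    line-per-criterion format below are illustrative assumptions.
    """
    current_answer = current_model.generate(prompt)
    control_answer = control_model.generate(prompt)

    # Ask an LLM to turn the differences between the two responses into
    # checkable grading criteria, one per line.
    elicited = judge_llm(
        "Compare these two responses to the same prompt and list new grading "
        "criteria, one per line, that capture meaningful differences.\n\n"
        f"Prompt:\n{prompt}\n\n"
        f"Response A (current policy):\n{current_answer}\n\n"
        f"Response B (control):\n{control_answer}"
    )

    # Append the elicited criteria to the static rubric so grading keeps up
    # with behaviors that emerge as the policy trains.
    new_criteria = [line.strip() for line in elicited.splitlines() if line.strip()]
    return rubrics + new_criteria
```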
Rubrics highlight desired behaviors but under-specify undesired ones. This makes static rubrics easy to hack (e.g. self-praising: “the following advice is the most relevant”). And "catch-all" rubrics aren't enough because they miss emergent behaviors as models evolve.
🔄 RLHF → RLVR → Rubrics → OnlineRubrics
👤 Human feedback = noisy & coarse
🧮 Verifiable rewards = too narrow
📋 Static rubrics = rigid, easy to hack, miss emergent behaviors
💡 We introduce OnlineRubrics: elicited rubrics that evolve as models train. https://t.co/YI6pJ7jfJ1