Scale AI (@scale_AI)
making AI work
Joined July 2016
71K Followers · 2K Following · 538 Media · 2K Statuses
Big news: Scale is growing 🌍 We’re expanding our global footprint with new offices in New York City, London, Washington, D.C., and St. Louis. This growth reflects our investment in our people and our mission to build reliable AI systems for the world’s most important…
We’re launching the Remote Labor Index (RLI) with @ai_risks, the first benchmark evaluating whether AI agents can independently complete full, paid freelance tasks. The results provide a needed reality check: automation is advancing, but still has a long way to go. RLI offers a…
Can AI automate jobs? We created the Remote Labor Index to test AI’s ability to automate hundreds of long, real-world, economically valuable projects from remote work platforms. While AIs are smart, they are not yet that useful: the current automation rate is less than 3%.
Our research team dives into MCP Atlas, one of our newest benchmarks – exploring how it evaluates models and what we’ve learned from the results.
There’s no magic wand for making AI work. Scale CEO @jdroege joined @richardquest on @cnni to share what it really takes:
We launched SWE-Bench Pro last month to incredible feedback, and we’ve now updated the leaderboard with the latest models and no cost caps. SoTA models now break a 40% pass rate. Congrats to @Anthropic for sweeping the top spots! 🥇Claude 4.5 Sonnet 🥈Claude 4 Sonnet 🥉Claude 4.5…
🚀 Introducing SWE-Bench Pro — a new benchmark to evaluate LLM coding agents on real, enterprise-grade software engineering tasks. This is the next step beyond SWE-Bench: harder, contamination-resistant, and closer to real-world repos.
Learn more about our methodology and see how models stack up:
scale.com
Explore the SEAL leaderboard with expert-driven LLM benchmarks and updated AI model leaderboards, ranking top models across coding, reasoning and more.
📣 Releasing our newest benchmark, VisualToolBench (VTB), the first benchmark designed to evaluate how well multimodal large language models (MLLMs) can dynamically interact with and reason about visual information. VTB goes beyond thinking about images; it’s about thinking with…
🔄 RLHF → RLVR → Rubrics → OnlineRubrics
👤 Human feedback = noisy & coarse
🧮 Verifiable rewards = too narrow
📋 Static rubrics = rigid, easy to hack, miss emergent behaviors
💡 We introduce OnlineRubrics: elicited rubrics that evolve as models train. https://t.co/YI6pJ7jfJ1
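A minimal sketch of the high-level idea in the OnlineRubrics post, assuming a rubric is a weighted list of criteria that gets re-elicited from fresh rollouts as training proceeds; the `elicit_rubric`, `judge`, and `update_policy` hooks below are hypothetical placeholders, not the paper's actual API or implementation.

```python
# Hypothetical sketch: rubric-based rewards whose criteria evolve during training
# (inspired by the OnlineRubrics announcement; not the paper's code).

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    description: str   # e.g. "cites a concrete source for each factual claim"
    weight: float      # relative importance of this criterion


def score_with_rubric(response: str,
                      rubric: List[Criterion],
                      judge: Callable[[str, str], float]) -> float:
    """Weighted rubric score; judge() returns a 0-1 grade per criterion."""
    if not rubric:
        return 0.0
    total_weight = sum(c.weight for c in rubric)
    return sum(c.weight * judge(response, c.description) for c in rubric) / total_weight


def training_loop(prompts, policy, judge, elicit_rubric, update_policy, steps=1000):
    """A static rubric is fixed once up front; here the rubric is re-elicited
    from the current rollouts each step, so newly emerging behaviors can be
    captured as new criteria instead of being missed."""
    rubric: List[Criterion] = []
    for _ in range(steps):
        batch = [policy(p) for p in prompts]
        # Re-elicit criteria from current behavior (the "online" part).
        rubric = elicit_rubric(prompts, batch, previous=rubric)
        rewards = [score_with_rubric(r, rubric, judge) for r in batch]
        update_policy(policy, prompts, batch, rewards)
```

The contrast with static rubrics is that the criteria list can grow over training to cover emergent behaviors the initial rubric would have missed.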
Sat down with @lennysan to talk about where AI is headed and how we’re making it work for model builders, enterprises and governments. Also went down memory lane about my time at Uber Eats. 🙂
In his first in-depth interview since taking over as @scale_AI CEO, @jdroege shares:
🔸 What actually happened with Meta’s $14 billion investment
🔸 Where frontier labs are heading next
🔸 Why most enterprise data is useless for AI models
🔸 What it takes to keep making AI model…
Welcome to Chain of Thought, exploring all things AI, research, and evaluations. This episode: how we think about different types of agents and where they’re headed next.
“I think one of the misunderstandings is that AI is this magic wand or it can solve all problems, and that’s not true today. But there is a ton of value when you get it right.” Our CEO @jdroege shared his AI success framework with CNN's @claresduffy. https://t.co/pmBKjdivLt
cnn.com
The artificial intelligence industry has a big problem: 95% of companies that try AI aren’t making any money from it, according to a report from the Massachusetts Institute of Technology last month....
New @Scale_AI paper! The culprit behind reward hacking? We trace it to misspecification in the high-reward tail. Our fix: rubric-based rewards to tell “excellent” responses apart from “great.” The result: less hacking, stronger post-training! https://t.co/D6aJkZ8zZE
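A hedged illustration of the claim in this post, not the paper's method: if a scalar reward model is misspecified in its high-reward tail, gating the top slice of reward on explicit rubric checks is one way to separate “excellent” responses from merely “great” ones. All names and thresholds below are hypothetical.

```python
# Hypothetical sketch: a scalar reward model that saturates near the top cannot
# distinguish "excellent" from "great", which leaves the high-reward tail open
# to hacking. Rubric checks gate the final slice of reward in that tail.

from typing import Callable, List


def combined_reward(response: str,
                    scalar_rm: Callable[[str], float],        # base reward model, returns 0-1
                    tail_checks: List[Callable[[str], bool]], # rubric criteria for top responses
                    tail_threshold: float = 0.9) -> float:
    base = scalar_rm(response)
    if base < tail_threshold:
        # Below the tail, the scalar signal is assumed to be well specified.
        return base
    # In the high-reward tail, require explicit rubric criteria to be met
    # before granting the remaining reward.
    if not tail_checks:
        return base
    passed = sum(check(response) for check in tail_checks)
    frac = passed / len(tail_checks)
    return tail_threshold + (1.0 - tail_threshold) * frac
```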