Scale AI
@scale_AI
Followers
72K
Following
2K
Media
563
Statuses
2K
making AI work
Joined July 2016
We recently introduced MCP-Atlas, a benchmark for evaluating how well LLMs handle tool use via the Model Context Protocol. Even top models failed nearly half of realistic multi-tool tasks. Today, we’re open-sourcing the benchmark so you can measure performance yourself.
Speech isn’t just text read out loud. 💬 Real conversations are dynamic, full of interruptions, and context-rich — and benchmarks should match. Introducing Audio MultiChallenge (Audio MC), the first benchmark built to test how well native Speech-to-Speech models handle real
Major drop today by @GoogleAI! ⚡️ Gemini 3 Flash scored 🥈 on MCP Atlas and is tracking strong on Humanity’s Last Exam.
Introducing Gemini 3 Flash, our frontier intelligence model, available at scale for everyone. It excels at coding, tool calling, and is stronger than 2.5 Pro across most metrics!! ⚡️ Available in the API at $0.50 in / 1M tokens and $3.00 out / 1M tokens across.
GPT-5 Pro for very hard problems:
GPT-5 Pro by @OpenAI is the Best Reasoning Model of 2025. 🏆 Calculated across SEAL’s reasoning leaderboards, GPT-5 Pro was the best at answering complicated questions, explaining its thinking, and solving multi-step problems.
talked to Scale's head of research about creating the Oscars for AI
sources.news
Scale's head of research: “Evaluation is falling behind the development of model capabilities.”
GPT-5 Chat by @OpenAI and Claude Sonnet 4.5 by @AnthropicAI are the People’s Favorite Models of 2025. 🏆 Determined by performance on SEAL Showdown, where real users pick the better response in head-to-head comparisons, GPT-5 Chat and Sonnet 4.5 were the big winners.
Claude Opus 4.5 by @AnthropicAI is the Best Agentic Model of 2025. 🏆 Across leaderboards that test models on ambiguous tasks — like multi-step projects and debugging — Opus 4.5 was the top performer.
Claude Sonnet 4.5 by @AnthropicAI is the Best Safety Model of 2025. 🏆 Measuring across all safety evaluations, Sonnet 4.5 excelled at staying consistent, following safety guidelines, and avoiding unsafe outputs, even when under pressure.
See the full list of winners:
scale.com
Which models ruled 2025? Based on 450+ evals, see who topped the charts in the inaugural SEAL Models of the Year Awards.
Introducing Scale’s Model of the Year Awards. 🏆 These awards, based entirely on SEAL Leaderboard performance, celebrate the best models across six major categories.
Hundreds of models stand before us, but we only have six photos in our hands. Tune in tomorrow, December 16th, to see who will be crowned Scale’s Next Top AI Models of 2025.