Holistic Agent Leaderboard (hal.cs.princeton.edu)
@halevals
Followers: 57 · Following: 7 · Media: 0 · Statuses: 12
The standardized, cost-aware, and third-party leaderboard for evaluating agents.
Princeton · Joined September 2025
UPDATE: We evaluated 16 models using two scaffolds on GAIA. Claude Sonnet 4.5 currently leads the GAIA leaderboard with 75% accuracy. 🧵
We evaluated Gemini Pro 3 and Claude 4.5 Opus, Sonnet, and Haiku on CORE-Bench.
- CORE-Bench consists of scientific reproduction tasks: agents have to reproduce scientific papers using the code and data for the paper.
- Opus 4.1 continues to have the highest accuracy on…
In our most recent evaluations at @halevals, we found that Claude Opus 4.5 solves CORE-Bench. How? It creatively resolves dependency conflicts, bypasses environmental barriers via nuanced benchmark editing, and follows instructions with high…
📣 New paper: Rigorous AI agent evaluation is much harder than it seems. For the last year, we have been working on infrastructure for fair agent evaluations on challenging benchmarks. Today, we release a paper that condenses our insights from 20,000+ agent rollouts on 9…
Today we are announcing the creation of the AI Evaluator Forum: a consortium of leading AI research organizations focused on independent, third-party evaluations. Founding AEF members: @TransluceAI @METR_Evals @RANDCorporation @halevals @SecureBio @collect_intel @Miles_Brundage
CORE-Bench is solved (using Opus 4.5 with Claude Code)
TL;DR: Last week, we released results for Opus 4.5 on CORE-Bench, a benchmark that tests agents on scientific reproducibility tasks. Earlier this week, Nicholas Carlini reached out to share that an updated scaffold that uses…
We spent the last year evaluating agents for HAL. My biggest learning: We live in the Windows 95 era of agent evaluation.
OpenAI claims that hallucinations persist because evaluations reward guessing, and that GPT-5 is better calibrated. Do HAL's results support this conclusion? On AssistantBench, a general web search benchmark, GPT-5 has higher precision and lower guess rates than o3!
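To make the precision vs. guess-rate distinction concrete, here is a minimal sketch of how the two metrics can be computed when an agent is allowed to abstain instead of guessing. This is not HAL's actual scoring code; the record format and field names are illustrative assumptions.

```python
# Minimal sketch of the precision / guess-rate distinction above.
# The record format ({"answered": bool, "correct": bool}) is an illustrative
# assumption, not HAL's schema; an agent that abstains has answered=False.

def score(records):
    answered = [r for r in records if r["answered"]]
    guess_rate = len(answered) / len(records) if records else 0.0
    # Precision: of the questions the agent chose to answer, how many were right.
    precision = sum(r["correct"] for r in answered) / len(answered) if answered else 0.0
    # Accuracy over all questions, counting abstentions as wrong.
    accuracy = sum(r["correct"] for r in records) / len(records) if records else 0.0
    return {"guess_rate": guess_rate, "precision": precision, "accuracy": accuracy}

if __name__ == "__main__":
    # A better-calibrated model answers less often, but is right more often when it does.
    demo = [
        {"answered": True, "correct": True},
        {"answered": True, "correct": False},
        {"answered": False, "correct": False},
    ]
    print(score(demo))  # guess_rate ≈ 0.67, precision 0.5, accuracy ≈ 0.33
```

A model can raise its accuracy by guessing more, which is exactly the incentive the OpenAI claim points to; reporting guess rate and precision separately makes that trade-off visible.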
We have added ScienceAgentBench to HAL and evaluated it with leading models (GPT-5, o3, Opus 4.1). o3 tops the leaderboard at a lower cost than GPT-5, Opus 4.1, and Sonnet 3.7 High. o4-mini Low is much cheaper than the rest of the field, with similar accuracy. Grateful to so many…
🔬 One year on: how close are today’s AI agents to truly accelerating data-driven discovery? We just incorporated ScienceAgentBench into @PrincetonCITP’s Holistic Agent Leaderboard (HAL) and benchmarked the latest frontier LLMs, and we are making progress! 👇 A quick tour of…
Can AI agents reliably navigate the web? Does the choice of agent scaffold affect web browsing ability? To answer these questions, we added Online Mind2Web, a web browsing benchmark, to the Holistic Agent Leaderboard (HAL). We evaluated 9 models (including GPT-5 and Sonnet 4)…
GPT-OSS underperforms even on benchmarks that require raw tool calling. For example, CORE-Bench requires agents to run bash commands to reproduce scientific papers (a sketch of such a tool-calling loop is below). DeepSeek V3 scores 18%. GPT-OSS scores 11%. https://t.co/EVxxSqKFMe
gpt-oss is a tool processing / reasoning engine only. Kind of a hard open model to use. Traction imo will be limited. Best way to get traction is to release models that are flexible, easy to use w/o tools, and reliable. Then, bespoke interesting models like tool use later
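For context on what "raw tool calling" means here, below is a hedged sketch of the kind of bash tool-calling loop a CORE-Bench-style harness runs: the model proposes a shell command, the harness executes it and feeds the output back. `propose_command` is a placeholder for whatever model API is used, not a real client call, and the step limits are arbitrary.

```python
import subprocess

def propose_command(transcript):
    """Placeholder for a model call that returns the next bash command as a string,
    or None when the agent decides it is done. Not a real API."""
    raise NotImplementedError

def run_agent(task_prompt, max_steps=20, timeout=300):
    transcript = [("task", task_prompt)]
    for _ in range(max_steps):
        cmd = propose_command(transcript)
        if cmd is None:  # the agent signals that it has finished
            break
        result = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout
        )
        # Feed stdout/stderr back so the model can react to e.g. dependency errors.
        transcript.append(("command", cmd))
        transcript.append(("output", result.stdout + result.stderr))
    return transcript
```

A model that cannot reliably drive this kind of loop will struggle on CORE-Bench regardless of its raw reasoning ability, which is the gap the scores above point to.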
How does GPT-5 compare against Claude Opus 4.1 on agentic tasks? Since their release, we have been evaluating these models on challenging science, web, service, and code tasks. Headline result: while cost-effective, GPT-5 has so far never topped an agentic leaderboard. More evals 🧵
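Since HAL reports cost next to accuracy, "cost-effective but not leaderboard-topping" is a statement about the cost-accuracy Pareto frontier. A small sketch of how such a frontier can be computed; the numbers are purely illustrative, not HAL results.

```python
def pareto_frontier(runs):
    """runs: dicts with 'model', 'cost' (e.g. USD per task) and 'accuracy'.
    A run is on the frontier if no other run is both at least as cheap
    and at least as accurate."""
    frontier = []
    for r in runs:
        dominated = any(
            o is not r and o["cost"] <= r["cost"] and o["accuracy"] >= r["accuracy"]
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda run: run["cost"])

if __name__ == "__main__":
    # Purely illustrative numbers, not HAL results.
    runs = [
        {"model": "model_a", "cost": 0.40, "accuracy": 0.62},
        {"model": "model_b", "cost": 1.10, "accuracy": 0.71},
        {"model": "model_c", "cost": 1.00, "accuracy": 0.60},  # dominated by model_a
    ]
    print(pareto_frontier(runs))  # model_a and model_b remain on the frontier
```

A model can sit on this frontier (cheap for its accuracy) without ever holding the top accuracy spot, which is how a model can be cost-effective yet never lead a leaderboard.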