Holistic Agent Leaderboard (hal.cs.princeton.edu)

@halevals

Followers: 57 · Following: 7 · Media: 0 · Statuses: 12

The standardized, cost-aware, and third-party leaderboard for evaluating agents.

Princeton
Joined September 2025
@ndzfs
Franck SN
2 months
UPDATE: We evaluated 16 models using two scaffolds on GAIA. Claude Sonnet 4.5 currently leads the GAIA leaderboard with 75% accuracy. 🧵
Replies: 1 · Reposts: 3 · Likes: 7
@sayashk
Sayash Kapoor
25 days
We evaluated Gemini 3 Pro and Claude 4.5 Opus, Sonnet, and Haiku on CORE-Bench.
- CORE-Bench consists of scientific reproduction tasks. Agents have to reproduce scientific papers using the code and data for the paper.
- Opus 4.1 continues to have the highest accuracy on…
Replies: 8 · Reposts: 17 · Likes: 143
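To make the task format concrete, here is a minimal sketch of what a CORE-Bench-style reproduction check could look like, assuming a paper repo that ships a results script and a JSON metrics file. The script name, file layout, and tolerance rule are invented for illustration and are not the benchmark's actual harness.

```python
import json
import subprocess

def reproduce_and_check(repo_dir: str, reported: dict[str, float],
                        tolerance: float = 0.01) -> bool:
    """Rerun the paper's pipeline, then compare recovered metrics to reported ones."""
    # Hypothetical layout: the paper ships a script that writes its headline
    # metrics to results.json in the repo root.
    subprocess.run(["python", "run_experiments.py"], cwd=repo_dir, check=True)
    with open(f"{repo_dir}/results.json") as f:
        recovered = json.load(f)
    # Count the task as reproduced only if every reported number is recovered
    # within a relative tolerance.
    return all(abs(recovered[k] - v) <= tolerance * max(abs(v), 1.0)
               for k, v in reported.items())
```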
@PKirgis
Peter Kirgis
5 days
In our most recent evaluations at @halevals, we found that Claude Opus 4.5 solves CORE-Bench. How? It creatively resolves dependency conflicts, bypasses environmental barriers via nuanced benchmark editing, and follows instructions with high…
Replies: 1 · Reposts: 13 · Likes: 59
@sayashk
Sayash Kapoor
2 months
📣New paper: Rigorous AI agent evaluation is much harder than it seems. For the last year, we have been working on infrastructure for fair agent evaluations on challenging benchmarks. Today, we release a paper that condenses our insights from 20,000+ agent rollouts on 9…
Replies: 20 · Reposts: 101 · Likes: 425
@aievalforum
AI Evaluator Forum
17 days
Today we are announcing the creation of the AI Evaluator Forum: a consortium of leading AI research organizations focused on independent, third-party evaluations. Founding AEF members: @TransluceAI @METR_Evals @RANDCorporation @halevals @SecureBio @collect_intel @Miles_Brundage
Replies: 6 · Reposts: 51 · Likes: 166
@sayashk
Sayash Kapoor
18 days
CORE-Bench is solved (using Opus 4.5 with Claude Code). TL;DR: Last week, we released results for Opus 4.5 on CORE-Bench, a benchmark that tests agents on scientific reproducibility tasks. Earlier this week, Nicholas Carlini reached out to share that an updated scaffold that uses…
Replies: 27 · Reposts: 110 · Likes: 779
@sayashk
Sayash Kapoor
3 months
We spent the last year evaluating agents for HAL. My biggest learning: We live in the Windows 95 era of agent evaluation.
Replies: 6 · Reposts: 51 · Likes: 363
@PKirgis
Peter Kirgis
3 months
OpenAI claims that hallucinations persist because evaluations reward guessing, and that GPT-5 is better calibrated. Do results from HAL support this conclusion? On AssistantBench, a general web search benchmark, GPT-5 has higher precision and lower guess rates than o3!
Replies: 1 · Reposts: 13 · Likes: 41
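For readers unfamiliar with the metrics being contrasted here, a minimal sketch under the common abstention-aware definitions (assumed, not taken from HAL's code): an agent may answer or abstain, guess rate is the fraction of questions attempted, and precision is accuracy among attempted answers only.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    answered: bool   # False means the agent abstained ("I don't know")
    correct: bool    # only meaningful when answered is True

def guess_rate(attempts: list[Attempt]) -> float:
    # Fraction of questions on which the agent committed to an answer.
    return sum(a.answered for a in attempts) / len(attempts)

def precision(attempts: list[Attempt]) -> float:
    # Accuracy among attempted answers; abstentions are excluded.
    answered = [a for a in attempts if a.answered]
    return sum(a.correct for a in answered) / len(answered) if answered else 0.0
```

This is why plain accuracy (correct / total) rewards guessing: an abstention and a wrong guess cost the same, so a guess is never penalized relative to admitting uncertainty. Precision, paired with guess rate, separates the two behaviors.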
@sayashk
Sayash Kapoor
3 months
We have added ScienceAgentBench to HAL and evaluated it with leading models (GPT-5, o3, Opus 4.1). o3 tops the leaderboard at a lower cost than GPT-5, Opus 4.1, and Sonnet 3.7 High. o4-mini Low is much cheaper than the crowd, but with similar accuracy. Grateful to so many…
@RonZiruChen
Ziru Chen
3 months
🔬 One year on: how close are today’s AI agents to truly accelerating data-driven discovery? We just incorporated ScienceAgentBench into @PrincetonCITP’s Holistic Agent Leaderboard (HAL) and benchmarked the latest frontier LLMs — and we are making progress! 👇 A quick tour of…
Replies: 0 · Reposts: 4 · Likes: 25
@sayashk
Sayash Kapoor
4 months
Can AI agents reliably navigate the web? Does the choice of agent scaffold affect web browsing ability? To answer these questions, we added Online Mind2Web, a web browsing benchmark, to the Holistic Agent Leaderboard (HAL). We evaluated 9 models (including GPT-5 and Sonnet 4)…
Replies: 12 · Reposts: 31 · Likes: 134
@sayashk
Sayash Kapoor
4 months
GPT-OSS underperforms even on benchmarks that require raw tool calling. For example, CORE-Bench requires agents to run bash commands to reproduce scientific papers. DeepSeek V3 scores 18%. GPT-OSS scores 11%. https://t.co/EVxxSqKFMe
@natolambert
Nathan Lambert
4 months
gpt-oss is a tool processing / reasoning engine only. Kind of a hard open model to use. Traction imo will be limited. Best way to get traction is to release models that are flexible, easy to use w/o tools, and reliable. Then, bespoke interesting models like tool use later
Replies: 1 · Reposts: 7 · Likes: 36
@sayashk
Sayash Kapoor
4 months
How does GPT-5 compare against Claude Opus 4.1 on agentic tasks? Since their release, we have been evaluating these models on challenging science, web, service, and code tasks. Headline result: GPT-5 is cost-effective, but so far it never tops agentic leaderboards. More evals 🧵
Replies: 31 · Reposts: 73 · Likes: 428
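A note on "cost-effective but never tops the leaderboard": HAL's cost-aware framing is naturally read as a cost-accuracy Pareto frontier, where a model earns a place by being undominated rather than by having the single highest score. The sketch below is illustrative only, with made-up numbers; it is not HAL's code.

```python
def pareto_frontier(models: dict[str, tuple[float, float]]) -> list[str]:
    """models maps name -> (cost_usd, accuracy); returns Pareto-optimal names.

    A model is dominated if some other model is at least as cheap AND at least
    as accurate, and strictly better on one of the two.
    """
    frontier = []
    for name, (cost, acc) in models.items():
        dominated = any(c <= cost and a >= acc and (c < cost or a > acc)
                        for n, (c, a) in models.items() if n != name)
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical numbers: a cheap, mid-accuracy model can sit on the frontier
# without ever topping the accuracy leaderboard.
print(pareto_frontier({"A": (2.0, 0.62), "B": (9.0, 0.71), "C": (6.0, 0.60)}))
# -> ['A', 'B']  (C is dominated by A, which is both cheaper and more accurate)
```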