Peter Kirgis (@PKirgis) · AI Researcher @PrincetonCITP
Joined September 2018 · 55 followers · 53 following · 13 media · 53 statuses
📣 New paper: Rigorous AI agent evaluation is much harder than it seems. For the last year, we have been working on infrastructure for fair agent evaluations on challenging benchmarks. Today, we release a paper that condenses our insights from 20,000+ agent rollouts on 9…
We spent the last year evaluating agents for HAL. My biggest learning: We live in the Windows 95 era of agent evaluation.
This recent OpenAI paper relates to a broader myopic focus on accuracy in evaluating frontier LLMs. Accuracy is of course important, but so are cost, efficiency, calibration, precision, and recall. Using the richness of the HAL logs, we are working to expand the scope of reported…
Thank you to the HAL team for their support and feedback on this analysis, including @sayashk, @ndzfs, @random_walker, and the team at @TransluceAI for making this work possible.
You can explore the Docent logs yourself! Here is the link to the GPT-5 graded transcripts: https://t.co/jvvH39jGzA, and here is the link to the Sonnet 4 graded transcripts: https://t.co/39rMb60hie.
In addition to our inter-LLM reliability analysis, we also manually validate four runs for each model (n=48, ~30% of all flagged runs). In our manual validation, we agreed with GPT-5 87% of the time, disagreed 5% of the time, and thought 8% of cases were ambiguous. Importantly,…
In one example, Claude Opus 4 tries to find the worst-rated series (according to Rotten Tomatoes) with more than one season that Ted Danson has starred in and that is available on Amazon Prime Video (US). After reaching the maximum number of steps, Claude has found a good candidate (and…
Why do we not observe higher precision numbers for Anthropic models, given the Docent results? In many cases, it appears that Claude tries to use the appropriate JSON format in its final answer to indicate an abstention, but conflicting system prompts between the agent scaffold…
On one task, Claude Opus 4.1 High is tasked with “finding which bar is closest to Mummers Museum in Philadelphia and is wheelchair accessible.” Claude finds a good candidate solution after first finding all bars that are wheelchair accessible within a given radius. After…
Why are the Docent results from GPT-5 and Claude Sonnet 4 not more closely aligned? Beyond the small sample size, guessing is not an easily defined metric. Many of the questions on AssistantBench use some relative argmin or argmax search — for…
How reliable is Docent’s LLM-as-a-judge? When we look at inter-LLM reliability at the task level, we observe high agreement, with a Cohen’s kappa of 0.82. Overall, our two LLM judges agree with one another 92% of the time.
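For anyone who wants to run the same kind of agreement check on their own graded transcripts, here is a minimal sketch. The labels and variable names are illustrative placeholders rather than an export from the HAL pipeline; only the two metrics (percent agreement and Cohen’s kappa) mirror what is reported above.

```python
# Minimal sketch: inter-LLM agreement between two graders on per-task "guess" labels.
# The example labels below are illustrative, not taken from the HAL logs.
from sklearn.metrics import cohen_kappa_score

gpt5_labels    = [1, 0, 0, 1, 0, 1, 0, 0]  # 1 = graded as a guess by GPT-5
sonnet4_labels = [1, 0, 0, 1, 0, 0, 0, 0]  # 1 = graded as a guess by Sonnet 4

# Raw percent agreement: fraction of tasks where the two judges give the same label.
agreement = sum(a == b for a, b in zip(gpt5_labels, sonnet4_labels)) / len(gpt5_labels)

# Cohen's kappa corrects percent agreement for the agreement expected by chance.
kappa = cohen_kappa_score(gpt5_labels, sonnet4_labels)

print(f"percent agreement: {agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```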
One reason for the dramatic difference for Opus 4.1 between our grader LLMs is that Opus 4.1 has a very high abstention rate of 70%. In reality, the disagreement between our two graders for Opus 4.1 only involves three responses. In our manual validation, we agreed with GPT-5 on two…
Docent also lets us run this qualitative analysis with different LLMs to compare inter-LLM reliability. When using Sonnet 4 rather than GPT-5 to grade our logs, we see some major differences (for example, the guess rate for Opus 4.1 drops from 50% to 20%). But we still…
Here, we see GPT-5 (medium) dramatically improving on previous OpenAI models, guessing 27 percentage points less than o3 (medium)! Interestingly, prior to GPT-5, Anthropic models all guessed less than OpenAI models.
https://t.co/5veMMXg5XC Using Docent, a tool from @transluce for grading AI agent transcripts, we are able to dive deeper into these results. In particular, we evaluate the fraction of “guesses” from each model. A guess is defined as an instance where the model was unable to…
[Quoted tweet from @TransluceAI] Docent, our tool for analyzing complex AI behaviors, is now in public alpha! It helps scalably answer questions about agent behavior, like “is my model reward hacking” or “where does it violate instructions.” Today, anyone can get started with just a few lines of code!
Back to AssistantBench: The instructions for solving AssistantBench clearly state that the model should not guess if it cannot find the required information, and should instead return a blank answer. In addition to accuracy, we also track precision, that is, accuracy after…
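To make the truncated definition concrete, here is a minimal sketch of how these metrics could be computed, under one plausible reading: precision is accuracy restricted to tasks where the model returned a non-blank answer. The TaskResult record and the toy data are illustrative placeholders; the exact HAL/AssistantBench scoring may differ.

```python
# Minimal sketch of the metrics discussed here, under one plausible reading:
#   accuracy        = correct answers / all tasks
#   precision       = correct answers / tasks with a non-blank answer
#   abstention_rate = fraction of tasks where the model returned a blank answer
# The records below are illustrative, not real AssistantBench results.
from dataclasses import dataclass

@dataclass
class TaskResult:
    answer: str      # the model's final answer ("" means it abstained)
    correct: bool    # whether a non-blank answer matched the reference

def summarize(results: list[TaskResult]) -> dict[str, float]:
    attempted = [r for r in results if r.answer.strip()]
    correct = sum(r.correct for r in attempted)
    return {
        "accuracy": correct / len(results),
        "precision": correct / len(attempted) if attempted else 0.0,
        "abstention_rate": 1 - len(attempted) / len(results),
    }

# Toy example: 5 tasks, one abstention, three correct answers.
runs = [
    TaskResult("The Good Place", True),
    TaskResult("", False),        # blank answer = abstention
    TaskResult("Cheers", True),
    TaskResult("Becker", False),  # non-blank but wrong: counts against precision
    TaskResult("Fargo", True),
]
print(summarize(runs))  # accuracy 0.6, precision 0.75, abstention_rate 0.2
```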
Note: the cost of o3 is much lower than in our initial post. There, we had an error where we did not update the API pricing for o3, which was cut by 80% on June 10. This was updated on the online leaderboard at the time of posting (https://t.co/tZFWVZpZjx), but the…
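To make the size of that correction concrete, here is a small sketch of how a stale price inflates reported cost. The token count and the pre-cut price are hypothetical placeholders; only the 80% cut is taken from the note above.

```python
# Illustrative only: how an out-of-date API price inflates a run's reported cost.
# The token count and the $10 pre-cut price are placeholders; the 80% price cut
# on June 10 is the only figure taken from the note above.
old_price_per_mtok = 10.00                              # hypothetical pre-cut price per 1M tokens
new_price_per_mtok = old_price_per_mtok * (1 - 0.80)    # 80% cut -> $2.00

tokens_used = 3_000_000                                 # hypothetical tokens consumed by an o3 run

stale_cost   = tokens_used / 1_000_000 * old_price_per_mtok   # $30.00
correct_cost = tokens_used / 1_000_000 * new_price_per_mtok   # $6.00

print(f"stale: ${stale_cost:.2f}, corrected: ${correct_cost:.2f} "
      f"({stale_cost / correct_cost:.0f}x overstatement)")
```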
https://t.co/5CQ5p6FoMI A few weeks ago, the HAL team released preliminary results for GPT-5 on a series of agentic benchmarks, including AssistantBench. Surprisingly, we found that GPT-5 (medium) was not at the frontier of accuracy, scoring slightly lower than o3 (medium).
[Quoted tweet] 3) AssistantBench (web) consists of 214 web assistance tasks, of which 33 are in a public validation set, which we use for HAL. Claude 4.1 Opus performs surprisingly poorly, coming in below Sonnet 3.7 and o4-mini. o3 narrowly edges out GPT-5 Medium at almost twice the cost.
Moving beyond a simplistic notion of accuracy for LLM evaluations is a core goal of the Holistic Agent Leaderboard. We’ve been working hard to simplify the process for tracking additional metrics, which has enabled us to quickly put together analyses like this thread!
OpenAI claims that hallucinations persist because evaluations reward guessing, and that GPT-5 is better calibrated. Do results from HAL support this conclusion? On AssistantBench, a general web search benchmark, GPT-5 has higher precision and lower guess rates than o3!