Peter Kirgis (@PKirgis) · AI Researcher @PrincetonCITP
Joined September 2018 · 55 followers · 53 following · 13 media · 53 statuses
📣 New paper: Rigorous AI agent evaluation is much harder than it seems. For the last year, we have been working on infrastructure for fair agent evaluations on challenging benchmarks. Today, we release a paper that condenses our insights from 20,000+ agent rollouts on 9…
We spent the last year evaluating agents for HAL. My biggest learning: We live in the Windows 95 era of agent evaluation.
This recent OpenAI paper relates to a broader myopic focus on accuracy in evaluating frontier LLMs. Accuracy is of course important, but so are cost, efficiency, calibration, precision, and recall. Using the richness of the HAL logs, we are working to expand the scope of reported…
Thank you to the HAL team for their support and feedback on this analysis, including @sayashk, @ndzfs, @random_walker, and the team at @TransluceAI for making this work possible.
You can explore the Docent logs yourself! Here is the link to the GPT-5 graded transcripts: https://t.co/jvvH39jGzA, and here is the link to the Sonnet 4 graded transcripts: https://t.co/39rMb60hie.
In addition to our inter-LLM reliability analysis, we also manually validate four runs for each model (n=48, ~30% of all flagged runs). In our manual validation, we agreed with GPT-5 87% of the time, disagreed 5% of the time, and thought 8% of cases were ambiguous. Importantly,…
In one example, Claude Opus 4 tries to find the worst-rated series (according to Rotten Tomatoes) with more than one season that Ted Danson has starred in and that is available on Amazon Prime Video (US). After reaching the maximum number of steps, Claude has found a good candidate (and…
Why do we not observe higher precision numbers for Anthropic models, given the Docent results? In many cases, it appears that Claude tries to use the appropriate JSON format in its final answer to indicate an abstention, but conflicting system prompts between the agent scaffold…
On one task, Claude Opus 4.1 High is tasked with “finding which bar is closest to Mummers Museum in Philadelphia and is wheelchair accessible.” Claude finds a good candidate solution after first finding all bars that are wheelchair accessible within a given radius. After…
Why are the Docent results from GPT-5 and Claude Sonnet 4 not more closely aligned? Beyond the small sample size, guessing is not an easily defined metric. Many of the questions on AssistantBench use some relative argmin or argmax search — for…
How reliable is Docent’s LLM-as-a-judge? When we look at inter-LLM reliability at the task level, we observe high agreement, with a Cohen’s kappa of 0.82. Overall, our two LLM judges agree with one another 92% of the time.
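For anyone who wants to run the same kind of agreement check on their own graded transcripts, here is a minimal sketch. The labels and variable names are illustrative placeholders rather than an export from the HAL pipeline; only the two metrics (percent agreement and Cohen’s kappa) mirror what is reported above.

```python
# Minimal sketch: inter-LLM agreement between two graders on per-task "guess" labels.
# The example labels below are illustrative, not taken from the HAL logs.
from sklearn.metrics import cohen_kappa_score

gpt5_labels    = [1, 0, 0, 1, 0, 1, 0, 0]  # 1 = graded as a guess by GPT-5
sonnet4_labels = [1, 0, 0, 1, 0, 0, 0, 0]  # 1 = graded as a guess by Sonnet 4

# Raw percent agreement: fraction of tasks where the two judges give the same label.
agreement = sum(a == b for a, b in zip(gpt5_labels, sonnet4_labels)) / len(gpt5_labels)

# Cohen's kappa corrects percent agreement for the agreement expected by chance.
kappa = cohen_kappa_score(gpt5_labels, sonnet4_labels)

print(f"percent agreement: {agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```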
One reason for the dramatic difference for Opus 4.1 between our grader LLMs is that Opus 4.1 has a very high abstention rate of 70%. In reality, the disagreement between our two graders for Opus 4.1 only involves three responses. In our manual validation, we agreed with GPT-5 on two…
Docent also lets us run this qualitative analysis with different LLMs to compare inter-LLM reliability. When using Sonnet 4 rather than GPT-5 to grade our logs, we see some major differences (for example, the guess rate for Opus 4.1 drops from 50% to 20%). But we still…
Here, we see GPT-5 (medium) dramatically improving on previous OpenAI models, guessing 27 percentage points less than o3 (medium)! Interestingly, prior to GPT-5, Anthropic models all guessed less than OpenAI models.
https://t.co/5veMMXg5XC Using Docent, a tool from @transluce for grading AI agent transcripts, we are able to dive deeper into these results. In particular, we evaluate the fraction of “guesses” from each model. A guess is defined as an instance where the model was unable to…
[Quoted tweet from @TransluceAI] Docent, our tool for analyzing complex AI behaviors, is now in public alpha! It helps scalably answer questions about agent behavior, like “is my model reward hacking” or “where does it violate instructions.” Today, anyone can get started with just a few lines of code!
Back to AssistantBench: The instructions for solving AssistantBench clearly state that the model should not guess if it cannot find the required information, and should instead return a blank answer. In addition to accuracy, we also track precision, that is, accuracy after…
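To make the truncated definition concrete, here is a minimal sketch of how these metrics could be computed, under one plausible reading: precision is accuracy restricted to tasks where the model returned a non-blank answer. The TaskResult record and the toy data are illustrative placeholders; the exact HAL/AssistantBench scoring may differ.

```python
# Minimal sketch of the metrics discussed here, under one plausible reading:
#   accuracy        = correct answers / all tasks
#   precision       = correct answers / tasks with a non-blank answer
#   abstention_rate = fraction of tasks where the model returned a blank answer
# The records below are illustrative, not real AssistantBench results.
from dataclasses import dataclass

@dataclass
class TaskResult:
    answer: str      # the model's final answer ("" means it abstained)
    correct: bool    # whether a non-blank answer matched the reference

def summarize(results: list[TaskResult]) -> dict[str, float]:
    attempted = [r for r in results if r.answer.strip()]
    correct = sum(r.correct for r in attempted)
    return {
        "accuracy": correct / len(results),
        "precision": correct / len(attempted) if attempted else 0.0,
        "abstention_rate": 1 - len(attempted) / len(results),
    }

# Toy example: 5 tasks, one abstention, three correct answers.
runs = [
    TaskResult("The Good Place", True),
    TaskResult("", False),        # blank answer = abstention
    TaskResult("Cheers", True),
    TaskResult("Becker", False),  # non-blank but wrong: counts against precision
    TaskResult("Fargo", True),
]
print(summarize(runs))  # accuracy 0.6, precision 0.75, abstention_rate 0.2
```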
Note: the cost of o3 is much lower than in our initial post. There, we had an error where we did not update the API pricing for o3, which was cut by 80% on June 10. This was updated on the online leaderboard at the time of posting (https://t.co/tZFWVZpZjx), but the…
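To make the size of that correction concrete, here is a small sketch of how a stale price inflates reported cost. The token count and the pre-cut price are hypothetical placeholders; only the 80% cut is taken from the note above.

```python
# Illustrative only: how an out-of-date API price inflates a run's reported cost.
# The token count and the $10 pre-cut price are placeholders; the 80% price cut
# on June 10 is the only figure taken from the note above.
old_price_per_mtok = 10.00                              # hypothetical pre-cut price per 1M tokens
new_price_per_mtok = old_price_per_mtok * (1 - 0.80)    # 80% cut -> $2.00

tokens_used = 3_000_000                                 # hypothetical tokens consumed by an o3 run

stale_cost   = tokens_used / 1_000_000 * old_price_per_mtok   # $30.00
correct_cost = tokens_used / 1_000_000 * new_price_per_mtok   # $6.00

print(f"stale: ${stale_cost:.2f}, corrected: ${correct_cost:.2f} "
      f"({stale_cost / correct_cost:.0f}x overstatement)")
```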
https://t.co/5CQ5p6FoMI A few weeks ago, the HAL team released preliminary results for GPT-5 on a series of agentic benchmarks, including AssistantBench. Surprisingly, we found that GPT-5 (medium) was not at the frontier of accuracy, scoring slightly lower than o3 (medium).
[Quoted tweet] 3) AssistantBench (web) consists of 214 web assistance tasks, of which 33 are in a public validation set, which we use for HAL. Claude 4.1 Opus performs surprisingly poorly, coming in below Sonnet 3.7 and o4-mini. o3 narrowly edges out GPT-5 Medium at almost twice the cost.
Moving beyond a simplistic notion of accuracy for LLM evaluations is a core goal of the Holistic Agent Leaderboard. We’ve been working hard to simplify the process for tracking additional metrics, which has enabled us to quickly put together analyses like this thread!
OpenAI claims that hallucinations persist because evaluations reward guessing, and that GPT-5 is better calibrated. Do results from HAL support this conclusion? On AssistantBench, a general web search benchmark, GPT-5 has higher precision and lower guess rates than o3!