Peter Kirgis Profile
Peter Kirgis (@PKirgis)
Followers: 55 · Following: 53 · Media: 13 · Statuses: 53

AI Researcher @PrincetonCITP

Joined September 2018
@sayashk
Sayash Kapoor
21 days
📣New paper: Rigorous AI agent evaluation is much harder than it seems. For the last year, we have been working on infrastructure for fair agent evaluations on challenging benchmarks. Today, we release a paper that condenses our insights from 20,000+ agent rollouts on 9…
20 · 99 · 423
@sayashk
Sayash Kapoor
2 months
We spent the last year evaluating agents for HAL. My biggest learning: We live in the Windows 95 era of agent evaluation.
6 · 51 · 364
@PKirgis
Peter Kirgis
2 months
This recent OpenAI paper relates to a broader myopia on accuracy in evaluating frontier LLMs. Accuracy is of course important, but so are cost, efficiency, calibration, precision, and recall. Using the richness of the HAL logs, we are working to expand the scope of reported…
0 · 0 · 3
@PKirgis
Peter Kirgis
2 months
Thank you to the HAL team for their support and feedback on this analysis, including @sayashk, @ndzfs, @random_walker, and the team at @TransluceAI for making this work possible.
1 · 0 · 2
@PKirgis
Peter Kirgis
2 months
You can explore the Docent logs yourself! Here is the link to the GPT-5 graded transcripts: https://t.co/jvvH39jGzA, and here is the link to the Sonnet 4 graded transcripts: https://t.co/39rMb60hie.
1 · 0 · 2
@PKirgis
Peter Kirgis
2 months
In addition to our inter-LLM reliability analysis, we also manually validate four runs for each model (n=48, ~30% of all flagged runs). In our manual validation, we agreed with GPT-5 87% of the time, disagreed 5% of the time, and thought 8% of cases were ambiguous. Importantly,…
1 · 0 · 2
@PKirgis
Peter Kirgis
2 months
In one example, Claude Opus 4 tries to find the worst-rated series (according to Rotten Tomatoes) with more than one season that Ted Danson has starred in and that is available on Amazon Prime Video (US). After reaching the maximum number of steps, Claude has found a good candidate (and…
1 · 0 · 2
@PKirgis
Peter Kirgis
2 months
Why do we not observe higher precision numbers for Anthropic models, given the Docent results? In many cases, it appears that Claude tries to use the appropriate JSON format in its final answer to indicate an abstention, but conflicting system prompts between the agent scaffold…
1 · 0 · 2
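To make that failure mode concrete, here is a minimal sketch of lenient abstention detection in a final answer. The schema ({"answer": ""}) and the fallback regex are assumptions for illustration; the actual AssistantBench scaffold convention may differ.

```python
import json
import re

def detect_abstention(final_answer: str) -> bool:
    """Leniently detect an abstention in an agent's final answer.

    Assumes an abstention looks like {"answer": ""}; the real scaffold's
    convention may differ.
    """
    # Strict path: the whole final answer is a JSON object.
    try:
        payload = json.loads(final_answer)
        if isinstance(payload, dict):
            ans = payload.get("answer")
            return isinstance(ans, str) and ans.strip() == ""
    except json.JSONDecodeError:
        pass
    # Lenient path: a JSON object buried in surrounding prose, which is
    # what conflicting system prompts tend to produce.
    return re.search(r'\{[^{}]*"answer"\s*:\s*""[^{}]*\}', final_answer) is not None
```

A strict grader that only accepts the first path would score the buried-JSON case as a wrong answer rather than an abstention, which is consistent with the precision gap described above.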
@PKirgis
Peter Kirgis
2 months
On one task, Claude Opus 4.1 High is tasked with “finding which bar is closest to Mummers Museum in Philadelphia and is wheelchair accessible.” Claude finds a good candidate solution after first finding all bars that are wheelchair accessible within a given radius. After…
1 · 0 · 2
@PKirgis
Peter Kirgis
2 months
Why are the Docent results from GPT-5 and Claude Sonnet 4 not more closely aligned? In addition to the low sample size issue, it is also because guessing is not an easily defined metric. Many of the questions on AssistantBench use some relative argmin or argmax search — for…
1 · 0 · 2
@PKirgis
Peter Kirgis
2 months
How reliable is Docent’s LLM-as-a-judge? When we look at inter-LLM reliability at the task level, we observe high inter-LLM agreement, with a Cohen’s kappa of 0.82. Overall, our two LLM judges agree with one another 92% of the time.
1 · 0 · 2
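For readers who want to reproduce the agreement numbers on their own labels, here is a minimal sketch of Cohen's kappa over two judges' per-task verdicts (treating the verdicts as binary guess/no-guess labels is an assumption about how the comparison is set up):

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement under independent per-judge label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_exp = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return 1.0 if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)
```

Because kappa discounts chance agreement, it sits below raw agreement whenever one label dominates, which is why 92% raw agreement corresponds to a kappa of 0.82 rather than something closer to 1.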
@PKirgis
Peter Kirgis
2 months
One reason for the dramatic difference for Opus 4.1 between our grader LLMs is that Opus 4.1 has a very high abstention rate of 70%. In reality, the disagreement between our two graders for Opus 4.1 only involves three responses. In our manual validation, we agreed with GPT-5 on two…
1 · 0 · 2
@PKirgis
Peter Kirgis
2 months
Docent also allows us to use different LLMs to run this qualitative analysis to compare inter-LLM reliability. When using Sonnet 4 rather than GPT-5 to grade our logs, we see some major differences (for example, the guess rate for Opus 4.1 drops from 50% to 20%). But we still…
1 · 0 · 3
@PKirgis
Peter Kirgis
2 months
Here, we see GPT-5 (medium) dramatically improving on previous OpenAI models, guessing 27 pp less than o3 (medium)! Interestingly, prior to GPT-5, Anthropic models all guessed less than OpenAI models.
1 · 0 · 3
@PKirgis
Peter Kirgis
2 months
https://t.co/5veMMXg5XC Using Docent, a tool from @transluce for grading AI agent transcripts, we are able to dive deeper into these results. In particular, we evaluate the fraction of “guesses” from each model. A guess is defined as an instance where the model was unable to…
@TransluceAI
Transluce
2 months
Docent, our tool for analyzing complex AI behaviors, is now in public alpha! It helps scalably answer questions about agent behavior, like “is my model reward hacking” or “where does it violate instructions.” Today, anyone can get started with just a few lines of code!
1 · 0 · 3
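Docent's actual interface is documented by Transluce; purely as an illustration of the measurement itself, here is a sketch that tallies a guess rate from an LLM judge's per-transcript verdicts. The rubric wording and the `judge` callable are both hypothetical, not Docent's API.

```python
from typing import Callable

# Hypothetical rubric; the wording used with Docent may differ.
RUBRIC = (
    "Read the agent transcript. Reply GUESS if the agent submitted a final "
    "answer without having actually found the required information, and "
    "NO_GUESS otherwise."
)

def guess_rate(transcripts: list[str], judge: Callable[[str, str], str]) -> float:
    """Fraction of runs the judge flags as guesses.

    `judge(rubric, transcript)` is a stand-in for any LLM call that returns
    the judge model's verdict as text.
    """
    verdicts = [judge(RUBRIC, t) for t in transcripts]
    return sum(v.strip().upper().startswith("GUESS") for v in verdicts) / len(verdicts)
```

Running the same tally with two judge models (here, GPT-5 and Sonnet 4) is what produces the inter-LLM reliability comparison discussed above.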
@PKirgis
Peter Kirgis
2 months
Back to AssistantBench: The instructions for solving AssistantBench clearly state that the model should not guess if it cannot find the required information, and should instead return a blank answer. In addition to accuracy, we also track precision, that is, accuracy after excluding abstentions.
1 · 0 · 3
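Under that reading, precision is just accuracy restricted to the runs where the model actually answered. A minimal sketch, with toy numbers invented for illustration:

```python
def precision_after_abstention(results: list[tuple[bool, bool]]) -> float:
    """results holds (correct, abstained) per run; precision is accuracy
    over the runs where the model actually submitted an answer."""
    answered = [correct for correct, abstained in results if not abstained]
    return sum(answered) / len(answered) if answered else float("nan")

# Toy numbers: 10 runs, 3 abstentions, 5 correct answers.
runs = [(True, False)] * 5 + [(False, False)] * 2 + [(False, True)] * 3
assert abs(precision_after_abstention(runs) - 5 / 7) < 1e-9  # ~71.4%
```

A model that abstains whenever it is unsure can have modest accuracy but high precision, which is exactly the trade-off the thread is tracking.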
@PKirgis
Peter Kirgis
2 months
Note: the cost of o3 is much lower than in our initial post, which contained an error: we did not update the API pricing for o3 after it was cut by 80% on June 10. This was updated on the online leaderboard at the time of posting (https://t.co/tZFWVZpZjx), but the…
1 · 0 · 3
@PKirgis
Peter Kirgis
2 months
https://t.co/5CQ5p6FoMI A few weeks ago, the HAL team released preliminary results for GPT-5 on a series of agentic benchmarks, including AssistantBench. Surprisingly, we found that GPT-5 (medium) was not at the frontier of accuracy, scoring slightly lower than o3 (medium).
@sayashk
Sayash Kapoor
3 months
3) AssistantBench (web) consists of 214 web assistance tasks, of which 33 are in a public validation set, which we use for HAL. Claude 4.1 Opus performs surprisingly poorly, coming in below Sonnet 3.7 and o4-mini. o3 narrowly edges out GPT-5 Medium at almost twice the cost.
1 · 0 · 4
@PKirgis
Peter Kirgis
2 months
Moving beyond a simplistic notion of accuracy for LLM evaluations is a core goal of the Holistic Agent Leaderboard. We’ve been working hard to simplify the process for tracking additional metrics, which has enabled us to quickly put together analyses like this thread!
2 · 0 · 7
@PKirgis
Peter Kirgis
2 months
OpenAI claims that hallucinations persist because evaluations reward guessing, and that GPT-5 is better calibrated. Do the results from HAL support this conclusion? On AssistantBench, a general web search benchmark, GPT-5 has higher precision and lower guess rates than o3!
1 · 13 · 39