Tony Lee @tonyh_lee X Profile

Tony Lee

@tonyh_lee

Followers

607

Following

137

Media

6

Statuses

90

PhD Candidate @StanfordAILab @StanfordNLP

https://t.co/H6cCvT8E7W

Stanford, CA

Joined December 2021

Stanford NLP Group

@stanfordnlp

5 days

Today, we’re overjoyed to have a 25th Anniversary Reunion of @stanfordnlp. So happy to see so many of our former students back at @Stanford. And thanks to @StanfordHAI for the venue!

9

41

312

Tony Lee

@tonyh_lee

2 months

We release all prompts, generations, and outputs and will keep AHELM as a living benchmark. Explore the leaderboard and submit scenarios and models at https://t.co/2IFD9nUV40. This was joint work led by @HaoqinT, @chwong0 and @percyliang.

github.com

Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible and transparen...

0

2

Tony Lee

@tonyh_lee

2 months

Accuracy alone isn’t enough: fairness, robustness, safety, and bias materially shift the rankings. Holistic evaluation surfaces real-world trade-offs and helps pick the right model for the right application.

1

0

4

Tony Lee

@tonyh_lee

2 months

Gemini 2.5 Pro ranks top in 5/10 aspects, but shows group unfairness on ASR tasks (p = 0.01), whereas most others do not. One of our ASR -> LLM baselines places 6th overall in the latest version of AHELM, outperforming most of the ALMs.

1

0

2

Tony Lee

@tonyh_lee

2 months

AHELM evaluates 14 ALMs (open-weight + closed API) + 3 simple ASR to LLM baselines, all under a single standardized pipeline. Our baselines are a simple 2-stage pipeline: transcribe the audio with an ASR model (e.g., Whisper-1, GPT-4o-Transcribe), then feed the transcript (+

1

0

2
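The two-stage baseline described in the post above (transcribe with an ASR model, then hand the transcript to a text-only LLM) can be sketched roughly as follows. This is a minimal illustration, not AHELM's actual code: `asr_then_llm`, the `transcribe` and `generate` callables, and the prompt wording are all hypothetical, and since the post is cut off after "(+", whatever accompanies the transcript is assumed here to be the task question.

```python
from typing import Callable


def asr_then_llm(
    audio: bytes,
    question: str,
    transcribe: Callable[[bytes], str],  # hypothetical ASR wrapper
    generate: Callable[[str], str],      # hypothetical text-only LLM wrapper
) -> str:
    """Stage 1: audio -> transcript. Stage 2: transcript + question -> answer."""
    transcript = transcribe(audio)
    # Assumed prompt layout; the real benchmark prompts may differ.
    prompt = (
        f"Transcript of the audio:\n{transcript}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)


# Stand-in stubs so the sketch runs without any external service.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate(prompt: str) -> str:
    return "yes" if "it will rain" in prompt else "unknown"


answer = asr_then_llm(
    b"it will rain tomorrow", "Is rain expected?", fake_transcribe, fake_generate
)
```

Passing the two stages in as callables keeps the sketch independent of any particular ASR or LLM provider; a real run would swap the stubs for actual model clients.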

Tony Lee

@tonyh_lee

2 months

We also introduce CoRe-Bench for long conversational audio reasoning: multi-turn dialogues where answers must be inferred from the voice identity of the speakers and the context of the speech (not just transcription). • 2,290 QA pairs grounded in 2,082 unique audio clips,

huggingface.co

1

0

2

Tony Lee

@tonyh_lee

2 months

We introduce PARADE, a synthetic audio-text set for stereotype avoidance. • 938 examples across 20 occupation pairs and 5 status pairs • Each instance has both male and female voices. PARADE measures whether models avoid biased completions given identical content but different

huggingface.co

1

0

2

Tony Lee

@tonyh_lee

2 months

We evaluate ALMs across 10 aspects: auditory perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity and safety. We also introduce two new datasets to ensure coverage over the aspects.

1

0

2

Tony Lee

@tonyh_lee

2 months

Motivation: Most audio benchmarks test just 1-2 skills and use inconsistent prompts/params, which makes apples-to-apples comparisons hard. AHELM standardizes datasets, prompts, inference settings, and metrics to enable fair, broad comparisons.

1

0

2

Tony Lee

@tonyh_lee

2 months

🎧 HELM goes audio! Announcing AHELM – a holistic evaluation of Audio-Language Models (ALMs) across 10 aspects. 📝 Paper: https://t.co/Y4QGUlXkC3 🥇 Leaderboard/prompts/raw predictions: https://t.co/M2Fjj6l18C See 🧵

4

10

45

rishi

@RishiBommasani

2 months

Today is the four-year anniversary of the foundation models paper. So much is different today. For me, when we wrote it, I had just finished the first year of my PhD. Now I have a PhD. But much is the same. An enduring lens amidst rapid technological progress is very powerful.

1

7

101

Karl Pertsch

@KarlPertsch

3 months

We just released a first dump of RoboArena eval data on HF: 4.5k eval episodes in diverse environments, with progress scores, preference labels & language explanations. Should be a great resource for anyone interested in offline eval / sim eval / reward learning etc!

Karl Pertsch

@KarlPertsch

4 months

We’re releasing the RoboArena today!🤖🦾 Fair & scalable evaluation is a major bottleneck for research on generalist policies. We’re hoping that RoboArena can help! We provide data, model code & sim evals for debugging! Submit your policies today and join the leaderboard! :) 🧵

5

9

92

Percy Liang

@percyliang

3 months

gpt-oss-120b is the top open-weight model (with Kimi K2 right on its tail) for capabilities (HELM capabilities v1.11):

15

12

166

Percy Liang

@percyliang

3 months

HELM capabilities v1.9.0 is out (Grok 4 and Kimi K2 make the top 10 overall), and Kimi K2 is the best non-thinking model:

5

15

114

Jacob Phillips

@jacob_dphillips

3 months

Excited to release DailyBench! DailyBench is an automated 4x daily benchmark that evaluates frontier model APIs on a fork of HELMLite. I built DailyBench to see if we could detect model providers quantizing weights, compressing the kv-cache, or swapping models during peak loads.

13

33

250
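One simple way to think about spotting silent provider-side changes from repeated benchmark runs is to compare the latest score against the spread of earlier runs. The sketch below is only an assumption about how such a check could look, not DailyBench's method: `flag_drift`, the z-score rule, the threshold, and the sample scores are all hypothetical.

```python
from statistics import mean, stdev


def flag_drift(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Return True if `latest` is far from the mean of earlier benchmark scores."""
    if len(history) < 2:
        raise ValueError("need at least two earlier runs to estimate spread")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All earlier runs were identical, so any change counts as drift.
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


# Made-up accuracy scores from earlier runs of the same model API.
earlier_runs = [0.81, 0.80, 0.82, 0.81, 0.80]
dropped = flag_drift(earlier_runs, 0.70)  # large drop relative to past spread
stable = flag_drift(earlier_runs, 0.81)   # within normal run-to-run noise
```

A flag from a check like this only says the score moved more than usual; it cannot by itself distinguish quantization, cache compression, or a model swap.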

Karl Pertsch

@KarlPertsch

4 months

We’re organizing the RoboArena Challenge at CoRL this year! Show the performance of your best generalist policy, in a fair, open benchmark for the robotics community! 🤖 Sign up, even if you don’t have a robot! More details in 🧵👇

2

20

130

rishi

@RishiBommasani

4 months

My PhD materials are now available! Dissertation: https://t.co/toyZragIwe Slides: https://t.co/vH4Bu8Sv3r Folks should read the acknowledgements since so many people have been so important to me along this journey!

15

29

268

Tony Lee

@tonyh_lee

4 months

🚀 We just launched RoboArena, a real-world evaluation platform for robot policies! Think Chatbot Arena, but for robotics. 📝 Paper: https://t.co/6zu18qUwfn 🌐 Website: https://t.co/pFTEglQU7d Joint work with @pranav_atreya and @KarlPertsch, advised by @percyliang,

Karl Pertsch

@KarlPertsch

4 months

We’re releasing the RoboArena today!🤖🦾 Fair & scalable evaluation is a major bottleneck for research on generalist policies. We’re hoping that RoboArena can help! We provide data, model code & sim evals for debugging! Submit your policies today and join the leaderboard! :) 🧵

0

15

39

Percy Liang

@percyliang

4 months

Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team @tatsu_hashimoto @marcelroed @neilbband @rckpudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:

46

596

5K

Percy Liang

@percyliang

5 months

What would truly open-source AI look like? Not just open weights, open code/data, but *open development*, where the entire research and development process is public *and* anyone can contribute. We built Marin, an open lab, to fulfill this vision:

51

215

1K