Tony Lee

@tonyh_lee

Followers
584
Following
125
Media
4
Statuses
74

Working on LLaMA @AIatMeta. PhD Candidate @StanfordAILab @StanfordNLP. Author of HELM + extensions: https://t.co/xL2p3Z86mB.

Stanford, CA
Joined December 2021
@tonyh_lee
Tony Lee
1 day
RT @RishiBommasani: My PhD materials are now available! Dissertation: Slides: Folks shou…
0
26
0
@tonyh_lee
Tony Lee
15 days
🚀 We just launched RoboArena, a real-world evaluation platform for robot policies! Think Chatbot Arena, but for robotics. 📝 Paper: 🌐 Website: Joint work with @pranav_atreya and @KarlPertsch, advised by @percyliang.
@KarlPertsch
Karl Pertsch
15 days
We’re releasing the RoboArena today! 🤖🦾 Fair & scalable evaluation is a major bottleneck for research on generalist policies. We’re hoping that RoboArena can help! We provide data, model code & sim evals for debugging! Submit your policies today and join the leaderboard! :) 🧵
0
15
39
@tonyh_lee
Tony Lee
16 days
RT @percyliang: Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team @tatsu_hashimoto @marcelroed @neilbba…
0
548
0
@tonyh_lee
Tony Lee
2 months
RT @percyliang: What would truly open-source AI look like? Not just open weights, open code/data, but *open development*, where the entire…
0
193
0
@tonyh_lee
Tony Lee
3 months
RT @Ahmad_Al_Dahle: Introducing our first set of Llama 4 models! We’ve been hard at work doing a complete re-design of the Llama series. I…
0
937
0
@tonyh_lee
Tony Lee
4 months
RT @percyliang: HELM has a new leaderboard: HELM Capabilities v1.0! We curated 5 challenging datasets (MMLU-Pro, GPQA, IFEval, WildBench,…
0
26
0
@tonyh_lee
Tony Lee
4 months
RT @percyliang: 1/🧵 How do we know if AI is actually ready for healthcare? We built a benchmark, MedHELM, that tests LMs on real clinical ta…
0
70
0
@tonyh_lee
Tony Lee
4 months
RT @michiyasunaga: 📢 Introducing Multimodal RewardBench: A holistic, human-annotated benchmark for evaluating VLM reward models or judges…
0
36
0
@tonyh_lee
Tony Lee
5 months
RT @percyliang: HELM Lite v1.13.0 is out! Tasks evaluated: {NarrativeQA, NaturalQA, OpenbookQA, MMLU, GSM8K, MATH, LegalBench, MedQA, WMT14…
0
6
0
@tonyh_lee
Tony Lee
6 months
VHELM is a living, breathing benchmark for vision-language models. We welcome suggestions for new evals to include in the leaderboard. We will also continue to add new models (e.g., @xai’s new Grok-2 Vision), so stay tuned!
0
0
0
@tonyh_lee
Tony Lee
6 months
In 3rd place, Qwen2-VL Instruct 72B is the only open model in the top 10. With just 72B parameters, it often outperforms many of the closed-source models. Congrats to the @Alibaba_Qwen team!
2
0
0
@tonyh_lee
Tony Lee
6 months
In practice, o1 uses an average of ~715 reasoning tokens per example in VHELM, with peaks of up to ~20,000 tokens/example. Users should be cognizant of the extra costs that come with the reasoning tokens.
1
0
0
@tonyh_lee
Tony Lee
6 months
The latest o1 model shows its dominance in multimodal reasoning, performing the best on multiple reasoning-intensive scenarios (e.g., MathVista, Exams-V, MMMU, …). We set `max_completion_tokens` to ~25,000 following
1
0
0
@tonyh_lee
Tony Lee
6 months
Improving upon its predecessor (Gemini 1.5 Flash 002), Gemini 2.0 Flash Experimental is now at the top of the leaderboard, followed by o1 (2024-12-17) and Qwen2-VL Instruct 72B.
1
0
0
@tonyh_lee
Tony Lee
6 months
🚀 VHELM v2.1.1 (leaderboard for VLMs) is out! We added 5 new models: o1 (2024-12-17), GPT-4o (2024-11-20), Gemini 2.0 Flash Experimental, and Qwen2-VL 7B/72B. 🥇 Leaderboard/prompts with images/raw predictions: See 🧵 below.
1
11
19
@tonyh_lee
Tony Lee
7 months
RT @RishiBommasani: Today, HELM was recognized by @TmlrOrg with its best paper award! The true success of HELM has been the sustained main…
0
16
0
@tonyh_lee
Tony Lee
7 months
RT @TmlrOrg: Outstanding Certification 2: “Holistic Evaluation of Language Models” (HELM), by a large team led by Percy Liang (@percyliang)…
0
4
0
@tonyh_lee
Tony Lee
7 months
RT @AIatMeta: As we continue to explore new post-training techniques, today we're releasing Llama 3.3, a new open source model that delive…
0
508
0
@tonyh_lee
Tony Lee
7 months
RT @jwthickstun: I am recruiting PhD students for Fall '25 at Cornell! I plan to admit multiple students interested in building more contro…
0
47
0
@tonyh_lee
Tony Lee
7 months
RT @ManlingLi_: 🏆 We are thrilled that our Embodied Agent Interface received the Best Paper Award at SoCal NLP 202…
0
19
0