
Tony Lee
@tonyh_lee
Followers
584
Following
125
Media
4
Statuses
74
Working on LLaMA @AIatMeta. PhD Candidate @StanfordAILab @StanfordNLP. Author of HELM + extensions: https://t.co/xL2p3Z86mB.
Stanford, CA
Joined December 2021
🚀 We just launched RoboArena, a real-world evaluation platform for robot policies! Think Chatbot Arena, but for robotics. 📝 Paper: 🌐 Website: Joint work with @pranav_atreya and @KarlPertsch, advised by @percyliang,
We’re releasing the RoboArena today! 🤖🦾 Fair & scalable evaluation is a major bottleneck for research on generalist policies. We’re hoping that RoboArena can help! We provide data, model code & sim evals for debugging! Submit your policies today and join the leaderboard! :) 🧵
0
15
39
RT @percyliang: Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team @tatsu_hashimoto @marcelroed @neilbba…
0
548
0
RT @percyliang: What would truly open-source AI look like? Not just open weights, open code/data, but *open development*, where the entire…
0
193
0
RT @Ahmad_Al_Dahle: Introducing our first set of Llama 4 models! We’ve been hard at work doing a complete re-design of the Llama series. I…
0
937
0
RT @percyliang: HELM has a new leaderboard: HELM Capabilities v1.0! We curated 5 challenging datasets (MMLU-Pro, GPQA, IFEval, WildBench,…
0
26
0
RT @percyliang: 1/🧵 How do we know if AI is actually ready for healthcare? We built a benchmark, MedHELM, that tests LMs on real clinical ta…
0
70
0
RT @michiyasunaga: 📢 Introducing Multimodal RewardBench: A holistic, human-annotated benchmark for evaluating VLM reward models or judges…
0
36
0
RT @percyliang: HELM Lite v1.13.0 is out! Tasks evaluated: {NarrativeQA, NaturalQA, OpenbookQA, MMLU, GSM8K, MATH, LegalBench, MedQA, WMT14…
0
6
0
VHELM is a living, breathing benchmark for vision-language models. We welcome suggestions for new evals to include for the leaderboard: We will also continue to add new models (e.g., @xai’s new Grok-2 Vision), so stay tuned!
0
0
0
In 3rd place, Qwen2-VL Instruct 72B is the only open model in the top 10. With just 72B parameters, it often outperforms many of the closed-source models. Congrats to the @Alibaba_Qwen team!
2
0
0
RT @RishiBommasani: Today, HELM was recognized by @TmlrOrg with its best paper award! The true success of HELM has been the sustained main…
0
16
0
RT @TmlrOrg: Outstanding Certification 2: “Holistic Evaluation of Language Models” (HELM), by a large team led by Percy Liang (@percyliang)…
0
4
0
RT @AIatMeta: As we continue to explore new post-training techniques, today we're releasing Llama 3.3, a new open source model that delive…
0
508
0
RT @jwthickstun: I am recruiting PhD students for Fall '25 at Cornell! I plan to admit multiple students interested in building more contro…
0
47
0
RT @ManlingLi_: 🏆 We are thrilled that our Embodied Agent Interface received the Best Paper Award at SoCal NLP 202…
0
19
0