
Hao Zhang
@haozhangml
Followers: 5K · Following: 1K · Media: 7 · Statuses: 660
Asst. Prof. @HDSIUCSD and @ucsd_cse running @haoailab. Cofounder and runs @lmsysorg. 20% with @Snowflake
San Francisco
Joined July 2021
Pokémon Red has recently emerged as an evaluation benchmark, adopted by several top AI labs. But is it really a good benchmark for evaluating LLM capabilities or guiding LLM research? We wrote this blog to dive into the challenges, surface the opportunities, and introduce.
🔥 Pokémon Red is becoming a go-to benchmark for testing advanced AIs such as Gemini. But is Pokémon Red really a good eval? We study this problem and identify three issues: 1️⃣ Navigation tasks are too hard. 2️⃣ Combat control is too simple. 3️⃣ Raising a strong Pokémon team is
RT @BeidiChen: Say hello to Multiverse — the Everything Everywhere All At Once of generative modeling. 💥 Lossless, adaptive, and gloriousl…
RT @SemiAnalysis_: Great work to sglang team at @lmsysorg showing the performance gains enabled by: - High rank EP optimization - Disaggre…
Curious how o3-pro performs beyond math & code? We just threw Tetris at it. ❌ Most models: game over after a few moves. ✅ o3-pro: still stacking, basically endless. Big jump in spatial planning. (also much better than other models on the more challenging Sokoban). See the.
[Lmgame Bench] o3-pro: A Milestone in LLM Gaming! 🕹️ The leap from o3 to o3-pro is bigger than you might have thought. We tested o3-pro on Tetris and Sokoban: it achieved SOTA on both and outperformed its previous self by a big margin. 🔍 🧱 Tetris Update. o3-pro: ✅ 8+ lines
Latest benchmarking results of Claude 4 on games 👇👇
[Lmgame Bench] 🎮 New Benchmark Results: Claude-Sonnet-4 and Claude-Opus-4. You asked—we delivered. We tested both models on 5 classic games: 2048, Candy Crush, Sokoban, Tetris, and Ace Attorney. Claude-Opus-4 stands out in Sokoban and Ace Attorney, outperforming Claude-Sonnet-4.
Wondering how the latest open-weight Qwen3 and DeepSeek-R1-0528 perform on games? Check this thread out. Also, stay tuned for a new release of our game benchmark soon. 🧑🍳👩🍳👨🍳
🔧🤖 A new wave of open-source LLMs like DeepSeek-R1-0528 and Qwen3-235B-A22B is leveling up with stronger agentic performance. We test them in head-to-head gameplay: the upgraded DeepSeek-R1-0528 outsmarts strong reasoning models like o4-mini across several games and it nearly
Always inspiring to watch @istoica05 predicting the future! 😃👍
🧵 Just spent an hour with Ion Stoica @istoica05 (Berkeley prof, Databricks/Anyscale co-founder) discussing the future of AI. His insights on execution, China's AI advantage, and what young founders should build next are 🔥. Thread with the best takes 👇
Check out Shift Parallelism, which we developed at Snowflake!
Excited to open-source Shift Parallelism, developed at @Snowflake AI Research for LLM inference! With it, Arctic Inference + @vllm_project delivers: 🚀 3.4x faster e2e latency & 1.06x higher throughput. 🚀 1.7x faster generation & 2.25x lower response time. 🚀 16x higher throughput
RT @PY_Z001: I will be giving a talk in @GPU_MODE tomorrow (May 31 12pm PST) about FastVideo/STA/VSA. Come if you're interested! https://…
RT @Snowflake: Solving real enterprise AI pain points! Our AI Research just shared two impactful new open-source efforts: ➡️ Arctic-Text2S…
RT @mbzuai: An exceptional morning at #IFMLaunch! From @EricXing's vision for world models to @YejinChoinka's insights on "bending scaling…
Looking forward to seeing Chatbot Arena move to the next chapter!
📢 We’re excited to share that we’ve raised $100M in seed funding to support LMArena and continue our research on reliable AI. Led by @a16z and UC Investments (@UofCalifornia), we're proud to have the support of those that believe in both the science and the mission. We’re
RT @tqchenml: #MLSys2025 make sure to attend 10:30am keynote @istoica05 An AI stack: from scaling AI workloads to evaluating LLMs. Checkou…
FastVideo v1 is here! 🎬 Our FastVideo team has been working hard and cooking up something new ☕️☕️: a unified, programmable API for video generation that simplifies model authoring and integrates various DiT-related optimizations. We hope to make video generation as seamless.
Announcing FastVideo V1, a unified framework for accelerating video generation. FastVideo V1 offers:
- A simple, consistent Python API
- State-of-the-art model performance optimizations
- Optimized implementations of popular models
Blog:
Was casually chatting with a few buddies at snow the other day and realized that @Snowflake might just have the best text2sql team and capabilities on the planet NOW? 😎😀🔥. ✅ #1 on BIRD (single-model, an extremely competitive benchmark) — with our own post-trained.
🚀 Big news! Our collab w/ Snowflake, UCSD & UMD topped the BIRD leaderboard — beating prior SOTA by 2.8% in Text-to-SQL reasoning! RL was tough, but worth it. 📢 Best model coming soon. #AI #LLM #TextToSQL #ReinforcementLearning #Snowflake #UCSD #UMD #NLP #BIRDLeaderboard
RT @PyTorch: PyTorch Foundation has expanded into an umbrella foundation. @vllm_project and @DeepSpeedAI have been accepted as hosted proje…