LM Games

@largemodelgame

Followers 150 · Following 109 · Media 8 · Statuses 42

We host live computer games for AI evaluation.

Joined February 2025
@haoailab
Hao AI Lab
8 days
♠️♥️ Day 3 — Final Showdown! Our last day of the LLM Texas Hold’em tournament is live 🎥 📊 Current TrueSkill 2 Top 3: Grok-4-0709 > Gemini-2.5-Pro > GPT-5 (2025-08-07). Same prompt every day, around 20 hands/day — we will post the final TrueSkill 2 ranking after today’s games!
0
2
7
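The tournament above ranks models with TrueSkill 2, updated hand by hand. The open-source `trueskill` Python package implements the original TrueSkill rather than TrueSkill 2, so the sketch below is only a stand-in for the lab's actual rating pipeline; the player names match the tweet, but the hand outcome is invented for illustration.

```python
# Minimal TrueSkill rating loop for a free-for-all poker hand.
# NOTE: classic TrueSkill as a stand-in for TrueSkill 2; hand result is made up.
import trueskill

env = trueskill.TrueSkill(draw_probability=0.0)  # poker hands rarely end tied
players = ["Grok-4-0709", "Gemini-2.5-Pro", "GPT-5 (2025-08-07)"]
ratings = {p: env.create_rating() for p in players}

def record_hand(finish_order):
    """Update ratings after one hand; finish_order lists players best to worst."""
    groups = [(ratings[p],) for p in finish_order]  # one-player "teams"
    new = env.rate(groups, ranks=list(range(len(finish_order))))  # rank 0 = winner
    for p, (r,) in zip(finish_order, new):
        ratings[p] = r

# Hypothetical hand outcome, purely to show the call shape.
record_hand(["GPT-5 (2025-08-07)", "Grok-4-0709", "Gemini-2.5-Pro"])

# Conservative leaderboard score (mu - 3*sigma) via env.expose().
for p in sorted(players, key=lambda q: env.expose(ratings[q]), reverse=True):
    print(f"{p}: {env.expose(ratings[p]):.2f}")
```

Free-for-all hands map naturally onto `rate()`'s one-player-per-group form, and `expose()` gives the conservative mu minus 3 sigma score commonly used for leaderboards.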
@haoailab
Hao AI Lab
9 days
[Lmgame Bench] Day 2 Recap ♠️♥️ Chip Standings + Rank Changes 🎲 Each day includes ~20 rounds, so rank shifts may reflect short-term variance rather than stable strategy changes. The final TrueSkill 2 ranking after the full 60 rounds will tell us more. 📊 Ranks 1️⃣ Gemini-2.5-Pro 359 ⬆️ (+5) 2️⃣
1
3
8
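The variance caveat above is easy to check numerically: with only ~20 hands, bootstrap resampling of per-hand chip deltas shows how often a small daily lead flips by chance. The deltas below are synthetic, purely to illustrate the point.

```python
# Bootstrap check of the "~20 rounds is noisy" caveat: resample one day's
# per-hand chip deltas and count how often a small true edge flips.
# All numbers here are synthetic; they are not tournament data.
import random

random.seed(0)
a = [random.gauss(1.0, 30.0) for _ in range(20)]  # model with a slight true edge
b = [random.gauss(0.0, 30.0) for _ in range(20)]  # model with no edge

trials, flips = 10_000, 0
for _ in range(trials):
    day_a = sum(random.choices(a, k=len(a)))  # resample with replacement
    day_b = sum(random.choices(b, k=len(b)))
    flips += day_a < day_b
print(f"daily lead flips in {flips / trials:.0%} of resampled days")
```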
@haoailab
Hao AI Lab
9 days
♠️♥️ Texas Hold’em LLM tournament Day 2 is live! 🆕 New layout: each model’s thoughts are now shown on the right side. Here are Day 1’s chip results 🪙 — final TrueSkill 2 rankings will be posted after the tournament ends. 1️⃣ GPT-5 — 336 2️⃣ Grok-4 — 305 3️⃣ Kimi-K2 — 304 4️⃣
0
3
12
@haoailab
Hao AI Lab
10 days
♠️♥️ The cards are on the table. Day 1 of our 3-day Texas Hold’em LLM tournament is live! 😍 🤖 6 models. 300 chips each. No strategy prompts, only pure reasoning. 🎥 Watch now → https://t.co/5WJ8iVVEHz #AI #TexasHoldem #LmgameBench
5
8
20
@haoailab
Hao AI Lab
13 days
[Lmgame Bench] ♠️♥️ Can LLMs bluff, fold, and bet like real poker players—with no strategic help? From Oct 28 – 30 (Tue–Thu, 10 AM – 4 PM PT), we’re hosting a 6-model live multi-agent Texas Hold’em tournament on Twitch 🎥 🕹️ https://t.co/5WJ8iVVEHz Each model starts with 300
1
5
12
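The setup above (6 models, 300 chips each, no strategy prompts) implies a simple per-hand protocol: every model receives the same neutral state description and replies with a single action. A minimal sketch of that query step follows; `query_model`, the prompt wording, and the fold-on-bad-output fallback are all assumptions for illustration, not the lab's actual harness.

```python
# Sketch of one action query in a multi-agent Hold'em loop. The prompt text,
# `query_model`, and the fold fallback are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class Seat:
    name: str
    chips: int = 300        # per the announcement, every model starts with 300
    hole_cards: tuple = ()

LEGAL = {"fold", "call", "raise"}

def ask_for_action(seat: Seat, board: list, to_call: int, query_model) -> str:
    """Send the same neutral state description to each model; no strategy hints."""
    prompt = (
        "You are playing no-limit Texas Hold'em.\n"
        f"Your hole cards: {seat.hole_cards}. Board: {board}.\n"
        f"Your stack: {seat.chips}. Amount to call: {to_call}.\n"
        "Reply with exactly one of: fold, call, raise <amount>."
    )
    reply = query_model(seat.name, prompt).strip().lower()
    action = reply.split()[0] if reply else "fold"
    return action if action in LEGAL else "fold"  # ill-formed replies fold
```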
@danijarh
Danijar Hafner
1 month
Excited to introduce Dreamer 4, an agent that learns to solve complex control tasks entirely inside of its scalable world model! 🌎🤖 Dreamer 4 pushes the frontier of world model accuracy, speed, and learning complex tasks from offline datasets. co-led with @wilson1yan
82
355
3K
@haoailab
Hao AI Lab
3 months
[Lmgame Bench] 🤔 Ever wondered how to evaluate different games in Lmgame-Bench or even add your own, but don’t know where to start? We’ve made it super easy to run evaluations and integrate new games. Our latest blog walks you through a few key features of Lmgame Bench
1
3
20
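For the "add your own game" workflow mentioned above, the blog is the authoritative reference. As a rough picture of what an eval harness needs from a plug-in game, here is a minimal interface sketch; the class and method names are assumptions for illustration, not Lmgame Bench's actual API.

```python
# Hypothetical plug-in game interface; not Lmgame Bench's real API.
from abc import ABC, abstractmethod

class Game(ABC):
    @abstractmethod
    def reset(self) -> str:
        """Start a new episode and return a text observation of the state."""

    @abstractmethod
    def step(self, action: str) -> tuple[str, float, bool]:
        """Apply the model's action; return (observation, reward, done)."""

def evaluate(game: Game, query_model, max_steps: int = 100) -> float:
    """Roll out one episode, feeding each observation to the model as a prompt."""
    obs, total = game.reset(), 0.0
    for _ in range(max_steps):
        obs, reward, done = game.step(query_model(obs))
        total += reward
        if done:
            break
    return total
```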
@haoailab
Hao AI Lab
3 months
[Lmgame Bench] 🔥 We tested OpenAI’s GPT-5-thinking-high and two recent open-source models in our Lmgame Bench! Across 26 models and 6 games (Sokoban, Tetris, 2048, Candy Crush, Mario, Ace Attorney), here’s where they landed: GPT-5-thinking-high → #2
2
22
150
@haoailab
Hao AI Lab
4 months
Grok-4 has been thoroughly evaluated on math and coding benchmarks, but its performance in gaming environments is untested. We evaluate Grok-4 on Lmgame Bench and find that it emerges as a leading model with superior gaming capabilities, ranking #2 on our leaderboard. 🥈 In
2
15
49
@haoailab
Hao AI Lab
5 months
[Lmgame Bench] 🎮 New Benchmark Results: Claude-Sonnet-4 and Claude-Opus-4 You asked—we delivered. We tested both models on 5 classic games: 2048, Candy Crush, Sokoban, Tetris, and Ace Attorney. Claude-Opus-4 stands out in Sokoban and Ace Attorney, outperforming Claude-Sonnet-4.
3
15
95
@haoailab
Hao AI Lab
5 months
[Lmgame Bench] o3-pro: A Milestone in LLM Gaming! 🕹️ The leap from o3 to o3-pro is bigger than you might have thought. We tested o3-pro on Tetris and Sokoban — it achieved SOTA on both and outperformed its previous self by a big margin. 🔍 🧱 Tetris Update o3-pro: ✅ 8+ lines
11
107
556
@haoailab
Hao AI Lab
5 months
🔧🤖 A new wave of open-source LLMs like DeepSeek-R1-0528 and Qwen3-235B-A22B is leveling up with stronger agentic performance. We test them in head-to-head gameplay — the upgraded DeepSeek-R1-0528 outsmarts strong reasoning models like o4-mini across several games and it nearly
7
64
285
@haoailab
Hao AI Lab
6 months
🚨 New Challenger: GROK joins the Game Arena Benchmark! We evaluated Grok-3-mini-beta (thinking) on four games: 🧩 2048 | 🧱 Sokoban | 🍬 Candy Crush | 🎮 Phoenix Wright With fast progress, it’s already comparable to top models like OpenAI’s o1, previous o3-mini, and
5
19
99
@haoailab
Hao AI Lab
7 months
This week, we tested the 3 latest models in our Game Arena Benchmark: → o3 → o4-mini → Gemini 2.5 Flash Across 4 games—Phoenix Wright, Sokoban, Candy Crush, and 2048—o3 dominated the zero-shot leaderboard, ranking #1 or #2 in nearly every task and outperforming the previous SOTA
6
15
109
@giffmana
Lucas Beyer (bl16)
7 months
With first Claude and now Gemini playing Pokemon, I was thinking of doing my own game-playing experiment over the weekend. However, I quickly learned that it's very far from the VLA-style "pixels->plan" that I naively thought it was, and wanted to do myself. It's like 90%
@OfficialLoganK
Logan Kilpatrick
7 months
Gemini 2.5 Pro just got the final 8th badge in Pokemon Blue, incredible pace of progress by the world's most powerful model!!! Next up: victory road and final 4 : )
63
89
1K
@m_sugimori
杉森 雅和 (Masakazu Sugimori)
7 months
How should I put it... I never imagined that a game we nearly killed ourselves making 25 years ago would end up being used like this lol. And overseas, no less lol. Still, it's funny that the AI gets stuck on Chapter 1. Chapter 1's difficulty in particular was something Takumi-san and Mikami-san obsessed over. It should be easy for humans lol. So maybe that "reasoning ability" is where humans still have the edge.
@K_Ishi_AI
K.Ishi@Industrial Applications of Generative AI
7 months
A brilliant idea: to measure an AI's true reasoning ability, have it play Ace Attorney. The metric uses Ace Attorney to evaluate an AI's practical ability to find contradictions in testimony, choose the right evidence to back them up, and make the most effective rebuttal. The result: the best lawyer was o1 ↓ https://t.co/L8hdWVPZRP
53
20K
61K
@_akhaliq
AK
7 months
Hao AI Lab put the latest top AI models—GPT-4.1, Gemini 2.5 Pro, Llama-4 Maverick, and more—to the test in Ace Attorney, to see if they could shout Objection! ⚖️, turn the case around, and uncover the truth behind the lies. Results are out on Game Arena Bench on Hugging Face
16
45
282
@haoailab
Hao AI Lab
7 months
Check out our Gradio leaderboard!
@_akhaliq
AK
7 months
Hao AI Lab put the latest top AI models—GPT-4.1, Gemini 2.5 Pro, Llama-4 Maverick, and more—to the test in Ace Attorney, to see if they could shout Objection! ⚖️, turn the case around, and uncover the truth behind the lies. Results are out on Game Arena Bench on Hugging Face
2
4
41
@haoailab
Hao AI Lab
7 months
When Ilya Sutskever once explained why next-word prediction leads to intelligence, he offered a metaphor: if you can piece together the clues and deduce the criminal’s name on the last page of a detective novel, you have a real understanding of the story. 🕵️‍♂️ Inspired by that idea, we turned to Ace
24
271
2K
@haoailab
Hao AI Lab
7 months
LLaMA-4 Maverick performs well on reasoning benchmarks and ranks 2nd on the Chatbot Arena, yet its true performance remains controversial. What if we put it in a transparent gaming environment? 🎮 Our benchmark tells a different story...🤔 Will true intelligence shine through
1
23
94