LM Games
@largemodelgame
Followers 150 · Following 109 · Media 8 · Statuses 42
We host live computer games for AI evaluations.
Joined February 2025
♠️♥️ Day 3 — Final Showdown! Our last day of the LLM Texas Hold’em tournament is live 🎥 📊 Current TrueSkill 2 top 3: Grok-4-0709 > Gemini-2.5-Pro > GPT-5 (2025-08-07). Same prompt every day — around 20 hands/day. We will post the final TrueSkill 2 ranking after today’s games!
💬 0 · 🔁 2 · ❤️ 7
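For anyone curious how standings like the ones above are computed: TrueSkill 2 itself is proprietary to Microsoft, but the open-source `trueskill` Python package implements the original TrueSkill and updates ratings the same basic way from per-hand finishing orders. Below is a minimal sketch under that substitution; the hand results and the last two model names are illustrative placeholders, not real tournament data.

```python
# Minimal ranking sketch using the open-source `trueskill` package.
# Note: classic TrueSkill, not TrueSkill 2 (which is proprietary);
# hand results and the last two model names are placeholders.
import trueskill

env = trueskill.TrueSkill(draw_probability=0.0)  # poker hands rarely draw

models = ["Grok-4-0709", "Gemini-2.5-Pro", "GPT-5", "Kimi-K2",
          "Model-E", "Model-F"]  # last two entrants not named in the thread
ratings = {m: env.create_rating() for m in models}

# Each hand is recorded as a finishing order, best first (placeholder data).
hands = [
    ["Grok-4-0709", "GPT-5", "Gemini-2.5-Pro", "Kimi-K2", "Model-E", "Model-F"],
    ["Gemini-2.5-Pro", "Grok-4-0709", "Kimi-K2", "Model-F", "GPT-5", "Model-E"],
]

for order in hands:
    groups = [(ratings[m],) for m in order]     # one single-player "team" each
    updated = env.rate(groups, ranks=list(range(len(order))))  # rank 0 = winner
    for m, (r,) in zip(order, updated):
        ratings[m] = r

# A common leaderboard score is the conservative estimate mu - 3*sigma.
for m, r in sorted(ratings.items(),
                   key=lambda kv: kv[1].mu - 3 * kv[1].sigma, reverse=True):
    print(f"{m:16s} mu={r.mu:6.2f} sigma={r.sigma:5.2f}")
```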
[Lmgame Bench] Day 2 Recap ♠️♥️ Chip Standings + Rank Changes 🎲 Each day includes ~20 rounds, so rank shifts may reflect short-term variance rather than stable strategy changes. The final TrueSkill 2 ranking after the full 60 rounds will tell more. 📊 Ranks 1️⃣ Gemini-2.5-Pro 359 ⬆️ (+5) 2️⃣
💬 1 · 🔁 3 · ❤️ 8
♠️♥️ Texas Hold’em LLM tournament Day 2 is live! 🆕 New layout: each model’s thoughts are now shown on the right side. Here are Day 1’s chip results 🪙 — final TrueSkill 2 rankings will be posted after the tournament ends. 1️⃣ GPT-5 — 336 2️⃣ Grok-4 — 305 3️⃣ Kimi-K2 — 304 4️⃣
💬 0 · 🔁 3 · ❤️ 12
♠️♥️ The cards are on the table. Day 1 of our 3-day Texas Hold’em LLM tournament is live! 😍 🤖 6 models. 300 chips each. No strategy prompts, only pure reasoning. 🎥 Watch now → https://t.co/5WJ8iVVEHz
#AI #TexasHoldem #LmgameBench
💬 5 · 🔁 8 · ❤️ 20
[Lmgame Bench] ♠️♥️ Can LLMs bluff, fold, and bet like real poker players—with no strategic help? From Oct 28–30 (Tue–Thu, 10 AM–4 PM PT), we’re hosting a 6-model live multi-agent Texas Hold’em tournament on Twitch 🎥 🕹️ https://t.co/5WJ8iVVEHz Each model starts with 300
💬 1 · 🔁 5 · ❤️ 12
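What “no strategy prompts, only pure reasoning” means in practice is that the harness passes each model nothing but the game state and its legal actions. Here is a hypothetical sketch of such a driver loop; `query_model`, the fixed bet size, and the showdown stub are illustrative stand-ins, not the actual Lmgame Bench implementation.

```python
# Hypothetical tournament-driver sketch: six models, 300 chips each,
# prompts contain only game state and legal actions (no strategic hints).
# query_model() and the showdown logic are stand-ins, not Lmgame Bench code.
import random

STARTING_CHIPS = 300
MODELS = ["model-a", "model-b", "model-c", "model-d", "model-e", "model-f"]

def query_model(name: str, observation: str) -> str:
    """Stand-in for an LLM API call; returns one legal action."""
    return random.choice(["fold", "call", "raise 10"])

def play_hand(chips: dict) -> None:
    pot, contenders = 0, []
    for m in (m for m in MODELS if chips[m] > 0):
        obs = f"chips={chips[m]} pot={pot} legal=fold/call/raise"
        action = query_model(m, obs)
        if action == "fold":
            continue
        amount = int(action.split()[1]) if action.startswith("raise") else 10
        bet = min(amount, chips[m])          # can't bet more than you hold
        chips[m] -= bet
        pot += bet
        contenders.append(m)
    if contenders:                            # showdown stub: random winner
        chips[random.choice(contenders)] += pot

chips = {m: STARTING_CHIPS for m in MODELS}
for _ in range(20):                           # ~20 hands per day
    play_hand(chips)
print(chips)
```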
Excited to introduce Dreamer 4, an agent that learns to solve complex control tasks entirely inside its scalable world model! 🌎🤖 Dreamer 4 pushes the frontier of world-model accuracy, speed, and learning complex tasks from offline datasets. Co-led with @wilson1yan
💬 82 · 🔁 355 · ❤️ 3K
[Lmgame Bench] 🤔 Ever wondered how to evaluate different games in Lmgame-Bench, or even add your own, but didn’t know where to start? We’ve made it super easy to run evaluations and integrate new games. Our latest blog walks you through a few key features of Lmgame Bench
💬 1 · 🔁 3 · ❤️ 20
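The gist of integrating a new game into a harness like this is implementing a small environment interface and letting the runner drive the loop. Since the post doesn’t show the real API, here is a hypothetical sketch of what such a pluggable interface typically looks like; `GameEnv`, `GuessNumber`, and `run_episode` are illustrative names, not Lmgame-Bench’s actual ones.

```python
# Hypothetical game-integration sketch; class and method names are
# illustrative, not the actual Lmgame-Bench API.
import random
from abc import ABC, abstractmethod

class GameEnv(ABC):
    """Minimal contract a new game implements for the eval harness."""

    @abstractmethod
    def reset(self) -> str:
        """Start a new episode; return the initial observation as text."""

    @abstractmethod
    def legal_actions(self) -> list:
        """Actions the model may choose from at the current state."""

    @abstractmethod
    def step(self, action: str) -> tuple:
        """Apply an action; return (observation, reward, done)."""

class GuessNumber(GameEnv):
    """Toy game: guess a hidden number in [1, 100] within 10 turns."""

    def reset(self) -> str:
        self.target = random.randint(1, 100)
        self.turns = 0
        return "Guess a number between 1 and 100."

    def legal_actions(self) -> list:
        return [str(n) for n in range(1, 101)]

    def step(self, action: str) -> tuple:
        self.turns += 1
        guess = int(action)
        if guess == self.target:
            return "Correct!", 1.0, True
        hint = "higher" if guess < self.target else "lower"
        return f"Try {hint}.", 0.0, self.turns >= 10

def run_episode(env: GameEnv, agent) -> float:
    """Harness loop: feed observations to an agent until the game ends."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(agent(obs, env.legal_actions()))
        total += reward
    return total

# Usage: a random agent stands in here; in the real harness the agent
# would be an LLM call that maps (observation, legal actions) to an action.
print(run_episode(GuessNumber(), lambda obs, acts: random.choice(acts)))
```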
Grok-4 has been thoroughly evaluated on math and coding benchmarks, but its performance in gaming environments was untested. We evaluated Grok-4 on Lmgame Bench and found that it emerges as a leading model with superior gaming capabilities, ranking #2 on our leaderboard. 🥈 In
💬 2 · 🔁 15 · ❤️ 49
[Lmgame Bench] 🎮 New Benchmark Results: Claude-Sonnet-4 and Claude-Opus-4 You asked—we delivered. We tested both models on 5 classic games: 2048, Candy Crush, Sokoban, Tetris, and Ace Attorney. Claude-Opus-4 stands out in Sokoban and Ace Attorney, outperforming Claude-Sonnet-4.
💬 3 · 🔁 15 · ❤️ 95
[Lmgame Bench] o3-pro: A Milestone in LLM Gaming! 🕹️ The leap from o3 to o3-pro is bigger than you might have thought. We tested o3-pro on Tetris and Sokoban—it achieved SOTA on both and outperformed its previous self by a big margin. 🔍 🧱 Tetris Update o3-pro: ✅ 8+ lines
💬 11 · 🔁 107 · ❤️ 556
🔧🤖 A new wave of open-source LLMs like DeepSeek-R1-0528 and Qwen3-235B-A22B is leveling up with stronger agentic performance. We test them in head-to-head gameplay — the upgraded DeepSeek-R1-0528 outsmarts strong reasoning models like o4-mini across several games, and it nearly
💬 7 · 🔁 64 · ❤️ 285
🚨 New Challenger: Grok joins the Game Arena Benchmark! We evaluated Grok-3-mini-beta (thinking) on four games: 🧩 2048 | 🧱 Sokoban | 🍬 Candy Crush | 🎮 Phoenix Wright With fast progress, it’s already comparable to top models like OpenAI’s o1, the previous o3-mini, and
💬 5 · 🔁 19 · ❤️ 99
With first Claude and now Gemini playing Pokemon, I was thinking of doing my own game-playing experiment over the weekend. However, I quickly learned that it's very far from the VLA-style "pixels->plan" that I naively thought it was, and wanted to do myself. It's like 90%
Gemini 2.5 Pro just got the final 8th badge in Pokemon Blue, incredible pace of progress by the world's most powerful model!!! Next up: victory road and final 4 : )
💬 63 · 🔁 89 · ❤️ 1K
How should I put it... I never imagined that the game we nearly killed ourselves making 25 years ago would end up being used like this lol. And overseas, no less lol. Still, it’s funny that AI gets stuck on Chapter 1. The difficulty of Chapter 1, in particular, is something Takumi-san and Mikami-san were extremely particular about. It should be easy for humans lol. Maybe that “reasoning ability” really is humans’ strength.
A brilliant idea: to measure an AI’s true reasoning ability, just have it play Ace Attorney. This metric uses Ace Attorney to evaluate the practical ability to find contradictions in testimony, pick the right evidence to back them up, and rebut most effectively. The result: the best lawyer was o1 ↓ https://t.co/L8hdWVPZRP
💬 53 · 🔁 20K · ❤️ 61K
Hao AI Lab put the latest top AI models—GPT-4.1, Gemini 2.5 Pro, Llama-4 Maverick, and more—to the test in Ace Attorney, to see if they could shout Objection! ⚖️, turn the case around, and uncover the truth behind the lies. Results are out on Game Arena Bench on Hugging Face
💬 16 · 🔁 45 · ❤️ 282
Check out our Gradio leaderboard!
💬 2 · 🔁 4 · ❤️ 41
When Ilya Sutskever once explained why next-word prediction leads to intelligence, he used a metaphor: if you can piece together the clues and deduce the criminal’s name on the last page, you have a real understanding of the story. 🕵️♂️ Inspired by that idea, we turned to Ace
💬 24 · 🔁 271 · ❤️ 2K
LLaMA-4 Maverick performs well on reasoning benchmarks and ranks 2nd on the Chatbot Arena, yet its true performance remains controversial. What if we put it in a transparent gaming environment? 🎮 Our benchmark tells a different story...🤔 Will true intelligence shine through
💬 1 · 🔁 23 · ❤️ 94