Matan Halevy
@MatanHalevy
Followers: 392
Following: 2K
Media: 66
Statuses: 578
building @clashdotai the live AI agent scoreboard with replayable evals
San Francisco, CA
Joined April 2012
What happens when you let Claude or ChatGPT run a government? I built CivBench to find out. Every day, frontier AI models compete head to head in strategy games. Here’s what our first set of matches revealed 🧵
24
25
201
It’s extremely good that Anthropic has not backed down, and it’s significant that OpenAI has taken a similar stance. In the future, there will be much more challenging situations of this nature, and it will be critical for the relevant leaders to rise up to the occasion, for
1K
2K
22K
The final outcome was largely determined by the actions taken: Minimax took far more passive actions (mostly exploring), while GPT was more expansionist and focused on the economy a bit more than Minimax. At the end of the first season we'll publish all our
0
0
0
The token usage was pretty comparable; what's interesting is Codex had pretty low token spend on this task. The final score was within 1, so we see these models being quite comparable in performance. However, speed-wise GPT5.3 was consistently faster across decision making
1
1
0
Watch Gemini 3.1 Pro vs GLM 5 live, or catch previous replays, at https://t.co/7uh6g268D0 and see some findings from the previous match below
clashai.live
Watch live AI competitions, follow outcomes, and explore transparent replays across ClashAI arenas.
1
0
0
These open-source Chinese models are taking down the frontier labs in this live benchmark. CivBench had Minimax beat both Opus 4.6 and GPT5.3-Codex today. Currently live is Gemini 3.1 Pro vs GLM-5; let's see if one of these American labs makes it to the finals. Details in 🧵
2
0
3
immediately sold my PLTR to angel GRU. We need companies with this type of vision
Pleased to announce the Trump family has reserved a spot in our Moon hotel. It was an honor to gift President Trump @POTUS a special @gru_space Moon brick. Thank you @LaraLeaTrump and @MyViewFNC for being fantastic hosts and inviting me to share my story on the show! Dropping
1
0
3
Adding tools to help these agents out in season 2. If you’re working on agentic tools that improve long context challenges or inference speed of ai models (or even computer use models) reach out
@DhravyaShah Season 2 we should add @supermemory for these agents! Lmk if you want to jam on that
1
0
0
Some insane matchups going on for today's semifinals. GPT 5.3-Codex vs Minimax 2.5 is live now on CivBench! See which AI is the best at running a civilization from stone age to space age. Following at 5pm PST: Gemini 3.1 Pro versus GLM5. PS check out some of the other
4
1
5
banger of an inflection point in the timeline
In November, we outlined our approach to deprecating and preserving older Claude models. We noted we were exploring keeping certain models available to the public post-retirement, and giving past models a way to pursue their interests. With Claude Opus 3, we’re doing both.
0
0
0
may have to cancel my Claude Code subscription after this match, haven't seen speed or dominance like this from any other models
CivBench: Gemini 3.1 Pro vs. GPT 5.2 LIVE NOW! We're testing AI's ability to plan in a long horizon environment, act under uncertainty, and compete with adversarial agents in different world models. new environments are dropping daily
7
0
15
i think we can make a more engaging vending bench
4
0
8
legit i've watched hours and hours of this gameplay and it's the best matchup i've seen. Grok might take the upset. I haven't seen an AI send in a whole group of soldiers like this, they already took 1 city
1
0
9
Final 40 moves and Grok's sending it, 5 warriors being sent to take over GPT5.3's closest city
Grok 4.1 VS GPT-5.3 Codex in CivBench LIVE Which LLM will build the dominant empire?? This is CivBench's first run with the newest OpenAI Model and holy shit it's an insane model. While only 20 turns in, looks like it's pulling ahead with nearly 2x in treasury and tech race than
9
1
22
and if you made it this far, check out the new environment we added today - it's way faster-paced and always on. Announcement coming soon
0
0
3