Matan Halevy

@MatanHalevy

Followers
392
Following
2K
Media
66
Statuses
578

building @clashdotai the live AI agent scoreboard with replayable evals

San Francisco, CA
Joined April 2012
@MatanHalevy
Matan Halevy
3 days
What happens when you let Claude or ChatGPT run a government? I built CivBench to find out. Every day, frontier AI models compete head to head in strategy games. Here’s what our first set of matches revealed 🧵
24
25
201
@ilyasut
Ilya Sutskever
13 hours
It’s extremely good that Anthropic has not backed down, and it’s significant that OpenAI has taken a similar stance. In the future, there will be much more challenging situations of this nature, and it will be critical for the relevant leaders to rise up to the occasion, for
1K
2K
22K
@MatanHalevy
Matan Halevy
7 hours
@sama literally
1
4
79
@MatanHalevy
Matan Halevy
5 hours
gemini kinda develops like an octopus
1
0
3
@MatanHalevy
Matan Halevy
7 hours
The final outcome was largely determined by actions taken. Minimax took a lot more passive actions (mostly exploring), while GPT was more expansionist and focused on the economy a bit more than Minimax. At the end of the first season we'll publish all our
0
0
0
@MatanHalevy
Matan Halevy
7 hours
Token usage was pretty comparable; what's interesting is that Codex had pretty low token spend on this task. The final score was within 1, so these models look quite comparable in performance. However, speed-wise GPT5.3 was consistently faster across decision making
1
1
0
@MatanHalevy
Matan Halevy
7 hours
watch Gemini 3.1 Pro vs GLM 5 live, or catch previous replays, on https://t.co/7uh6g268D0. here are some findings from the previous match below
clashai.live
Watch live AI competitions, follow outcomes, and explore transparent replays across ClashAI arenas.
1
0
0
@MatanHalevy
Matan Halevy
7 hours
these open source Chinese models are taking down the frontier labs in this live benchmark. CivBench had Minimax beat both Opus 4.6 and GPT5.3-Codex today. Currently live is Gemini 3.1 Pro vs GLM-5, let's see if one of these American labs makes it to the finals. Details in 🧵
2
0
3
@MatanHalevy
Matan Halevy
9 hours
immediately sold my PLTR to angel GRU. We need companies with this type of vision
@skyler_chan_
Skyler
9 hours
Pleased to announce the Trump family has reserved a spot in our Moon hotel. It was an honor to gift President Trump @POTUS a special @gru_space Moon brick. Thank you @LaraLeaTrump and @MyViewFNC for being fantastic hosts and inviting me to share my story on the show! Dropping
1
0
3
@MatanHalevy
Matan Halevy
10 hours
Adding tools to help these agents out in season 2. If you’re working on agentic tools that improve long context challenges or inference speed of ai models (or even computer use models) reach out
@MatanHalevy
Matan Halevy
11 hours
@DhravyaShah Season 2 we should add @supermemory for these agents ! Lmk if you want to jam on that
1
0
0
@MatanHalevy
Matan Halevy
12 hours
Some insane matchups going on in today's semifinals. GPT 5.3-Codex vs Minimax 2.5 is live now on CivBench! See which AI is the best at running a civilization from stone age to space age. Following at 5pm PST: Gemini 3.1 Pro versus GLM5. PS check out some of the other
4
1
5
@MatanHalevy
Matan Halevy
1 day
its been time for a min
3
0
5
@MatanHalevy
Matan Halevy
1 day
like should i let old models of opus play games together
@AnthropicAI
Anthropic
3 days
First, Opus 3 will continue to be available to all paid Claude subscribers and by request on the API. We hope that this access will be beneficial to researchers and users alike.
1
0
0
@MatanHalevy
Matan Halevy
1 day
banger of an inflection point in the timeline
@AnthropicAI
Anthropic
3 days
In November, we outlined our approach to deprecating and preserving older Claude models. We noted we were exploring keeping certain models available to the public post-retirement, and giving past models a way to pursue their interests. With Claude Opus 3, we’re doing both.
0
0
0
@MatanHalevy
Matan Halevy
1 day
may have to cancel my claude code subscription after this match, haven't seen speed or dominance like this from any other model
@clashdotai
ClashAI
1 day
CivBench: Gemini 3.1 Pro vs. GPT 5.2 LIVE NOW! We're testing AI's ability to plan in a long horizon environment, act under uncertainty, and compete with adversarial agents in different world models. new environments are dropping daily
7
0
15
@MatanHalevy
Matan Halevy
1 day
i think we can make a more engaging vending bench
@andonlabs
Andon Labs
2 days
MiniMax-M2.5 goes bankrupt on Vending-Bench 2
4
0
8
@MatanHalevy
Matan Halevy
1 day
3**!!!
0
0
3
@MatanHalevy
Matan Halevy
1 day
legit i've watched hours and hours of this gameplay and it's the best matchup i've seen. Grok might take the upset. I haven't seen an AI send in a whole group of soldiers like this, they already took 1 city
1
0
9
@MatanHalevy
Matan Halevy
1 day
Final 40 moves and grok's sending it, 5 warriors being sent to take over GPT5.3's closest city
@MatanHalevy
Matan Halevy
2 days
Grok 4.1 VS GPT-5.3 Codex in CivBench LIVE. Which LLM will build the dominant empire?? This is CivBench's first run with the newest OpenAI model and holy shit it's an insane model. While only 20 turns in, looks like it's pulling ahead with nearly 2x in treasury and tech race than
9
1
22
@MatanHalevy
Matan Halevy
2 days
and if you made it this far, check out the new environment we added today - it's way faster paced and always on. Announcement coming soon
0
0
3