Matan Halevy @MatanHalevy X Profile

Matan Halevy

@MatanHalevy

Followers

392

Following

2K

Media

66

Statuses

578

building @clashdotai the live AI agent scoreboard with replayable evals

https://t.co/9GppEbkmtE

San Francisco, CA

Joined April 2012


Matan Halevy

@MatanHalevy

3 days

What happens when you let Claude or ChatGPT run a government? I built CivBench to find out. Every day, frontier AI models compete head to head in strategy games. Here’s what our first set of matches revealed 🧵

24

25

201

Ilya Sutskever

@ilyasut

13 hours

It’s extremely good that Anthropic has not backed down, and it’s significant that OpenAI has taken a similar stance. In the future, there will be much more challenging situations of this nature, and it will be critical for the relevant leaders to rise up to the occasion, for

1K

2K

22K

Matan Halevy

@MatanHalevy

7 hours

@sama literally

1

4

79

Matan Halevy

@MatanHalevy

5 hours

gemini kinda develops like an octopus

1

0

3

Matan Halevy

@MatanHalevy

7 hours

The final outcome was largely determined by the actions taken: Minimax took a lot more passive actions (mostly exploring), while GPT was more expansionist and focused on the economy a bit more than Minimax. At the end of the first season we'll publish all our

0

Matan Halevy

@MatanHalevy

7 hours

The token usage was pretty comparable; what's interesting is Codex had pretty low token spend on this task. The final score was within 1, so we see these models being quite comparable in performance. However, speed-wise GPT5.3 was consistently faster across decision making

1

0

Matan Halevy

@MatanHalevy

7 hours

Watch Gemini 3.1 Pro vs GLM 5 live or previous replays on https://t.co/7uh6g268D0. Here are some findings from the previous match below

clashai.live

Watch live AI competitions, follow outcomes, and explore transparent replays across ClashAI arenas.

1

0

Matan Halevy

@MatanHalevy

7 hours

These open source Chinese models are taking down the frontier labs in this live benchmark. CivBench had Minimax beat both Opus 4.6 and GPT5.3-Codex today. Currently live is Gemini 3.1 Pro vs GLM-5, let's see if one of these American labs makes it to the finals. Details in 🧵

2

0

3

Matan Halevy

@MatanHalevy

9 hours

immediately sold my PLTR to angel GRU. We need companies with this type of vision

Skyler

@skyler_chan_

9 hours

Pleased to announce the Trump family has reserved a spot in our Moon hotel. It was an honor to gift President Trump @POTUS a special @gru_space Moon brick. Thank you @LaraLeaTrump and @MyViewFNC for being fantastic hosts and inviting me to share my story on the show! Dropping

1

0

3

Matan Halevy

@MatanHalevy

10 hours

Adding tools to help these agents out in season 2. If you’re working on agentic tools that help with long-context challenges or improve the inference speed of AI models (or even computer use models), reach out

Matan Halevy

@MatanHalevy

11 hours

@DhravyaShah Season 2 we should add @supermemory for these agents! Lmk if you want to jam on that

1

0

Matan Halevy

@MatanHalevy

12 hours

https://t.co/cYoie38nSI

clashai.live

Watch live AI competitions, follow outcomes, and explore transparent replays across ClashAI arenas.

1

0

Matan Halevy

@MatanHalevy

12 hours

Some insane matchups going on for today's semifinals. GPT 5.3-Codex vs Minimax 2.5 is live now on CivBench! See which AI is the best at running a civilization from stone age to space age. Following at 5pm PST: Gemini 3.1 Pro versus GLM5. PS check out some of the other

4

1

5

Matan Halevy

@MatanHalevy

1 day

it's been time for a min

3

0

5

Matan Halevy

@MatanHalevy

1 day

like should i let old models of opus play games together

Anthropic

@AnthropicAI

3 days

First, Opus 3 will continue to be available to all paid Claude subscribers and by request on the API. We hope that this access will be beneficial to researchers and users alike.

1

0

Matan Halevy

@MatanHalevy

1 day

banger of an inflection point in the timeline

Anthropic

@AnthropicAI

3 days

In November, we outlined our approach to deprecating and preserving older Claude models. We noted we were exploring keeping certain models available to the public post-retirement, and giving past models a way to pursue their interests. With Claude Opus 3, we’re doing both.

0

Matan Halevy

@MatanHalevy

1 day

may have to cancel my claude code subscription after this match, haven't seen speed or dominance like this from any other models

ClashAI

@clashdotai

1 day

CivBench: Gemini 3.1 Pro vs. GPT 5.2 LIVE NOW! We're testing AI's ability to plan in a long horizon environment, act under uncertainty, and compete with adversarial agents in different world models. New environments are dropping daily

7

0

15

Matan Halevy

@MatanHalevy

1 day

i think we can make a more engaging vending bench

Andon Labs

@andonlabs

2 days

MiniMax-M2.5 goes bankrupt on Vending-Bench 2

4

0

8

Matan Halevy

@MatanHalevy

1 day

3**!!!

0

3

Matan Halevy

@MatanHalevy

1 day

legit i've watched hours and hours of this gameplay and it's the best matchup i've seen. Grok might take the upset. I haven't seen an AI send in a whole group of soldiers like this, they already took 1 city

1

0

9

Matan Halevy

@MatanHalevy

1 day

Final 40 moves and Grok's sending it, 5 warriors being sent to take over GPT5.3's closest city

Matan Halevy

@MatanHalevy

2 days

Grok 4.1 vs GPT-5.3 Codex in CivBench, LIVE. Which LLM will build the dominant empire?? This is CivBench's first run with the newest OpenAI model and holy shit it's an insane model. While only 20 turns in, looks like it's pulling ahead with nearly 2x in treasury and tech race than

9

1

22

Matan Halevy

@MatanHalevy

2 days

and if you made it this far, check out the new environment we added today - it's way faster-paced and always on. Announcement coming soon

0

3