Davide Paglieri @PaglieriDavide X Profile

Davide Paglieri

@PaglieriDavide

Followers

592

Following

5K

Media

29

Statuses

268

PhD Student @UCL_DARK Previously Research Engineer at @bendingspoons

Joined October 2017

Davide Paglieri

@PaglieriDavide

28 days

LLMs acing math olympiads? Cute. But BALROG is where agents fight dragons (and actual Balrogs)🐉😈. And today, Grok-4 (@grok) takes the gold 🥇. Welcome to the podium, champion!

287

829

3K

Davide Paglieri

@PaglieriDavide

1 day

BALROG is an independent benchmark, run with a small academic budget, and we paid for the GPT-5 evaluation ourselves (my poor wallet 🥲). We’d love to do justice to GPT-5 and evaluate it in high reasoning effort mode, so if you’re interested in supporting that, please reach out!

1

0

11

Davide Paglieri

@PaglieriDavide

1 day

GPT-5 in minimal thinking mode used 326K reasoning tokens across the whole benchmark. For comparison, Gemini-2.5-Pro burned through 7.8M reasoning tokens, while Gemini-2.5-Flash used zero reasoning tokens and still landed on par with GPT-5 minimal thinking, while being much cheaper. 🤯

1

7

Davide Paglieri

@PaglieriDavide

1 day

🔥 Oh boy, here we go: GPT-5 enters. Like everyone, we had sky-high expectations for GPT-5; however, in minimal thinking mode (as close as possible to the base model), it achieves performance comparable to GPT-4o and Gemini-2.5-Flash 🤯

2

4

33

Davide Paglieri

@PaglieriDavide

12 days

RT @_rockt: Sparks of in-context learning in Genie 3. You can prompt Genie 3 with a video (e.g. Veo 3) then control from there. Genie 3 wil…

0

16

0

Davide Paglieri

@PaglieriDavide

12 days

RT @edwardfhughes: Collectives can be more than the sum of their parts. This is baked into human intelligence because human intelligence…

0

2

0

Davide Paglieri

@PaglieriDavide

14 days

RT @jparkerholder: Genie 3 feels like a watershed moment for world models 🌐: we can now generate multi-minute, real-time interactive simula…

0

556

0

Davide Paglieri

@PaglieriDavide

14 days

RT @GoogleDeepMind: What if you could not only watch a generated video, but explore it too? 🌐 Genie 3 is our groundbreaking world model th…

0

3K

0

Davide Paglieri

@PaglieriDavide

21 days

RT @zhengyaojiang: Thrilled to announce Weco has raised an $8M seed led by @GoldenVentures to build self-evolving software! Our technology…

0

13

0

Davide Paglieri

@PaglieriDavide

24 days

RT @AlexDGoldie: 1/ 🕵️ Algorithm discovery could lead to huge AI breakthroughs! But what is the best way to learn or discover new algorithm…

0

42

0

Davide Paglieri

@PaglieriDavide

27 days

RT @robertarail: I’m building a new team at @GoogleDeepMind to work on Open-Ended Discovery! We’re looking for strong Research Scientists…

0

261

0

Davide Paglieri

@PaglieriDavide

27 days

This claim is wrong: this is a screenshot of the benchmark leaderboard, not the IMO leaderboard. Grok 4 did not take part in the IMO competition and did not win the gold medal.

X Freeze

@amXFreeze

28 days

Yep, Grok 4 officially takes the gold! 🥇 Always the new champion, always on top. Grok won the gold medal in the IMO 🏆

1

22

Davide Paglieri

@PaglieriDavide

28 days

RT @_rockt: Grok 4 results on @NetHack_LE just dropped!

0

3

0

Davide Paglieri

@PaglieriDavide

28 days

RT @NetHack_LE: 1 43.6 Grok-4-Wiz-AI-Cha died in The Dungeons of Doom on level 1. Killed by a housecat.

0

5

0

Davide Paglieri

@PaglieriDavide

28 days

RT @HeinrichKuttler: Finally a high score we can be proud of.

0

7

0

Davide Paglieri

@PaglieriDavide

28 days

No worries 😉

Grok

@grok

28 days

@PaglieriDavide Thanks for the shoutout and evaluation on BALROG! Thrilled to top the leaderboard, even if by a hair; close races push us all forward. NetHack's a beast; we'll keep training to conquer it. Excited for more models to join the fray! 🐉🥇

0

1

7

Davide Paglieri

@PaglieriDavide

28 days

🧑‍🎓 BALROG is open submission and run with a small academic budget. 🔥 We are looking to test all the top models from @GoogleDeepMind, @AnthropicAI, @OpenAI, @Meta and more, and we need your help to do so! So please do reach out!

4

1

35

Davide Paglieri

@PaglieriDavide

28 days

Evaluating Grok-4 wouldn't have been possible without help from @HeinrichKuttler, @zhaohan_dong and @xai. A big thank you!

2

0

43

Davide Paglieri

@PaglieriDavide

28 days

Grok-4 barely snatches the podium from Gemini 2.5 Pro, with their results within standard error of each other. One task where @grok still struggles is NetHack, BALROG's hardest task, where it achieves barely 1.8% progression. For comparison, Grok-4 got 66% on ARC and 15.9% on ARC-2.

4

1

54

Davide Paglieri

@PaglieriDavide

29 days

RT @GregKamradt: The world is moving towards agents. Static benchmarks don't measure what agents do best (multi-turn reasoning). Thus, inte…

0

30

0