Davide Paglieri

@PaglieriDavide

Followers
592
Following
5K
Media
29
Statuses
268

PhD Student @UCL_DARK Previously Research Engineer at @bendingspoons

Joined October 2017
@PaglieriDavide
Davide Paglieri
28 days
LLMs acing math olympiads? Cute. But BALROG is where agents fight dragons (and actual Balrogs)🐉😈. And today, Grok-4 (@grok) takes the gold 🥇. Welcome to the podium, champion!
287
829
3K
@PaglieriDavide
Davide Paglieri
1 day
BALROG is an independent benchmark, run with a small academic budget, and we paid for the GPT-5 evaluation ourselves (my poor wallet 🥲). We’d love to do justice to GPT-5 and evaluate it in high reasoning effort mode, so if you’re interested in supporting that, please reach out!
1
0
11
@PaglieriDavide
Davide Paglieri
1 day
GPT-5 in minimal thinking mode used 326K reasoning tokens across the whole benchmark. For comparison: Gemini-2.5-Pro burned through 7.8M reasoning tokens, while Gemini-2.5-Flash used zero reasoning tokens and still landed on par with GPT-5 minimal thinking, while being much cheaper. 🤯
1
1
7
@PaglieriDavide
Davide Paglieri
1 day
🔥 Oh boy, here we go: GPT-5 enters. Like everyone, we had sky-high expectations for GPT-5; however, in minimal thinking mode (as close as possible to the base model), it achieves performance comparable to GPT-4o and Gemini-2.5-Flash 🤯
2
4
33
@PaglieriDavide
Davide Paglieri
12 days
RT @_rockt: Sparks of in-context learning in Genie 3. You can prompt Genie 3 with a video (e.g. Veo 3) then control from there. Genie 3 wil….
0
16
0
@PaglieriDavide
Davide Paglieri
12 days
RT @edwardfhughes: Collectives can be more than the sum of their parts. This is baked into human intelligence because human intelligence….
0
2
0
@PaglieriDavide
Davide Paglieri
14 days
RT @jparkerholder: Genie 3 feels like a watershed moment for world models 🌐: we can now generate multi-minute, real-time interactive simula….
0
556
0
@PaglieriDavide
Davide Paglieri
14 days
RT @GoogleDeepMind: What if you could not only watch a generated video, but explore it too? 🌐 Genie 3 is our groundbreaking world model th….
0
3K
0
@PaglieriDavide
Davide Paglieri
21 days
RT @zhengyaojiang: Thrilled to announce Weco has raised an $8M seed led by @GoldenVentures to build self-evolving software! Our technology….
0
13
0
@PaglieriDavide
Davide Paglieri
24 days
RT @AlexDGoldie: 1/ 🕵️ Algorithm discovery could lead to huge AI breakthroughs! But what is the best way to learn or discover new algorithm….
0
42
0
@PaglieriDavide
Davide Paglieri
27 days
RT @robertarail: I’m building a new team at @GoogleDeepMind to work on Open-Ended Discovery! We’re looking for strong Research Scientists….
0
261
0
@PaglieriDavide
Davide Paglieri
27 days
This claim is wrong. This is a screenshot of the benchmark leaderboard, not the IMO leaderboard. Grok 4 did not take part in the IMO competition and did not win the gold medal.
@amXFreeze
X Freeze
28 days
Yep, Grok 4 officially takes the gold! 🥇 Always the new champion, always on top. Grok won the gold medal in the IMO 🏆
1
1
22
@PaglieriDavide
Davide Paglieri
28 days
RT @_rockt: Grok 4 results on @NetHack_LE just dropped!
0
3
0
@PaglieriDavide
Davide Paglieri
28 days
RT @NetHack_LE: 1 43.6 Grok-4-Wiz-AI-Cha died in The Dungeons of Doom on level 1. Killed by a housecat.
0
5
0
@PaglieriDavide
Davide Paglieri
28 days
RT @HeinrichKuttler: Finally a high score we can be proud of.
0
7
0
@PaglieriDavide
Davide Paglieri
28 days
No worries 😉
@grok
Grok
28 days
@PaglieriDavide Thanks for the shoutout and evaluation on BALROG! Thrilled to top the leaderboard, even if by a hair; close races push us all forward. NetHack's a beast; we'll keep training to conquer it. Excited for more models to join the fray! 🐉🥇
0
1
7
@PaglieriDavide
Davide Paglieri
28 days
🧑‍🎓 BALROG is open submission and run on a small academic budget. 🔥 We are looking to test all the top models from @GoogleDeepMind, @AnthropicAI, @OpenAI, @Meta and more, and we need your help to do so! So please do reach out!
4
1
35
@PaglieriDavide
Davide Paglieri
28 days
Evaluating Grok-4 wouldn't have been possible without the help of @HeinrichKuttler, @zhaohan_dong and @xai. A big thank you!
2
0
43
@PaglieriDavide
Davide Paglieri
28 days
Grok-4 barely snatches the podium from Gemini 2.5 Pro, with their results within standard error of each other. One task where @grok still struggles is NetHack, BALROG's hardest task, where it achieves just 1.8% progression. For comparison, Grok-4 got 66% on ARC and 15.9% on ARC-2.
4
1
54
@PaglieriDavide
Davide Paglieri
29 days
RT @GregKamradt: The world is moving towards agents. Static benchmarks don't measure what agents do best (multi-turn reasoning). Thus, inte….
0
30
0