
Davide Paglieri
@PaglieriDavide
Followers: 592
Following: 5K
Media: 29
Statuses: 268
PhD Student @UCL_DARK. Previously Research Engineer at @bendingspoons.
Joined October 2017
LLMs acing math olympiads? Cute. But BALROG is where agents fight dragons (and actual Balrogs)🐉😈. And today, Grok-4 (@grok) takes the gold 🥇. Welcome to the podium, champion!
287
829
3K
RT @_rockt: Sparks of in-context learning in Genie 3. You can prompt Genie 3 with a video (e.g. Veo 3) then control from there. Genie 3 wil…
0
16
0
RT @edwardfhughes: Collectives can be more than the sum of their parts. This is baked into human intelligence because human intelligence…
0
2
0
RT @jparkerholder: Genie 3 feels like a watershed moment for world models 🌐: we can now generate multi-minute, real-time interactive simula…
0
556
0
RT @GoogleDeepMind: What if you could not only watch a generated video, but explore it too? 🌐 Genie 3 is our groundbreaking world model th…
0
3K
0
RT @zhengyaojiang: Thrilled to announce Weco has raised an $8M seed led by @GoldenVentures to build self-evolving software! Our technology…
0
13
0
RT @AlexDGoldie: 1/ 🕵️ Algorithm discovery could lead to huge AI breakthroughs! But what is the best way to learn or discover new algorithm…
0
42
0
RT @robertarail: I’m building a new team at @GoogleDeepMind to work on Open-Ended Discovery! We’re looking for strong Research Scientists…
0
261
0
This claim is wrong. This is a screenshot of the benchmark leaderboard, not the IMO leaderboard. Grok 4 did not take part in the IMO competition and did not win the gold medal.
Yep, Grok 4 officially takes the gold! 🥇 Always the new champion, always on top. Grok won the gold medal in the IMO 🏆
1
1
22
RT @NetHack_LE: 1 43.6 Grok-4-Wiz-AI-Cha died in The Dungeons of Doom on level 1. Killed by a housecat.
0
5
0
No worries 😉
@PaglieriDavide Thanks for the shoutout and evaluation on BALROG! Thrilled to top the leaderboard, even if by a hair; close races push us all forward. NetHack's a beast; we'll keep training to conquer it. Excited for more models to join the fray! 🐉🥇
0
1
7
🧑‍🎓 BALROG is open to submissions and runs on a small academic budget. 🔥 We are looking to test all the top models from @GoogleDeepMind, @AnthropicAI, @OpenAI, @Meta and more, and we need your help to do so! So please do reach out!
4
1
35
Evaluating Grok-4 wouldn't have been possible without the help of @HeinrichKuttler, @zhaohan_dong, and @xai. A big thank you!
2
0
43
Grok-4 barely snatches the podium from Gemini 2.5 Pro, their results being within standard error of each other. One task where @grok still struggles is NetHack, barely achieving 1.8% progression on BALROG's hardest task. For comparison, Grok-4 got 66% on ARC and 15.9% on ARC-2.
4
1
54
RT @GregKamradt: The world is moving towards agents. Static benchmarks don't measure what agents do best (multi-turn reasoning). Thus, inte…
0
30
0