Mikel Bober-Irizar

@mikb0b

8K Followers · 6K Following · 78 Media · 1K Statuses

23 // Kaggle Competitions Grandmaster & ML/AI Researcher. Building video games @iconicgamesio, machine reasoning @Cambridge_CL, bioscience @ForecomAI.

London
Joined August 2011
@mikb0b
Mikel Bober-Irizar
7 months
Why do pre-o3 LLMs struggle with generalization tasks like @arcprize? It's not what you might think. OpenAI o3 shattered the ARC-AGI benchmark. But the hardest puzzles didn’t stump it because of reasoning, and this has implications for the benchmark as a whole. Analysis below 🧵
19 replies · 73 retweets · 668 likes
@mikb0b
Mikel Bober-Irizar
4 months
Really good to be back in SF for GDC (yes, our game is still cooking 👀). If you're around and want to meet up next week, let me know!
0 replies · 0 retweets · 6 likes
@mikb0b
Mikel Bober-Irizar
7 months
RT @GregKamradt: Seeing this chart go around a bunch, I think the main point is being missed. - “LLMs can’t solve large grids because of pe…
0 replies · 2 retweets · 0 likes
@mikb0b
Mikel Bober-Irizar
7 months
RT @simone_m_romeo: I recommend reading @mikb0b 's article on o3's performance on the ARC challenge. He proves that LLMs' struggle with ARC…
0 replies · 1 retweet · 0 likes
@mikb0b
Mikel Bober-Irizar
7 months
For a deeper analysis of why o3 did so much better than previous models, and the caveats that evaluation might carry, check out this thread!
[Quoted tweet: the “Why do pre-o3 LLMs struggle with generalization tasks like @arcprize?” thread above]
0 replies · 0 retweets · 5 likes
@mikb0b
Mikel Bober-Irizar
7 months
RT @olcan: more evidence (including experiments varying sizes of problems) that grid size alone plays a significant role in arc. this is ob…
0 replies · 1 retweet · 0 likes
@mikb0b
Mikel Bober-Irizar
7 months
When models can't understand the task format, the benchmark can mislead, introducing a hidden threshold effect. And if there's always a larger version that humans can solve but an LLM can't, what does this say about scaling to AGI? Read the article here:
10 replies · 11 retweets · 203 likes
@mikb0b
Mikel Bober-Irizar
7 months
So even if a model is capable of the reasoning and generalization required, it can still fail simply because it can't handle this many tokens. When testing o1-mini on an enlarged version of ARC, we observe an 80% drop in solved tasks - even though the solutions are the same.
7 replies · 4 retweets · 186 likes
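A minimal sketch of how such an enlargement could be built - the thread doesn't share its code, so the nearest-neighbour upscaling and the function name below are assumptions:

```python
# Sketch (assumption): enlarge an ARC grid by an integer factor using
# nearest-neighbour upscaling, so every pixel becomes a factor x factor
# block. The underlying rule stays identical; only the size grows.

def enlarge_grid(grid: list[list[int]], factor: int) -> list[list[int]]:
    out = []
    for row in grid:
        wide_row = [cell for cell in row for _ in range(factor)]
        out.extend(list(wide_row) for _ in range(factor))
    return out

# A 2x2 grid becomes 4x4 at factor=2 - same solution logic, 4x the pixels.
print(enlarge_grid([[1, 0], [0, 2]], 2))
# [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 2, 2], [0, 0, 2, 2]]
```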
@mikb0b
Mikel Bober-Irizar
7 months
LLMs get dramatically worse at ARC tasks as the grids get bigger. However, humans have no such issue - ARC task difficulty is independent of size. Most ARC tasks contain around 512-2048 pixels, and o3 is the first model capable of operating reliably on text grids of this size.
11 replies · 25 retweets · 310 likes
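To see why pixel count bites, here's a rough sketch of the usual way ARC grids are serialised into an LLM prompt (one digit per pixel, one line per row) - the exact prompt format used for o3 isn't public, so this is illustrative only:

```python
# Sketch: serialize an ARC grid to text the way many ARC prompts do
# (one digit per pixel, rows separated by newlines). Character count -
# and with it, token count - grows quadratically with side length.

def grid_to_text(grid: list[list[int]]) -> str:
    return "\n".join("".join(str(c) for c in row) for row in grid)

for side in (8, 16, 32):
    grid = [[0] * side for _ in range(side)]
    print(f"{side}x{side} grid -> {len(grid_to_text(grid))} characters")
# 8x8 grid -> 71 characters
# 16x16 grid -> 271 characters
# 32x32 grid -> 1055 characters
```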
@mikb0b
Mikel Bober-Irizar
7 months
The full post goes through all these examples, and I'd love to hear your thoughts and theories:
7 replies · 2 retweets · 69 likes
@mikb0b
Mikel Bober-Irizar
7 months
And on a couple of occasions, the model appears to give up on the 2nd attempt - in this case outputting a single black pixel. It's unclear whether OpenAI feeds in the previous attempt and tells the model it was wrong, or whether something else led to this behaviour.
3 replies · 2 retweets · 68 likes
@mikb0b
Mikel Bober-Irizar
7 months
In several cases, we see o3 struggle to output an aligned grid at all. Problems that require outputting many repeated rows seem to make o3 lose track.
6 replies · 6 retweets · 103 likes
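One way to quantify this failure mode is to check whether an answer even parses as a rectangular grid. A minimal sketch, assuming the digit-grid serialisation sketched earlier:

```python
# Sketch: check whether a model's raw text answer parses into an aligned
# (rectangular) grid. Assumes the digits-and-newlines format above; a
# ragged answer like the ones shown would fail this check.

def parse_grid(text: str) -> list[list[int]] | None:
    rows = [line for line in text.strip().splitlines() if line]
    if not rows or any(not line.isdigit() for line in rows):
        return None
    if len({len(line) for line in rows}) != 1:  # ragged rows -> invalid
        return None
    return [[int(c) for c in line] for line in rows]

print(parse_grid("010\n101\n010") is not None)  # True: aligned 3x3
print(parse_grid("010\n10\n0101") is not None)  # False: ragged rows
```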
@mikb0b
Mikel Bober-Irizar
7 months
Here's the post with all the examples and analysis: You might have seen this task today on x dot com as a failure! I actually think o3's answer is as valid as the ground truth here.
8 replies · 11 retweets · 200 likes
@mikb0b
Mikel Bober-Irizar
7 months
You've seen some of the puzzles o3 failed, but have you seen the attempts? Yesterday, @OpenAI's o3 dramatically beat the SOTA at @arcprize. But there were 34 tasks that even it couldn't solve with 16 hours of thinking. I've compiled and analyzed all of o3's mistakes below 🧵
35 replies · 157 retweets · 1K likes
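For context on what "couldn't solve" means: ARC-AGI scoring gives the model two attempts per test output, and a task counts only if one attempt exactly matches the ground-truth grid. A minimal sketch of that check, with made-up grids:

```python
# Sketch of ARC-style scoring: a test output counts as solved if any of
# the (up to two) attempts exactly matches the ground-truth grid.

def is_solved(attempts: list[list[list[int]]],
              truth: list[list[int]]) -> bool:
    return any(attempt == truth for attempt in attempts)

truth = [[1, 0], [0, 1]]
print(is_solved([[[1, 1], [0, 1]], [[1, 0], [0, 1]]], truth))  # True (2nd attempt)
print(is_solved([[[0]], [[0, 0], [0, 0]]], truth))             # False
```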
@mikb0b
Mikel Bober-Irizar
7 months
RT @MeganRisdal: Really great to meet and catch up with @mikb0b in person after many years! 😄
0 replies · 3 retweets · 0 likes
@mikb0b
Mikel Bober-Irizar
1 year
I'm heading back to San Francisco for @Official_GDC 🎮 - if anyone's around the Bay Area in late March and wants to meet up, let me know!
1 reply · 0 retweets · 8 likes
@mikb0b
Mikel Bober-Irizar
2 years
I'll be speaking at @NVIDIA's AI & DS Virtual Summit about the journey to becoming the youngest Kaggle Grandmaster, along with @Rob_Mulla and @kagglingdieter. 🔥 Come and join us for a live Q&A on Wednesday 9th at 12pm PT (for free!) @NVIDIAAI
1 reply · 13 retweets · 93 likes
@mikb0b
Mikel Bober-Irizar
2 years
I'm going to be in San Francisco in early November! ✈️ If anyone's in the Bay Area and wants to meet up, or knows any events I should check out, let me know! 😊
1 reply · 0 retweets · 2 likes
@mikb0b
Mikel Bober-Irizar
2 years
I've recently been playing with @fchollet's Abstraction and Reasoning Corpus, a really interesting benchmark for building systems that can reason. As part of that, I've just released a small 🐍 library for easily interacting with and visualising ARC:
2 replies · 38 retweets · 219 likes
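The library itself isn't shown in the tweet, but the public ARC data format it wraps is simple JSON. A generic sketch of loading and rendering one task with matplotlib - the file path is hypothetical, and this is not the library's own API:

```python
# Sketch: load a task from the public ARC JSON format and render its
# training grids with matplotlib. This shows the data format the
# library wraps; it is not the library's own API.
import json
import matplotlib.pyplot as plt

with open("task.json") as f:          # hypothetical path to one ARC task
    task = json.load(f)

# Each task holds "train" and "test" lists of {"input": grid, "output": grid},
# where a grid is a list of rows of colour indices 0-9.
pairs = task["train"]
fig, axes = plt.subplots(len(pairs), 2)
for row, pair in zip(axes.reshape(len(pairs), 2), pairs):
    row[0].imshow(pair["input"], cmap="tab10", vmin=0, vmax=9)
    row[1].imshow(pair["output"], cmap="tab10", vmin=0, vmax=9)
    for ax in row:
        ax.set_xticks([]); ax.set_yticks([])
plt.show()
```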
@mikb0b
Mikel Bober-Irizar
2 years
Really proud to be published in a Nature Portfolio journal for the first time! We set a new SOTA for single-cell protein localisation on the @ProteinAtlas, building on our work in the 2nd HPA Kaggle comp. @ForecomAI @cvssp_research @d_minskiy
3 replies · 5 retweets · 23 likes