Mikel Bober-Irizar

@mikb0b

8K Followers · 6K Following · 78 Media · 1K Statuses

23 // Kaggle Competitions Grandmaster & ML/AI Researcher. Building video games @iconicgamesio, machine reasoning @Cambridge_CL, bioscience @ForecomAI.

London
Joined August 2011
@mikb0b
Mikel Bober-Irizar
7 months
Why do pre-o3 LLMs struggle with generalization tasks like @arcprize? It's not what you might think. OpenAI o3 shattered the ARC-AGI benchmark. But the hardest puzzles didn’t stump it because of reasoning, and this has implications for the benchmark as a whole. Analysis below 🧵
19 replies · 73 retweets · 668 likes
@mikb0b
Mikel Bober-Irizar
4 months
Really good to be back in SF for GDC (yes, our game is still cooking 👀). If you're around and want to meet up next week, let me know!
0 replies · 0 retweets · 6 likes
@mikb0b
Mikel Bober-Irizar
7 months
RT @GregKamradt: Seeing this chart go around a bunch, I think the main point is being missed. - “LLMs can’t solve large grids because of pe…
0 replies · 2 retweets · 0 likes
@mikb0b
Mikel Bober-Irizar
7 months
RT @simone_m_romeo: I recommend reading @mikb0b 's article on o3's performance on the ARC challenge. He proves that LLMs' struggle with ARC…
0 replies · 1 retweet · 0 likes
@mikb0b
Mikel Bober-Irizar
7 months
For a deeper analysis of why o3 did so much better than previous models, and the caveats that evaluation might carry, check out this thread!
[Quoted tweet: the “Why do pre-o3 LLMs struggle with generalization tasks like @arcprize?” thread above]
0 replies · 0 retweets · 5 likes
@mikb0b
Mikel Bober-Irizar
7 months
RT @olcan: more evidence (including experiments varying sizes of problems) that grid size alone plays a significant role in arc. this is ob…
0 replies · 1 retweet · 0 likes
@mikb0b
Mikel Bober-Irizar
7 months
When models can't understand the task format, the benchmark can mislead, introducing a hidden threshold effect. And if there's always a larger version that humans can solve but an LLM can't, what does this say about scaling to AGI? Read the article here:
10 replies · 11 retweets · 203 likes
@mikb0b
Mikel Bober-Irizar
7 months
So even if a model is capable of the reasoning and generalization required, it can still fail simply because it can't handle this many tokens. When testing o1-mini on an enlarged version of ARC, we observe an 80% drop in solved tasks - even though the solutions are the same.
7 replies · 4 retweets · 186 likes
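A minimal sketch of how such an enlargement could be built - the thread doesn't share its code, so the nearest-neighbour upscaling and the function name below are assumptions:

```python
# Sketch (assumption): enlarge an ARC grid by an integer factor using
# nearest-neighbour upscaling, so every pixel becomes a factor x factor
# block. The underlying rule stays identical; only the size grows.

def enlarge_grid(grid: list[list[int]], factor: int) -> list[list[int]]:
    out = []
    for row in grid:
        wide_row = [cell for cell in row for _ in range(factor)]
        out.extend(list(wide_row) for _ in range(factor))
    return out

# A 2x2 grid becomes 4x4 at factor=2 - same solution logic, 4x the pixels.
print(enlarge_grid([[1, 0], [0, 2]], 2))
# [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 2, 2], [0, 0, 2, 2]]
```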
@mikb0b
Mikel Bober-Irizar
7 months
LLMs get dramatically worse at ARC tasks as the grids get bigger. However, humans have no such issue - ARC task difficulty is independent of size. Most ARC tasks contain around 512-2048 pixels, and o3 is the first model capable of operating reliably on text grids of this size.
11 replies · 25 retweets · 310 likes
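To see why pixel count bites, here's a rough sketch of the usual way ARC grids are serialised into an LLM prompt (one digit per pixel, one line per row) - the exact prompt format used for o3 isn't public, so this is illustrative only:

```python
# Sketch: serialize an ARC grid to text the way many ARC prompts do
# (one digit per pixel, rows separated by newlines). Character count -
# and with it, token count - grows quadratically with side length.

def grid_to_text(grid: list[list[int]]) -> str:
    return "\n".join("".join(str(c) for c in row) for row in grid)

for side in (8, 16, 32):
    grid = [[0] * side for _ in range(side)]
    print(f"{side}x{side} grid -> {len(grid_to_text(grid))} characters")
# 8x8 grid -> 71 characters
# 16x16 grid -> 271 characters
# 32x32 grid -> 1055 characters
```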
@mikb0b
Mikel Bober-Irizar
7 months
The full post goes through all these examples, and I'd love to hear your thoughts and theories:
7 replies · 2 retweets · 69 likes
@mikb0b
Mikel Bober-Irizar
7 months
And on a couple of occasions, the model appears to give up on the 2nd attempt - in this case outputting a single black pixel. It's unclear whether OpenAI feeds in the previous attempt and tells the model it was wrong, or whether something else led to this behaviour.
3 replies · 2 retweets · 68 likes
@mikb0b
Mikel Bober-Irizar
7 months
In several cases, we see o3 struggle to output an aligned grid at all. Problems that require outputting many repeated rows seem to make o3 lose track.
6 replies · 6 retweets · 103 likes
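One way to quantify this failure mode is to check whether an answer even parses as a rectangular grid. A minimal sketch, assuming the digit-grid serialisation sketched earlier:

```python
# Sketch: check whether a model's raw text answer parses into an aligned
# (rectangular) grid. Assumes the digits-and-newlines format above; a
# ragged answer like the ones shown would fail this check.

def parse_grid(text: str) -> list[list[int]] | None:
    rows = [line for line in text.strip().splitlines() if line]
    if not rows or any(not line.isdigit() for line in rows):
        return None
    if len({len(line) for line in rows}) != 1:  # ragged rows -> invalid
        return None
    return [[int(c) for c in line] for line in rows]

print(parse_grid("010\n101\n010") is not None)  # True: aligned 3x3
print(parse_grid("010\n10\n0101") is not None)  # False: ragged rows
```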
@mikb0b
Mikel Bober-Irizar
7 months
Here's the post with all the examples and analysis: You might have seen this task today on x dot com as a failure! I actually think o3's answer is as valid as the ground truth here.
8 replies · 11 retweets · 200 likes
@mikb0b
Mikel Bober-Irizar
7 months
You've seen some of the puzzles o3 failed, but have you seen the attempts? Yesterday, @OpenAI's o3 dramatically beat the SOTA at @arcprize. But there were 34 tasks that even it couldn't solve with 16 hours of thinking. I've compiled and analyzed all of o3's mistakes below 🧵
35 replies · 157 retweets · 1K likes
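For context on what "couldn't solve" means: ARC-AGI scoring gives the model two attempts per test output, and a task counts only if one attempt exactly matches the ground-truth grid. A minimal sketch of that check, with made-up grids:

```python
# Sketch of ARC-style scoring: a test output counts as solved if any of
# the (up to two) attempts exactly matches the ground-truth grid.

def is_solved(attempts: list[list[list[int]]],
              truth: list[list[int]]) -> bool:
    return any(attempt == truth for attempt in attempts)

truth = [[1, 0], [0, 1]]
print(is_solved([[[1, 1], [0, 1]], [[1, 0], [0, 1]]], truth))  # True (2nd attempt)
print(is_solved([[[0]], [[0, 0], [0, 0]]], truth))             # False
```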
@mikb0b
Mikel Bober-Irizar
7 months
RT @MeganRisdal: Really great to meet and catch up with @mikb0b in person after many years! 😄
0 replies · 3 retweets · 0 likes
@mikb0b
Mikel Bober-Irizar
1 year
I'm heading back to San Francisco for @Official_GDC 🎮 - if anyone's around the Bay Area in late March and wants to meet up, let me know!
1 reply · 0 retweets · 8 likes
@mikb0b
Mikel Bober-Irizar
2 years
I'll be speaking at @NVIDIA's AI & DS Virtual Summit about the journey to becoming the youngest Kaggle Grandmaster, along with @Rob_Mulla and @kagglingdieter. 🔥 Come and join us for a live Q&A on Wednesday 9th at 12pm PT (for free!) @NVIDIAAI
1 reply · 13 retweets · 93 likes
@mikb0b
Mikel Bober-Irizar
2 years
I'm going to be in San Francisco in early November! ✈️ If anyone's in the Bay Area and wants to meet up, or knows any events I should check out, let me know! 😊
1 reply · 0 retweets · 2 likes
@mikb0b
Mikel Bober-Irizar
2 years
I've recently been playing with @fchollet's Abstraction and Reasoning Corpus, a really interesting benchmark for building systems that can reason. As part of that, I've just released a small 🐍 library for easily interacting with and visualising ARC:
2 replies · 38 retweets · 219 likes
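The library itself isn't shown in the tweet, but the public ARC data format it wraps is simple JSON. A generic sketch of loading and rendering one task with matplotlib - the file path is hypothetical, and this is not the library's own API:

```python
# Sketch: load a task from the public ARC JSON format and render its
# training grids with matplotlib. This shows the data format the
# library wraps; it is not the library's own API.
import json
import matplotlib.pyplot as plt

with open("task.json") as f:          # hypothetical path to one ARC task
    task = json.load(f)

# Each task holds "train" and "test" lists of {"input": grid, "output": grid},
# where a grid is a list of rows of colour indices 0-9.
pairs = task["train"]
fig, axes = plt.subplots(len(pairs), 2)
for row, pair in zip(axes.reshape(len(pairs), 2), pairs):
    row[0].imshow(pair["input"], cmap="tab10", vmin=0, vmax=9)
    row[1].imshow(pair["output"], cmap="tab10", vmin=0, vmax=9)
    for ax in row:
        ax.set_xticks([]); ax.set_yticks([])
plt.show()
```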
@mikb0b
Mikel Bober-Irizar
2 years
Really proud to be published in a Nature Portfolio journal for the first time! We set a new SOTA for single-cell protein localisation on the @ProteinAtlas, building on our work in the 2nd HPA Kaggle comp. @ForecomAI @cvssp_research @d_minskiy
3 replies · 5 retweets · 23 likes