ARC Prize @arcprize profile

ARC Prize

@arcprize

Followers

20K

Following

4K

Media

141

Statuses

436

A North Star for AGI. Co-founders: @fchollet @mikeknoop. President: @gregkamradt. Help support the mission - make a donation today.

Joined March 2024

ARC Prize

@arcprize

2 months

Today we are announcing ARC-AGI-2, an unsaturated frontier AGI benchmark that challenges AI reasoning systems (same relative ease for humans). Grand Prize: 85%, ~$0.42/task efficiency.
Current Performance:
* Base LLMs: 0%
* Reasoning Systems: <4%

66

344

2K

ARC Prize

@arcprize

5 months

New verified ARC-AGI-Pub SoTA! @OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation. And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval. 1/4

107

641

3K

ARC Prize

@arcprize

1 month

Clarifying o3’s ARC-AGI Performance. OpenAI has confirmed:
* The released o3 is a different model from what we tested in December 2024
* All released o3 compute tiers are smaller than the version we tested
* The released o3 was not trained on ARC-AGI data, not even the train.

37

83

1K

ARC Prize

@arcprize

4 months

Verified DeepSeek performance on ARC-AGI's Public Eval (400 tasks) + Semi-Private (100 tasks).
DeepSeek V3:
* Semi-Private: 7.3% ($.002)
* Public Eval: 14% ($.002)
DeepSeek Reasoner:
* Semi-Private: 15.8% ($.06)
* Public Eval: 20.5% ($.05)
(Avg $ per task)

19

107

1K

ARC Prize

@arcprize

3 months

Introducing SnakeBench, an experimental benchmark side quest. We made 50 LLMs battle each other in head-to-head snake 🐍. 2.8K matches showed which models are the best at snake real-time strategy and spatial reasoning. Here’s the top match between o3-mini and DeepSeek-R1. 🧵

44

149

1K

ARC Prize

@arcprize

2 months

Gemini-2.5-Pro Experimental Preview Results.
ARC-AGI-1:
* Public Eval: 24.3%
* Semi Private: 12.5%
ARC-AGI-2:
* Public Eval: 0.8%
* Semi Private: 1.3%
These results are on par with DeepSeek's R1.

27

73

1K

ARC Prize

@arcprize

1 month

o3 and o4-mini on ARC-AGI's Semi Private Evaluation.
* o3-medium scores 53% on ARC-AGI-1
* o4-mini shows state-of-the-art efficiency
* ARC-AGI-2 remains virtually unsolved (<3%)
Through analysis we highlight differences from o3-preview and other model behavior

39

126

1K

ARC Prize

@arcprize

9 months

We put OpenAI o1 to the test against ARC Prize. Results: both o1 models beat GPT-4o. And o1-preview is on par with Claude 3.5 Sonnet. Can chain-of-thought scale to AGI? What explains o1's modest scores on ARC-AGI? Our notes:

45

148

854

ARC Prize

@arcprize

4 months

o3-mini performance matches o1 on ARC-AGI-1 Semi-Private Test Set. Scores by reasoning effort:
> Low: 11% ($0.009/task)
> Med: 29% ($0.02/task)
> High: 35% ($0.04/task)

24

104

810

ARC Prize

@arcprize

3 months

GPT-4.5 Results on ARC-AGI. Semi Private Set (100 hold out tasks):
* Score: 10.33%
* Average Cost per Task: $0.29

43

69

829

ARC Prize

@arcprize

5 months

Verified o1 performance on ARC-AGI's Semi-Private Eval (100 tasks).
o1, Low: 25% ($1.5/task)
o1, Medium: 31% ($2.5/task)
o1, High: 32% ($3.8/task)

30

56

689

ARC Prize

@arcprize

3 months

AGI is reached when the capability gap between humans and computers is zero. ARC Prize Foundation measures this to inspire progress. Today we preview the unbeaten ARC-AGI-2 + open public donations to fund ARC-AGI-3. TY Schmidt Sciences (@ericschmidt) for $50k to kick us off!

24

69

686

ARC Prize

@arcprize

2 months

Llama 4 Maverick and Scout on ARC-AGI's Semi Private Evaluation.
Maverick:
* ARC-AGI-1: 4.38% ($0.0078/task)
* ARC-AGI-2: 0.00% ($0.0121/task)
Scout:
* ARC-AGI-1: 0.50% ($0.0041/task)
* ARC-AGI-2: 0.00% ($0.0062/task)

47

48

676

ARC Prize

@arcprize

7 months

New ARC-AGI high score! 53% (Prize goal: 85%) Congratulations, MindsAI!

40

45

595

ARC Prize

@arcprize

6 months

ARC Prize remains unbeaten. In 2024, SoTA moved from 33% to 55.5%. Announcing: ARC Prize 2024 Winners & Technical Report.

21

96

635

ARC Prize

@arcprize

3 months

Claude Sonnet 3.7 + Thinking 1/8/16K results.
- Base: 13.6%, $.05/task
- Thinking 1K: 11.6%, $.07/task
- Thinking 8K: 21.1%, $.21/task
- Thinking 16K: 28.6%, $.33/task
Performance is on par with o3-mini for slightly increased cost per task

28

58

640

ARC Prize

@arcprize

5 months

This performance on ARC-AGI highlights a genuine breakthrough in novelty adaptation. This is not incremental progress. We're in new territory. Is it AGI? o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence. 2/4

4

39

622

ARC Prize

@arcprize

1 month

GPT-4.1 on ARC-AGI's Semi Private Evaluation.
GPT-4.1:
* ARC-AGI-1: 5.5% ($0.039/task)
* ARC-AGI-2: 0.0% ($0.069/task)
GPT-4.1-Mini:
* ARC-AGI-1: 3.5% ($0.0078/task)
* ARC-AGI-2: 0.0% ($0.0139/task)
GPT-4.1-Nano:
* ARC-AGI-1: 0.0% ($0.0021/task)
* ARC-AGI-2: 0.0% ($0.0036/task)

20

58

600

ARC Prize

@arcprize

4 months

R1-Zero matches performance of R1 on ARC-AGI. We’ve verified that R1-Zero scored 14% on ARC-AGI-1 (vs 15% on R1). @mikeknoop explains why R1-Zero is more important than R1, why scaling inference isn’t going away, and what happens when “inference becomes training”. 1/4

10

69

575

ARC Prize

@arcprize

7 months

New ARC-AGI high score! 55.5% (Prize goal: 85%) Congratulations, MindsAI!

27

36

560

ARC Prize

@arcprize

7 months

New ARC-AGI high score! 54.5% (Prize goal: 85%) Congratulations, MindsAI!

24

34

535

ARC Prize

@arcprize

3 months

QwQ-32B on ARC-AGI.
* Public Eval: 11.25%, $0.036 per task
* Semi Private: 7.5%, $0.039 per task

10

40

521

ARC Prize

@arcprize

7 months

Claude 3.5 Sonnet (new) scores pass@1 20.3% on 400 ARC-AGI public eval tasks. Original 3.5 Sonnet: 21%.

37

29

504

ARC Prize

@arcprize

5 months

Today, alongside our analysis of o3's ARC-AGI-Pub performance, we're also releasing data (results, attempts, and prompt) from our high-compute testing. o3 was unable to solve a ~9% set of Public Eval tasks that are straightforward for humans. Curious to see why? We invite the.

20

55

476

ARC Prize

@arcprize

8 months

New ARC-AGI high score! 48% (Prize goal: 85%) Congratulations, MindsAI!

26

33

433

ARC Prize

@arcprize

3 months

Wow! One of our donors has anonymously decided to materially increase their support to $1M! This fully funds our 2025 goal in just 1 day. With this support, we’ll launch v2, build v3, and continue driving progress in measuring AGI

ARC Prize

@arcprize

3 months

We're not done - @bryanhelmig just pledged $15K to ARC Prize.

11

22

376

ARC Prize

@arcprize

1 year

Announcing ARC Prize. A $1M+ competition to beat the ARC-AGI benchmark and open source the solution. Hosted by @mikeknoop & @fchollet.

24

110

374

ARC Prize

@arcprize

5 months

As previously shared, ARC-AGI-2 (same format - verified easy for humans, harder for AI) will launch alongside ARC Prize 2025. We're committed to running the Grand Prize competition until a high-efficiency, open-source solution scoring 85% on the latest ARC-AGI is created. 3/4

3

16

374

ARC Prize

@arcprize

8 months

New ARC-AGI high score! 49% (Prize goal: 85%) Congratulations, MindsAI!

11

22

343

ARC Prize

@arcprize

5 months

Read our full o3 testing report and @fchollet's perspective on this exciting breakthrough, the future of the ARC-AGI benchmark, and the path to AGI. 4/4

3

30

326

ARC Prize

@arcprize

8 months

New ARC-AGI high score! 47% (Prize goal: 85%) Congratulations, MindsAI!

9

23

294

ARC Prize

@arcprize

10 months

New ARC-AGI high score! 43% (Prize goal: 85%) Congratulations, MindsAI!

12

17

303

ARC Prize

@arcprize

2 months

3/24/2025

12

23

318

ARC Prize

@arcprize

2 months

New ARC Prize 2025 High Score: 10.1% by @guille_bar

10

19

316

ARC Prize

@arcprize

9 months

New ARC-AGI high score! 46% (Prize goal: 85%) Congratulations, MindsAI!

5

31

283

ARC Prize

@arcprize

6 months

On Dec. 6, we'll announce the winners of ARC Prize 2024, including top score & paper award progress prizes. And we'll publish a paper documenting state-of-the-art approaches to ARC-AGI. We're now reviewing paper submissions and verifying the leaderboard. Stay tuned.

10

18

269

ARC Prize

@arcprize

5 months

The Next Chapter: ARC Prize Foundation. Beyond the benchmark - the North Star for AGI. We're excited to announce important updates to our leadership, entity structure, and initiatives for 2025. 1/5

7

22

263

ARC Prize

@arcprize

7 months

[Paper] One approach to solve ARC-AGI is to learn a domain-specific language from the training set and add to the DSL on-the-fly when faced with novel tasks.

7

35

257

ARC Prize

@arcprize

3 months

Novel test-time-training method to solve ARC-AGI without pretraining. "CompressARC achieves 34.75% on the training set and 20% on the evaluation set".

Isaac Liao

@LiaoIsaac91893

3 months

Introducing *ARC‑AGI Without Pretraining* – ❌ No pretraining. ❌ No datasets. Just pure inference-time gradient descent on the target ARC-AGI puzzle itself, solving 20% of the evaluation set. 🧵 1/4

4

22

253

ARC Prize

@arcprize

7 months

[Paper] Dreamcoder's inductive program synthesis has inspired many ARC-AGI approaches. By combining neural networks + symbolic abstractions, it can tackle tasks from programming to physics.

7

34

247

ARC Prize

@arcprize

2 months

ARC Prize 2025 Leaders. 2 weeks in, 7 months to go. The Grand Prize is still unclaimed

12

17

248

ARC Prize

@arcprize

11 months

New ARC-AGI high score! 39% (Prize goal: 85%). Congratulations, MindsAI!

6

18

238

ARC Prize

@arcprize

11 months

New ARC-AGI high score! 38% (Prize goal: 85%). Congratulations, MindsAI!

6

15

234

ARC Prize

@arcprize

10 months

New ARC-AGI high score! 41% (Prize goal: 85%) Congratulations, MindsAI!

5

17

211

ARC Prize

@arcprize

8 months

ARC Prize is now 3 months old - we're announcing:
🏆 +$100K Grand Prize (now $600k)
📜 +$25K Paper Awards (now $75k)
And we're committing funds for a US university tour in October and the development of the next iteration of ARC-AGI.

6

30

205

ARC Prize

@arcprize

2 months

Every ARC-AGI-2 task, however, is solved by at least two humans, quickly and easily. We know this because we tested 400 people live.

6

9

218

ARC Prize

@arcprize

10 months

New ARC-AGI high score! 42% (Prize goal: 85%) Congratulations, MindsAI!

5

10

208

ARC Prize

@arcprize

9 months

One goal for ARC Prize was to provide a public measure of progress towards AGI. Here's what we see now when new models like o1 come out.

4

20

213

ARC Prize

@arcprize

14 days

New ARC Prize 2025 High Score. 15.3% by @MindsAI_Jack, @MohamedOsmanML, @tufalabs

6

15

220

ARC Prize

@arcprize

2 months

Base LLMs (no reasoning) are currently scoring 0% on ARC-AGI-2. Specialized AI reasoning systems (like R1 and o3-mini) score <4%. Even AI systems with high adaptation like o1 pro and o3 low score single-digits (est.)

7

14

212

ARC Prize

@arcprize

5 months

Watch the finale of "12 Days of @OpenAI" livestream for a big announcement, starting in 3 minutes.

7

12

206

ARC Prize

@arcprize

2 months

Our belief is that once we can no longer come up with quantifiable problems that are relatively easy for humans, yet hard for AI, we have reached AGI. ARC-AGI-2 proves that we do not have AGI. New ideas are still needed!

3

9

206

ARC Prize

@arcprize

3 months

ARC-AGI-1 was designed to challenge deep learning. ARC-AGI-2 challenges reasoning systems – while still maintaining a 100% human solve rate. Early results show frontier AI systems scoring 10-20% on ARC-AGI-2 and we're launching it March 2025. This gap demonstrates that we have

10

16

199

ARC Prize

@arcprize

7 months

Deep learning is not enough to beat ARC Prize. We need something more. @mikeknoop & @fchollet share a path to defeat ARC-AGI via Program Synthesis.

6

33

192

ARC Prize

@arcprize

2 months

Are You Smarter Than A.I.? An interactive article by @nytimes covers @arcprize and @fchollet. "Some experts predict that A.I. will surpass human intelligence within the next few years. Play this puzzle to see how far the machines have to go."

4

30

191

ARC Prize

@arcprize

4 months

DeepSeek performance is on par, albeit slightly lower, with o1-preview.

2

12

187

ARC Prize

@arcprize

10 months

Not only can children solve ARC-AGI tasks, they can create them, too.

9

20

153

ARC Prize

@arcprize

2 months

ARC-AGI-1 (2019) pinpointed the moment AI moved beyond pure memorization in late 2024, as demonstrated by OpenAI's o3 system. Now, ARC-AGI-2 raises the bar significantly, challenging known test-time adaptation methods. @MLStreetTalk is helping us launch ARC-AGI-2 with an interview.

1

4

183

ARC Prize

@arcprize

1 month

New ARC Prize 2025 High Score. 12.4% by @MindsAI_Jack, @MohamedOsmanML @ @tufalabs

7

9

177

ARC Prize

@arcprize

3 months

Inspired by @karpathy's recent tweet - games are a great test environment for AI. They require:
• Real-time decisions
• Multiple objectives
• Spatial reasoning
• Dynamic environments
So we built SnakeBench to explore how LLMs would do.

Andrej Karpathy

@karpathy

4 months

I quite like the idea using games to evaluate LLMs against each other, instead of fixed evals. Playing against another intelligent entity self-balances and adapts difficulty, so each eval (/environment) is leveraged a lot more. There's some early attempts around. Exciting area.

5

4

178

ARC Prize

@arcprize

2 months

Symbolic interpretation. Frontier AI reasoning systems struggle with tasks requiring symbols to be interpreted as having meaning beyond their visual patterns. Systems attempted symmetry checking, mirroring, transformations, and even recognized connecting elements, but failed to

4

8

180

ARC Prize

@arcprize

9 months

"Abstraction is the key to generalization." @fchollet. Where we are now vs where we are going on the spectrum of abstraction.

7

28

164

ARC Prize

@arcprize

9 months

[Paper] Neural diffusion models that:
* Invert "noise" applied to syntax trees
* Iteratively edit code while preserving syntactic validity, making it easy to combine the model w/ search
* Learn to convert images into programs that produce those images

5

29

172

ARC Prize

@arcprize

2 months

ARC Prize 2025 is Live. $1M competition to open source a solution to ARC-AGI. Your objective: Reach 85% on the private evaluation dataset. Progress needs new ideas, not just scale

7

34

166

ARC Prize

@arcprize

2 months

Intelligence isn't just capability; it's efficiency. We can no longer report performance as a single metric. Going forward our leaderboard will track the *cost* of performance as a first class citizen. ARC-AGI-2 is showing material resistance over ARC-AGI-1 towards reasoning

3

9

164
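
One way to read "cost as a first class citizen" is to ask which results are not beaten on both cost and score at once. The sketch below is illustrative only and is not the ARC Prize leaderboard's actual method: the `Result` type and `pareto_frontier` helper are hypothetical names, and the figures are the Semi-Private scores and per-task costs quoted in the o3-mini, o1 and GPT-4.5 posts in this feed.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    cost_per_task: float  # USD per task, as quoted in the posts
    score: float          # percent of Semi-Private tasks solved


# Figures copied from the o3-mini, o1 and GPT-4.5 posts in this feed.
RESULTS = [
    Result("o3-mini (low)", 0.009, 11.0),
    Result("o3-mini (medium)", 0.02, 29.0),
    Result("o3-mini (high)", 0.04, 35.0),
    Result("GPT-4.5", 0.29, 10.33),
    Result("o1 (low)", 1.5, 25.0),
    Result("o1 (medium)", 2.5, 31.0),
    Result("o1 (high)", 3.8, 32.0),
]


def pareto_frontier(results):
    """Keep a result only if it scores higher than every cheaper result.

    Hypothetical helper: sorts by cost (higher score first on ties) and
    walks upward, tracking the best score seen so far.
    """
    frontier = []
    best_score = float("-inf")
    for r in sorted(results, key=lambda r: (r.cost_per_task, -r.score)):
        if r.score > best_score:
            frontier.append(r)
            best_score = r.score
    return frontier


if __name__ == "__main__":
    for r in pareto_frontier(RESULTS):
        print(f"{r.name}: {r.score}% at ${r.cost_per_task}/task")
```

With these quoted figures, only the three o3-mini settings remain on the frontier, since every more expensive entry scores lower than o3-mini (high).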

ARC Prize

@arcprize

1 month

o3-medium does great on ARC-AGI-1, scoring 41% (low) and 53% (medium). This is currently the top public model score we've verified. o4-mini shows SOTA levels of efficiency costing only $0.05/task for 21% accuracy. In contrast, o1-pro (low compute) requires ~$11/task for the

3

8

163

ARC Prize

@arcprize

10 months

New baseline score on ARC-AGI-Pub. GPT-4o mini: 3.6%. This score is against the public evaluation set

20

12

156

ARC Prize

@arcprize

2 months

ARC-AGI-2 isn't about superhuman skills; it's about exposing what's missing in current AI: efficient acquisition of new skills. It challenges capabilities like symbolic interpretation, compositional reasoning, and contextual rule application.

1

7

162

ARC Prize

@arcprize

10 months

Leading AGI research lab DeepMind is implementing nearly identical techniques we're seeing at the top of ARC Prize leaderboards.
1. test time fine-tuning
2. blast inference + search
AlphaProof = closed source
ARC Prize winning solutions = open source (by end of 2024)

7

14

157

ARC Prize

@arcprize

9 months

[Paper] "Automated Design of Agentic Systems" features a case study on ARC-AGI. This approach - Meta Agent Search - progressively discovers novel agents that outperform state-of-the-art, hand-designed agents.

2

25

154

ARC Prize

@arcprize

9 months

The spectrum of key factors that define intelligence.
1. Fluidity: Memorization vs skill synthesis.
2. Operational area: Low vs high abstraction in skill usage.
3. Information efficiency: High vs. low amount of data to acquire skills.
@fchollet

3

20

105

ARC Prize

@arcprize

4 months

DeepSeek's R1-Zero is significant because it achieves strong reasoning performance *without* human-labeled data (SFT). It only relies on Reinforcement Learning (RL). This overcomes the friction of human data bottlenecks. 2/4

1

9

150

ARC Prize

@arcprize

2 months

We're looking for a creative Unity game developer to help us create fun and challenging mini-games. This is a contract position for a remote game development role. Required Skills:
* Strong Unity Editor and C# programming
* 2 years of game development experience
* Strong

13

25

147

ARC Prize

@arcprize

9 months

[Paper] Current high-scoring team member @bayesilicon shares an ARC-AGI training task generator. More examples "... should enable a wide range of experiments that may be important stepping stones towards making leaps on the benchmark ..."

3

14

143

ARC Prize

@arcprize

21 days

ARC-AGI-2 has been added to @huggingface's Lighteval. As you evaluate your models with Lighteval, ARC-AGI-2 will now be featured as an output.

Nathan

@nathanhabib1011

22 days

🔥 Evaluating LLMs? You need Lighteval — the fastest, most flexible toolkit for benchmarking models, built by @huggingface. Now with:
✅ Plug & play custom model inference (evaluate any backend)
📈 Tasks like AIME, GPQA:diamond, SimpleQA, and hundreds more
Details below 🧵👇

8

18

141

ARC Prize

@arcprize

7 months

ARC Prize 2024 is now closed for code submissions! 🏁 Thank you to everyone who participated. We made incredible progress on ARC-AGI. Next: paper deadline Tuesday + a review period where the Kaggle & ARC Prize teams will verify winning solutions. Winners announced Dec. 6.

3

12

135

ARC Prize

@arcprize

2 months

Compositional Reasoning. AI reasoning systems struggle with tasks requiring simultaneous application of rules, or application of multiple rules that interact with each other. In contrast, if a task only has one, or very few, global rules, we found these systems can consistently

2

6

135

ARC Prize

@arcprize

2 months

.@LiaoIsaac91893 has open sourced his "ARC-AGI Without Pretraining" notebook on Kaggle. You can use it today and enter ARC Prize 2025. It currently scores 4.17% on ARC-AGI-2 (5th place). Amazing mid-year sharing and contribution. Thank you Isaac.

Isaac Liao

@LiaoIsaac91893

3 months

Introducing *ARC‑AGI Without Pretraining* – ❌ No pretraining. ❌ No datasets. Just pure inference-time gradient descent on the target ARC-AGI puzzle itself, solving 20% of the evaluation set. 🧵 1/4

2

8

130

ARC Prize

@arcprize

9 months

[Paper] ARC-AGI remains unsolved despite its seemingly simple content and evaluation methods. To overcome the limitations of result-oriented analyses, this study embraces the Language of Thought Hypothesis (LoTH).

3

17

113

ARC Prize

@arcprize

9 months

[Paper] "A very cool example of how you can use LLMs with discrete program search ... getting a 2x improvement on ARC-AGI." @fchollet #AGI24

3

9

126

ARC Prize

@arcprize

7 months

"Spatially-aware tokenization scheme" is a concept @fchollet has been speaking about on our university tour. Transformers aren't just for sequences - they can be made to work with any kind of data structure, including grids.

Wenhao Li

@WenhaoLi29

7 months

We trained a Vision Transformer to solve ONE single task from @fchollet and @mikeknoop’s @arcprize. Unexpectedly, it failed to produce the test output, even when using 1 MILLION examples! Why is this the case? 🤔

2

10

124

ARC Prize

@arcprize

4 months

"Inference as Training.". Reasoning systems can generate high-quality data during inference, which can then be used to further train and improve the model. This creates a powerful feedback loop and a potential runaway effect for companies with large user bases. 3/4.

1

8

125

ARC Prize

@arcprize

3 months

We've also begun early design work on ARC-AGI-3 to endure 3+ years, launching in 2026. We're exploring ARC-like games to formally codify efficiency, require new capabilities like goal orientation, massively raise the bar for on-the-fly skill acquisition, and resist large-scale

4

8

116

ARC Prize

@arcprize

2 months

Behind the scenes building ARC-AGI. Last month we hosted a private launch of ARC-AGI-2 for sponsors and labs. @fchollet went deep on how we made ARC-AGI-2, performance testing, and the dataset change log

13

17

115

ARC Prize

@arcprize

1 month

Key finding #1: Early responses showed higher accuracy. We noticed that tasks which the model returned sooner had higher accuracy. Those that took longer, either in duration or token usage, were more likely to fail. This signals that the model comes to conclusion or has higher

2

3

118

ARC Prize

@arcprize

2 months

ARC Prize 2025 ($1M, open-source required) is designed to drive progress on this specific gap. Last year's competition had over 1.5K teams participate and 40 research papers published.

1

6

115

ARC Prize

@arcprize

2 months

The Rise of Fluid Intelligence. "@fchollet is on a quest to make AI a bit more human". A thorough article by @matteo_wong explaining ARC-AGI and @arcprize

7

18

111

ARC Prize

@arcprize

9 months

[Paper] This approach performs program synthesis using grid descriptions rather than transformation sequences. It scores decently well on the ARC-AGI public eval set without a handcrafted DSL.

4

14

103

ARC Prize

@arcprize

1 month

After launching ARC-AGI-2, the community quickly pointed out some inaccuracies in our public eval sets. We're grateful for this helpful feedback, it makes our benchmark stronger and more reliable for everyone involved. In response, we performed a thorough internal and external.

0

4

107

ARC Prize

@arcprize

7 months

Paper submissions (due today by 11:59pm UTC) are flowing in! We're excited to share these intriguing approaches from top universities, AI labs, and Fortune 500 companies soon. Review paper + winners announcement coming Dec. 6.

7

9

103

ARC Prize

@arcprize

9 months

[Paper] ARC-AGI is hard: a vast action space, hard-to-reach goals, a variety of tasks. ARCLE is an environment for reinforcement learning research on ARC that shows agents can learn via proximal policy optimization.

1

11

93

ARC Prize

@arcprize

11 months

Can you solve it? Reply with the answer.

21

5

99

ARC Prize

@arcprize

4 months

Thank you to @rishab_partha for helping with this analysis. The purpose of the 100 Semi-Private Tasks is to provide a secondary hold out test set score. The 400 Public Eval tasks were published in 2019. They have been widely studied and included in other model training data.

1

2

100

ARC Prize

@arcprize

9 months

[Paper] Test-Time Training (TTT) layers show an ability to dynamically adapt & learn during test time, which is essential for the diverse & novel problems in ARC-AGI. This approach outperforms Transformers & RNNs in long-context tasks.

1

12

97

ARC Prize

@arcprize

2 months

Contextual Rule Application. AI reasoning systems struggle with tasks where rules must be applied differently based on context. Systems tend to fixate on superficial patterns rather than understanding the underlying selection principles.

1

4

100

ARC Prize

@arcprize

5 months

We’re working on more OpenAI model analysis that we’ll share soon. Stay tuned…

4

5

97

ARC Prize

@arcprize

3 months

Key findings from SnakeBench:
1. Reasoning models dominated - o3-mini and DeepSeek won 78% of their matches.
2. Context is crucial - Models still needed extensive board data and clear coordinate systems to play effectively

3

5

99

ARC Prize

@arcprize

2 months

This year, we've over 2X’d compute, strengthened open-source requirements, and adjusted scoring to incentivize conceptual breakthroughs, not just leaderboard climbing. The prize categories are simple:
* Most Significant Conceptual Contribution ($75K)
* Highest Score ($50K)
The.

1

2

98

ARC Prize

@arcprize

1 month

During testing, both o3 and o4-mini frequently failed to return outputs when run at “high” reasoning. The partial results we did receive are in the blog post. However, these reasoning efforts were excluded from the leaderboard due to insufficient coverage.

3

96

ARC Prize

@arcprize

1 month

Key finding #2: Higher reasoning can be inefficient. When comparing o3-medium and o3-high on the same tasks, we found that o3-high consistently used more tokens to arrive at the same answers. While this isn’t surprising, it highlights a key tradeoff: o3-high can offer no accuracy

1

3

97