Jan Disselhoff
@JDisselh
Deep Learning Scientist | The ARChitects Kaggle Team
Joined November 2025
ARC Prize 2025 is over: an amazing contest, with amazing people competing. This year our team, "the ARChitects", managed to reach second place. We tried a lot of things; some thoughts and an explanation of our approach below!
DPO pushed baguettotron so far into unreadable experimental land that I didn't like it; however, skipping straight from SFT to GRPO is producing moments that make me forget that this model is only 371M params. GRPO w/ mostly format reward (</think>, title, length), a huge
baguettotron poetry LLM experiments, complete and to come:
- train a baguettotron Bradley-Terry reward model on 10k Kimi vs. Gemma 3n poems (failed; look at the data: reward hacking of formatting quirks)
- SFT baguettotron on 10k Kimi poems and reverse-engineered SYNTH reasoning traces
Ivan Sorokin and I are the official winners of the ARC Prize competition, with a significant lead over other teams. Thanks to @kaggle and @arcprize for hosting the competition. NVIDIA tech blog summarizing what we did: https://t.co/BU8nHPCliJ Our writeup:
ARC Prize 2025 Winners Interviews: Top Score 1st Place, NVARC (@JFPuget, Ivan Sorokin), detail their synthetic-data-driven ensemble of an improved ARChitects-style, test-time-trained model + TRM-based components that reaches ~24% on ARC-AGI-2 under Kaggle contest constraints.
Announcing the ARC Prize 2025 Top Score & Paper Award winners. The Grand Prize remains unclaimed. Our analysis on AGI progress marks 2025 as the year of the refinement loop.
One ingredient of our solution is the Tiny Recursive Model of @jm_alexia. During the competition we got a score of 10% on the semi-private dataset of ARC-AGI-2, and 10.41% on the public eval dataset. I further trained TRM for 10 more days using the same recipe as in our
We also appear on the ARC-AGI-2 leaderboard. Not the best score, but clearly on the Pareto frontier, with a much lower cost than the best scores.
(P.S. I vaguely remember some paper that merges word embeddings to reduce token counts in LLMs that I wanted to link, but can't find anymore. If anyone knows what I am talking about hmu, or share below!)
All in all an amazing experience! Huge thanks to the organizers @arcprize and congratulations to the other winning teams and papers! Definitely check them out at
arcprize.org
Prize information, rules, and key dates.
The things above were what worked, but of course we had a lot of approaches that did not pan out. The most frustrating one: Dave invested a lot of time into synthetic data generation, which was the approach that the first-place NVARC team used! (More examples in the blog)
Recurrence seems a popular approach currently, and I think that our final approach is somewhat close to HRM/TRM style solutions. Adding some intermediary reasoning tokens would make them even more similar, and is something we might test in the future.
Even though this is not how the model was trained, it handles this very well and was suddenly able to fix errors and solve problems it previously struggled with! That insight allowed us to increase our score to our final leaderboard score.
In the last weeks of the competition, Dave and Daniel had a breakthrough: they stopped using the masked diffusion as such, and did not fully demask tokens! Instead, partial solutions are combined with mask tokens and recurrently fed back into the LLM!
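Roughly, such a recurrent partial-demasking loop could look like the sketch below. Everything here is an illustrative assumption, not the team's actual code: the model is a toy stub, and the confidence threshold is made up.

```python
MASK = -1  # sentinel for a masked cell

def model_logits(tokens):
    # Toy stand-in for the masked-diffusion LLM: "predicts" each masked
    # position as its index mod 10, with confidence growing with the
    # number of already-known neighbors. Purely for illustration.
    preds = []
    for i, t in enumerate(tokens):
        if t == MASK:
            known = sum(1 for j in (i - 1, i + 1)
                        if 0 <= j < len(tokens) and tokens[j] != MASK)
            preds.append((i, i % 10, 0.4 + 0.3 * known))  # (pos, token, conf)
    return preds

def recurrent_demask(tokens, conf_threshold=0.6, max_steps=10):
    # Instead of demasking everything at once, keep only high-confidence
    # predictions, leave the rest masked, and feed the partial solution
    # back into the model on the next step.
    for _ in range(max_steps):
        preds = model_logits(tokens)
        if not preds:
            break
        committed = False
        for pos, tok, conf in preds:
            if conf >= conf_threshold:
                tokens[pos] = tok
                committed = True
        if not committed:  # avoid stalling: commit the single best guess
            pos, tok, _ = max(preds, key=lambda p: p[2])
            tokens[pos] = tok
    return tokens
```

The point of the loop is that later steps see earlier committed tokens as context, so the model can refine uncertain cells instead of guessing everything in one shot.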
The issue was that our approach from last year was incredibly effective at utilizing additional compute, but we had no way of doing the same for the masked diffusion model! The base model was stronger, but did not scale well in inference!
All of this allowed us to build a masked diffusion model that was able to solve ARC tasks. We had amazing performance on the public eval dataset in our tests and then... could not increase our score on the leaderboard...
Additionally, we experimented with manipulating positional embeddings to allow the model a better understanding of the 2D structure (see https://t.co/IEgWUEZJka). This helped, but less than expected. RoPE is surprisingly adaptable, even to problems it was not designed to handle.
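For context, one common way to adapt RoPE to a grid is to spend half the rotary frequencies on the row index and half on the column index. This is a hedged sketch of that general idea, not necessarily the exact variant used here:

```python
def rope_angles(pos, dim, base=10000.0):
    # Standard 1D RoPE: one rotation angle per frequency pair.
    return [pos / base ** (2 * i / dim) for i in range(dim // 2)]

def rope_2d_angles(row, col, dim, base=10000.0):
    # Illustrative 2D variant: half the frequencies encode the row index,
    # the other half the column index, so attention can measure vertical
    # and horizontal offsets on the grid separately.
    half = dim // 2
    return rope_angles(row, half, base) + rope_angles(col, half, base)
```

With this split, two cells in the same column share the column half of their angles exactly, which gives the model a direct handle on vertical alignment that a flattened 1D position cannot express.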
While there are issues with that in natural language, on ARC this can be of great benefit, since we can predict parts of the problem very easily (such as the background), and puzzle tasks become much simpler. For issues see, for example, here: https://t.co/Ez2yqMMZCV
Diffusion LLMs (DLLM) can do "any-order" generation, in principle more flexible than left-to-right (L2R) LLM. Our main finding is uncomfortable: In real language, this flexibility backfires: DLLMs become worse probabilistic models than the L2R / R2L AR LMs. This
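A minimal sketch of exploiting that order freedom on ARC: given per-cell probability distributions from the masked-diffusion model (a hypothetical interface, assumed here for illustration), unmask the lowest-entropy cells, such as uniform background, first.

```python
import math

def unmask_order(token_probs):
    # token_probs[i] is the model's probability distribution over colors
    # for masked cell i (hypothetical interface). Easy cells like
    # background have near-certain, low-entropy distributions, so we
    # commit to them before the ambiguous object cells.
    def entropy(dist):
        return -sum(p * math.log(p) for p in dist if p > 0)
    return sorted(range(len(token_probs)), key=lambda i: entropy(token_probs[i]))
```

For example, a cell that is 99% one color gets unmasked before a 50/50 cell, so the hard parts of the puzzle are predicted with the easy parts already filled in as context.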
However, we were also working on a different approach, using masked diffusion LLMs based on LLaDA (https://t.co/accFw5lW8A). While these models are often cited for their inference speed, we were far more interested in their ability to choose the order in which they unmask tokens!
github.com
Official PyTorch implementation for "Large Language Diffusion Models" - ML-GSAI/LLaDA
Using our old method with some inference optimizations therefore saturated at ~17 points on the public leaderboard, 14.17 points on the private score.
However, we saw that this method had a hard time on ARC-2. Due to the autoregressive nature, it was easy to make early prediction mistakes that the model was then unable to fix. Our method struggled especially on puzzle and simulation tasks, as well as at predicting diagonals.
When the contest began, we were still working on optimizing our approach from the previous year, which won ARC-2024. It used finetuned LLMs with a custom sampling method and a selection scheme that allowed us to leverage test-time compute very efficiently!
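As an illustration of what such a selection scheme can look like (the interface and the vote-then-confidence rule below are assumptions for the sketch, not the actual ARChitects code): sampled candidates are combined by majority vote, with mean log-probability under the model as the tie-breaker.

```python
from collections import Counter

def select_candidate(candidates):
    # `candidates` is a list of (answer, per_token_logprobs) pairs from
    # repeated sampling; this interface is hypothetical. First try a
    # majority vote over answers, then break ties by average model
    # confidence (mean token log-probability).
    votes = Counter(answer for answer, _ in candidates)
    best_count = max(votes.values())
    tied = [a for a, c in votes.items() if c == best_count]
    if len(tied) == 1:
        return tied[0]

    def mean_logprob(answer):
        # Average log-probability over every sample of this answer.
        lps = [lp for a, sample in candidates if a == answer for lp in sample]
        return sum(lps) / len(lps)

    return max(tied, key=mean_logprob)
```

The appeal of a scheme like this is that it turns extra test-time compute directly into accuracy: sampling more candidates sharpens both the vote and the confidence estimate.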