Jan Disselhoff
@JDisselh
Deep Learning Scientist | The ARChitects Kaggle Team
Joined November 2025
ARC Prize 2025 is over: an amazing contest, with amazing people competing. This year our team, "the ARChitects", managed to reach second place. We tried a lot of things; some thoughts and an explanation of our approach below!
DPO pushed baguettotron so far into unreadable experimental land that I didn't like it; however, skipping straight from SFT to GRPO is producing moments that make me forget that this model is only 371M params. GRPO w/ mostly format reward (</think>, title, length), a huge
baguettotron poetry LLM experiments, complete and to come:
- train a baguettotron Bradley-Terry reward model on 10k Kimi vs. Gemma 3n poems (failed; look at the data: reward hacking of formatting quirks)
- SFT baguettotron on 10k Kimi poems and reverse-engineered SYNTH reasoning traces
Ivan Sorokin and I are the official winners of the ARC Prize competition, with a significant lead over other teams. Thanks to @kaggle and @arcprize for hosting the competition. NVIDIA tech blog summarizing what we did: https://t.co/BU8nHPCliJ Our writeup:
ARC Prize 2025 Winners Interviews: Top Score 1st Place, NVARC (@JFPuget, Ivan Sorokin), detail their synthetic-data-driven ensemble of an improved ARChitects-style, test-time-trained model + TRM-based components that reaches ~24% on ARC-AGI-2 under Kaggle contest constraints.
Announcing the ARC Prize 2025 Top Score & Paper Award winners. The Grand Prize remains unclaimed. Our analysis on AGI progress marks 2025 as the year of the refinement loop.
One ingredient of our solution is the Tiny Recursive Model of @jm_alexia. During the competition we got a score of 10% on the semi-private dataset of ARC-AGI-2, and 10.41% on the public eval dataset. I further trained TRM for 10 more days using the same recipe as in our
We also appear on the ARC-AGI-2 leaderboard. Not the best score, but clearly on the Pareto frontier, with a much lower cost than the best scores.
(P.S. I vaguely remember some paper that merges word embeddings to reduce token counts in LLMs that I wanted to link, but can't find anymore. If anyone knows what I am talking about hmu, or share below!)
All in all an amazing experience! Huge thanks to the organizers @arcprize and congratulations to the other winning teams and papers! Definitely check them out at
arcprize.org
Prize information, rules, and key dates.
The things above were what worked, but of course we had a lot of approaches that did not pan out. The most frustrating one: Dave invested a lot of time into synthetic data generation, which was the approach that the first-place NVARC team used! (More examples in the blog)
Recurrence seems a popular approach currently, and I think that our final approach is somewhat close to HRM/TRM style solutions. Adding some intermediary reasoning tokens would make them even more similar, and is something we might test in the future.
Even though this is not how the model was trained, it handles this very well and was suddenly able to fix errors and solve problems it previously struggled with! That insight allowed us to increase our score to our final leaderboard score.
In the last weeks of the competition, Dave and Daniel had a breakthrough: they stopped using the masked diffusion as such, and did not fully demask tokens! Instead, partial solutions are combined with mask tokens and recurrently fed back into the LLM!
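Roughly, such a recurrent partial-demasking loop could look like the sketch below. Everything here is an illustrative assumption, not the team's actual code: the model is a toy stub, and the confidence threshold is made up.

```python
MASK = -1  # sentinel for a masked cell

def model_logits(tokens):
    # Toy stand-in for the masked-diffusion LLM: "predicts" each masked
    # position as its index mod 10, with confidence growing with the
    # number of already-known neighbors. Purely for illustration.
    preds = []
    for i, t in enumerate(tokens):
        if t == MASK:
            known = sum(1 for j in (i - 1, i + 1)
                        if 0 <= j < len(tokens) and tokens[j] != MASK)
            preds.append((i, i % 10, 0.4 + 0.3 * known))  # (pos, token, conf)
    return preds

def recurrent_demask(tokens, conf_threshold=0.6, max_steps=10):
    # Instead of demasking everything at once, keep only high-confidence
    # predictions, leave the rest masked, and feed the partial solution
    # back into the model on the next step.
    for _ in range(max_steps):
        preds = model_logits(tokens)
        if not preds:
            break
        committed = False
        for pos, tok, conf in preds:
            if conf >= conf_threshold:
                tokens[pos] = tok
                committed = True
        if not committed:  # avoid stalling: commit the single best guess
            pos, tok, _ = max(preds, key=lambda p: p[2])
            tokens[pos] = tok
    return tokens
```

The point of the loop is that later steps see earlier committed tokens as context, so the model can refine uncertain cells instead of guessing everything in one shot.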
The issue was that our approach from last year was incredibly effective at utilizing additional compute, but we had no way of doing the same for the masked diffusion model! The base model was stronger, but did not scale well in inference!
All of this allowed us to build a masked diffusion model that was able to solve ARC tasks. We had amazing performance on the public eval dataset in our tests and then... could not increase our score on the leaderboard...
Additionally, we experimented with manipulating positional embeddings to allow the model a better understanding of the 2D structure (see https://t.co/IEgWUEZJka). This helped, but less than expected. RoPE is surprisingly adaptable, even to problems it was not designed to handle.
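For context, one common way to adapt RoPE to a grid is to spend half the rotary frequencies on the row index and half on the column index. This is a hedged sketch of that general idea, not necessarily the exact variant used here:

```python
def rope_angles(pos, dim, base=10000.0):
    # Standard 1D RoPE: one rotation angle per frequency pair.
    return [pos / base ** (2 * i / dim) for i in range(dim // 2)]

def rope_2d_angles(row, col, dim, base=10000.0):
    # Illustrative 2D variant: half the frequencies encode the row index,
    # the other half the column index, so attention can measure vertical
    # and horizontal offsets on the grid separately.
    half = dim // 2
    return rope_angles(row, half, base) + rope_angles(col, half, base)
```

With this split, two cells in the same column share the column half of their angles exactly, which gives the model a direct handle on vertical alignment that a flattened 1D position cannot express.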
While there are issues with that in natural language, on ARC this can be of great benefit, since we can predict parts of the problem very easily (such as the background), and puzzle tasks become much simpler. For issues see, for example, here: https://t.co/Ez2yqMMZCV
Diffusion LLMs (DLLM) can do "any-order" generation, in principle more flexible than left-to-right (L2R) LLM. Our main finding is uncomfortable: In real language, this flexibility backfires: DLLMs become worse probabilistic models than the L2R / R2L AR LMs. This
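A minimal sketch of exploiting that order freedom on ARC: given per-cell probability distributions from the masked-diffusion model (a hypothetical interface, assumed here for illustration), unmask the lowest-entropy cells, such as uniform background, first.

```python
import math

def unmask_order(token_probs):
    # token_probs[i] is the model's probability distribution over colors
    # for masked cell i (hypothetical interface). Easy cells like
    # background have near-certain, low-entropy distributions, so we
    # commit to them before the ambiguous object cells.
    def entropy(dist):
        return -sum(p * math.log(p) for p in dist if p > 0)
    return sorted(range(len(token_probs)), key=lambda i: entropy(token_probs[i]))
```

For example, a cell that is 99% one color gets unmasked before a 50/50 cell, so the hard parts of the puzzle are predicted with the easy parts already filled in as context.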
However, we were also working on a different approach, using masked diffusion LLMs based on LLaDA (https://t.co/accFw5lW8A). While these models are often cited for their inference speed, we were far more interested in their ability to choose the order in which they unmask tokens!
github.com
Official PyTorch implementation for "Large Language Diffusion Models" - ML-GSAI/LLaDA
Using our old method with some inference optimizations therefore saturated at ~17 points on the public leaderboard, 14.17 points on the private score.
However, we saw that this method had a hard time on ARC-2. Due to the autoregressive nature, it was easy to make early prediction mistakes that the model was then unable to fix. Our method struggled especially on puzzle and simulation tasks, as well as at predicting diagonals.
When the contest began, we were still working on optimizing our approach from the previous year, which won ARC-2024. It used finetuned LLMs with a custom sampling method and a selection scheme that allowed us to leverage test-time compute very efficiently!
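As an illustration of what such a selection scheme can look like (the interface and the vote-then-confidence rule below are assumptions for the sketch, not the actual ARChitects code): sampled candidates are combined by majority vote, with mean log-probability under the model as the tie-breaker.

```python
from collections import Counter

def select_candidate(candidates):
    # `candidates` is a list of (answer, per_token_logprobs) pairs from
    # repeated sampling; this interface is hypothetical. First try a
    # majority vote over answers, then break ties by average model
    # confidence (mean token log-probability).
    votes = Counter(answer for answer, _ in candidates)
    best_count = max(votes.values())
    tied = [a for a, c in votes.items() if c == best_count]
    if len(tied) == 1:
        return tied[0]

    def mean_logprob(answer):
        # Average log-probability over every sample of this answer.
        lps = [lp for a, sample in candidates if a == answer for lp in sample]
        return sum(lps) / len(lps)

    return max(tied, key=mean_logprob)
```

The appeal of a scheme like this is that it turns extra test-time compute directly into accuracy: sampling more candidates sharpens both the vote and the confidence estimate.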