Max Seitzer
@maxseitzer
650 Followers · 40 Following · 35 Media · 83 Statuses
Researcher in the DINO team at Meta FAIR. Before: PhD at Max Planck Institute for Intelligent Systems, Tübingen. Representation learning, agents, structure.
Joined January 2021
Introducing DINOv3 🦕🦕🦕 A SotA-enabling vision foundation model, trained with pure self-supervised learning (SSL) at scale. High-quality dense features, combining unprecedented semantic and geometric scene understanding. Three reasons why this matters…
12 replies · 141 reposts · 1K likes
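For readers who want to poke at the dense features themselves, here is a minimal loading sketch in PyTorch. The hub repo path, the `dinov3_vitl16` entrypoint, and the `forward_features` / `x_norm_patchtokens` API are assumptions modeled on the DINOv2 release; check the official DINOv3 repository for the exact identifiers.

```python
import torch

# Hypothetical hub identifiers, patterned after the DINOv2 release;
# consult the official DINOv3 repo for the real ones.
model = torch.hub.load("facebookresearch/dinov3", "dinov3_vitl16")
model.eval()

img = torch.randn(1, 3, 768, 768)  # stand-in for a preprocessed 768px RGB image
with torch.no_grad():
    out = model.forward_features(img)     # DINOv2-style API, assumed here
patch_tokens = out["x_norm_patchtokens"]  # (1, num_patches, dim) dense features
print(patch_tokens.shape)
```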
🚀 Internship Opportunity! The DINO team is recruiting 2026 interns at both M2 (Master’s) and PhD levels in the Paris office. 🔗 If you are interested, check out the job descriptions and apply here: https://t.co/dHnSLvNxGt
5 replies · 56 reposts · 445 likes
I'm giving a talk about DINOv3 👇
📢 Save the date! Join us for the next @ELLISforEurope x @unireps Speaker Series! 📅 8th October – 16:00 CEST 📍 https://t.co/iHc93nIQJ4 🎙️ Speakers: Keynote Talk by @maxseitzer & Flash Talk by @JRaugel
0 replies · 5 reposts · 28 likes
1/ You might have seen it: DINOv3 is out! 🦖🦕 In this thread, we share key insights on our Gram anchoring ⚓︎ and how it helps produce smooth feature maps. 👇
9 replies · 64 reposts · 492 likes
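As a rough illustration of the idea behind this thread: Gram anchoring constrains the patch-to-patch similarity (Gram) matrix of the student's dense features to stay close to that of an earlier "Gram teacher" checkpoint. A minimal sketch of such a loss, with tensor shapes assumed to be (batch, patches, dim):

```python
import torch
import torch.nn.functional as F

def gram_anchoring_loss(student_patches: torch.Tensor,
                        teacher_patches: torch.Tensor) -> torch.Tensor:
    """Sketch of a Gram-anchoring-style loss on (B, N, D) patch features."""
    s = F.normalize(student_patches, dim=-1)  # L2-normalize each patch feature
    t = F.normalize(teacher_patches, dim=-1)
    gram_s = s @ s.transpose(1, 2)            # (B, N, N) cosine similarities
    gram_t = t @ t.transpose(1, 2)
    return (gram_s - gram_t).pow(2).mean()    # Frobenius-style mismatch penalty
```

Because the constraint lives on pairwise patch similarities rather than on the features themselves, the student stays free to improve its global representation while the local similarity structure, and hence the smoothness of the feature maps, is anchored.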
Can AI help understand how the brain learns to see the world? Our latest study, led by @JRaugel from FAIR at @AIatMeta and @ENS_ULM, is now out! 📄 https://t.co/y2Y3GP3bI5 🧵 A thread:
31 replies · 302 reposts · 2K likes
We have not gone much deeper than that, so I think there is more to find out about the role of those dimensions and how to potentially avoid them!
1 reply · 0 reposts · 3 likes
4) These outliers are *different* from the high-norm tokens studied in the registers paper or by An et al. in LLMs (https://t.co/HTET8SURVB), and are not removed by registers & attention bias. My guess is that they are specific to the DINO setup of different heads & losses.
1 reply · 0 reposts · 4 likes
3) Why? I think they enable mode switching in the final LN, selecting dims for the heads. Before LN, DINO CLS & iBOT mask tokens share many top dims. After LN, the top dims differ completely. The outlier dim is distributed differently for the 2 token types, likely enabling this.
2 replies · 0 reposts · 3 likes
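A toy numerical illustration of the hypothesized mechanism (entirely synthetic: dimension 8, random affine parameters, an outlier planted in dim 3): a single huge coordinate dominates LayerNorm's mean and std, so flipping its sign pushes all other coordinates to the opposite side of zero, and the learned scale/bias then surface completely different top dims for the two token types.

```python
import torch

torch.manual_seed(0)
d = 8
ln = torch.nn.LayerNorm(d)
ln.weight.data = torch.randn(d)  # stand-in for a learned LN scale
ln.bias.data = torch.randn(d)    # stand-in for a learned LN bias

x = torch.randn(d)
cls_like, mask_like = x.clone(), x.clone()
cls_like[3], mask_like[3] = 50.0, -50.0  # same outlier dim, opposite sign

# The outlier dominates the normalization statistics, so after the
# affine map the two token types end up with different top dims.
print(torch.topk(ln(cls_like), 3).indices)
print(torch.topk(ln(mask_like), 3).indices)
```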
2) We tried different ways to get rid of the outlier dimensions (L2 reg / Linf reg / top-k masking), but either the outliers re-emerged or performance suffered. So they do indeed appear critical for the function of the model.
1 reply · 0 reposts · 4 likes
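For concreteness, here is a generic sketch of the kind of penalty the tweet above refers to. This is my illustration, not the exact regularizer used, and `k` is an arbitrary choice; per the tweet, such interventions either let the outliers re-emerge elsewhere or hurt performance.

```python
import torch

def topk_magnitude_penalty(pre_ln_feats: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Penalize the k largest-magnitude channels of the pre-LN features,
    a soft stand-in for the Linf-reg / top-k-masking ideas mentioned above."""
    topk_vals = pre_ln_feats.abs().topk(k, dim=-1).values  # (B, N, k)
    return topk_vals.mean()  # add (scaled) to the training loss
```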
1) These dimensions carry no information in the sense that the channels can be zeroed in the pre-final-LN features without a drop in downstream metrics (e.g. linear probing on the ablated features). Of course, applying the final LN to the zeroed features changes the output statistics.
1 reply · 0 reposts · 3 likes
Nice investigation! We did study those outlier dimensions a bit for the 7B model, which is summarized in section A.2 of the paper. Some comments:
DINO-v3 has a single high-magnitude channel on its residual pathway, channel 416. Turning off this single channel affects DINO's entire output by 50-80%. For context, turning off a random channel has an effect of less than one percent. The model builds up channel 416 in its last…
1 reply · 0 reposts · 16 likes
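The measurement in the quoted analysis can be reproduced with a forward pre-hook, sketched below. Both `model.norm` as the final LayerNorm and a tensor-valued forward are assumptions (they hold for timm/DINO-style ViTs); adapt to the actual codebase.

```python
import torch

@torch.no_grad()
def channel_ablation_effect(model, x, channel):
    """Relative output change when one residual-stream channel is zeroed
    right before the final LayerNorm (assumed to be `model.norm`)."""
    ref = model(x)  # assumes the forward returns a feature tensor

    def zero_channel(module, args):
        (h,) = args
        h = h.clone()
        h[..., channel] = 0.0  # ablate the chosen channel pre-LN
        return (h,)

    handle = model.norm.register_forward_pre_hook(zero_channel)
    ablated = model(x)
    handle.remove()
    return ((ablated - ref).norm() / ref.norm()).item()

# e.g. compare the reported outlier channel against a random control:
# channel_ablation_effect(model, img, 416) vs channel_ablation_effect(model, img, 7)
```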
There are still many more interesting aspects to the paper (it's massive!), so please have a read https://t.co/oAq01hmKqT
0 replies · 0 reposts · 4 likes
3) Web ≥ satellite model
This is a surprising one! Our main web model outperforms the satellite model on some geospatial tasks 🤯 Goes to show the power of massive datasets for generalization. Would be interesting to see if this holds for other domains as well, e.g. microscopy!
1 reply · 0 reposts · 1 like
2) Minimal performance loss for distilled models
We compress the big 7B model into more practical versions like the 840M H+ and the 300M L, with minimal loss despite an 8-23x reduction in params! The L especially shines on dense tasks relative to its size. Best of both worlds!
1 reply · 0 reposts · 1 like
This is a result of high-resolution adaptation (Sec. 5.1)! Before it, we saw performance drop at higher resolutions on dense tasks. After it, we get better results at higher resolutions, as it should be.
1 reply · 0 reposts · 1 like
1) Scaling to extreme resolutions
Even though the model is trained on inputs of at most 768px, it can handle WAY more than that. Features don’t degrade; they become crisper! Tested up to 4K. This is a property that emerges for the larger models (≥L), see Fig. 17.
1 reply · 0 reposts · 2 likes
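A quick way to sanity-check this claim yourself, reusing the hypothetical hub identifiers from the loading sketch further up (position-embedding interpolation is handled internally in the DINOv2 codebase and is assumed here to carry over):

```python
import torch

model = torch.hub.load("facebookresearch/dinov3", "dinov3_vitl16")  # hypothetical name
model.eval()

for res in (768, 1536, 3072):
    x = torch.randn(1, 3, res, res)
    with torch.no_grad():
        tokens = model.forward_features(x)["x_norm_patchtokens"]  # assumed API
    print(res, tokens.shape)  # the patch grid grows with resolution
```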
The DINOv3 paper is now available on arXiv: https://t.co/oAq01hmKqT Have you looked at the paper yet? Here are three observations that might not be immediately obvious from a first read 👇
[Quoted tweet: Max's DINOv3 announcement, shown above]
1 reply · 0 reposts · 9 likes
This figure from the impressive DINOv3 paper is fun to think about. Pretend it's 2018 and you're deciding what research to focus on. Self-supervised is at <40% and supervised at >80%. Would you bet on SSL ever catching up? Some people were believers even then. Have faith!
[Quoted tweet: Max's DINOv3 announcement, shown above]
7 replies · 16 reposts · 144 likes
hey we heard you liked dinov2 so we got you more of the same shit dinov3 is like dinov2 in the sense that it's much better than the things before rumor has it that plugging dinov3 on your benchmark is a low hanging sota but be quiet im not supposed to tell
[Quoted tweet: Max's DINOv3 announcement, shown above]
7 replies · 12 reposts · 199 likes
Proud to have contributed to the ground-breaking DINOv3 by reaching SOTA on COCO Object Detection, for the first time with a frozen SSL backbone and a lightweight head! For me, the debate is closed: SSL is the way!
[Quoted tweet: @AIatMeta's DINOv3 announcement, reproduced below]
4 replies · 7 reposts · 60 likes
Introducing DINOv3: a state-of-the-art computer vision model trained with self-supervised learning (SSL) that produces powerful, high-resolution image features. For the first time, a single frozen vision backbone outperforms specialized solutions on multiple long-standing dense…
346 replies · 784 reposts · 5K likes