Max Seitzer
@maxseitzer
650 Followers · 40 Following · 35 Media · 83 Statuses
Researcher in the DINO team at Meta FAIR. Before: PhD at Max Planck Institute for Intelligent Systems, Tübingen. Representation learning, agents, structure.
Joined January 2021
Introducing DINOv3 🦕🦕🦕 A SotA-enabling vision foundation model, trained with pure self-supervised learning (SSL) at scale. High-quality dense features, combining unprecedented semantic and geometric scene understanding. Three reasons why this matters…
12 replies · 141 reposts · 1K likes
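For readers who want to poke at the dense features themselves, here is a minimal loading sketch in PyTorch. The hub repo path, the `dinov3_vitl16` entrypoint, and the `forward_features` / `x_norm_patchtokens` API are assumptions modeled on the DINOv2 release; check the official DINOv3 repository for the exact identifiers.

```python
import torch

# Hypothetical hub identifiers, patterned after the DINOv2 release;
# consult the official DINOv3 repo for the real ones.
model = torch.hub.load("facebookresearch/dinov3", "dinov3_vitl16")
model.eval()

img = torch.randn(1, 3, 768, 768)  # stand-in for a preprocessed 768px RGB image
with torch.no_grad():
    out = model.forward_features(img)     # DINOv2-style API, assumed here
patch_tokens = out["x_norm_patchtokens"]  # (1, num_patches, dim) dense features
print(patch_tokens.shape)
```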
🚀 Internship Opportunity! The DINO team is recruiting 2026 interns at both M2 (Master’s) and PhD levels in the Paris office. 🔗 If you are interested, check out the job descriptions and apply here: https://t.co/dHnSLvNxGt
5 replies · 56 reposts · 445 likes
I'm giving a talk about DINOv3 👇
📢 Save the date! Join us for the next @ELLISforEurope x @unireps Speaker Series! 📅 8th October – 16:00 CEST 📍 https://t.co/iHc93nIQJ4 🎙️ Speakers: Keynote Talk by @maxseitzer & Flash Talk by @JRaugel
0 replies · 5 reposts · 28 likes
1/ You might have seen it: DINOv3 is out! 🦖🦕 In this thread, we share key insights on our Gram anchoring ⚓︎ and how it helps produce smooth feature maps. 👇
9 replies · 64 reposts · 492 likes
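As a rough illustration of the idea behind this thread: Gram anchoring constrains the patch-to-patch similarity (Gram) matrix of the student's dense features to stay close to that of an earlier "Gram teacher" checkpoint. A minimal sketch of such a loss, with tensor shapes assumed to be (batch, patches, dim):

```python
import torch
import torch.nn.functional as F

def gram_anchoring_loss(student_patches: torch.Tensor,
                        teacher_patches: torch.Tensor) -> torch.Tensor:
    """Sketch of a Gram-anchoring-style loss on (B, N, D) patch features."""
    s = F.normalize(student_patches, dim=-1)  # L2-normalize each patch feature
    t = F.normalize(teacher_patches, dim=-1)
    gram_s = s @ s.transpose(1, 2)            # (B, N, N) cosine similarities
    gram_t = t @ t.transpose(1, 2)
    return (gram_s - gram_t).pow(2).mean()    # Frobenius-style mismatch penalty
```

Because the constraint lives on pairwise patch similarities rather than on the features themselves, the student stays free to improve its global representation while the local similarity structure, and hence the smoothness of the feature maps, is anchored.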
Can AI help understand how the brain learns to see the world? Our latest study, led by @JRaugel from FAIR at @AIatMeta and @ENS_ULM, is now out! 📄 https://t.co/y2Y3GP3bI5 🧵 A thread:
31 replies · 302 reposts · 2K likes
We have not gone much deeper than that, so I think there is more to find out about the role of those dimensions and how to potentially avoid them!
1 reply · 0 reposts · 3 likes
4) These outliers are *different* from the high-norm tokens studied in the registers paper or by An et al. in LLMs (https://t.co/HTET8SURVB), and are not removed by registers & attention bias. My guess is that they are specific to the DINO setup of different heads & losses.
1 reply · 0 reposts · 4 likes
3) Why? I think they enable mode switching in the final LN, selecting dims for the heads. Before LN, DINO CLS & iBOT mask tokens share many top dims. After LN, the top dims differ completely. The outlier dim is distributed differently for the 2 token types, likely enabling this.
2 replies · 0 reposts · 3 likes
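A toy numerical illustration of the hypothesized mechanism (entirely synthetic: dimension 8, random affine parameters, an outlier planted in dim 3): a single huge coordinate dominates LayerNorm's mean and std, so flipping its sign pushes all other coordinates to the opposite side of zero, and the learned scale/bias then surface completely different top dims for the two token types.

```python
import torch

torch.manual_seed(0)
d = 8
ln = torch.nn.LayerNorm(d)
ln.weight.data = torch.randn(d)  # stand-in for a learned LN scale
ln.bias.data = torch.randn(d)    # stand-in for a learned LN bias

x = torch.randn(d)
cls_like, mask_like = x.clone(), x.clone()
cls_like[3], mask_like[3] = 50.0, -50.0  # same outlier dim, opposite sign

# The outlier dominates the normalization statistics, so after the
# affine map the two token types end up with different top dims.
print(torch.topk(ln(cls_like), 3).indices)
print(torch.topk(ln(mask_like), 3).indices)
```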
2) We tried different ways to get rid of the outlier dimensions (L2 reg / Linf reg / top-k masking), but either the outliers re-emerged or performance suffered. So they do indeed appear critical for the function of the model.
1 reply · 0 reposts · 4 likes
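For concreteness, here is a generic sketch of the kind of penalty the tweet above refers to. This is my illustration, not the exact regularizer used, and `k` is an arbitrary choice; per the tweet, such interventions either let the outliers re-emerge elsewhere or hurt performance.

```python
import torch

def topk_magnitude_penalty(pre_ln_feats: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Penalize the k largest-magnitude channels of the pre-LN features,
    a soft stand-in for the Linf-reg / top-k-masking ideas mentioned above."""
    topk_vals = pre_ln_feats.abs().topk(k, dim=-1).values  # (B, N, k)
    return topk_vals.mean()  # add (scaled) to the training loss
```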
1) These dimensions carry no information in the sense that the channels can be zeroed in the pre-final-LN features without a drop in downstream metrics (e.g. linear probing on the ablated features). Of course, applying the final LN to the zeroed features changes the output statistics.
1 reply · 0 reposts · 3 likes
Nice investigation! We did study those outlier dimensions a bit for the 7B model, which is summarized in section A.2 of the paper. Some comments:
DINO-v3 has a single high-magnitude channel on its residual pathway, channel 416. Turning off this single channel affects DINO's entire output by 50-80%. For context, turning off a random channel has an effect of less than one percent. The model builds up channel 416 in its last…
1 reply · 0 reposts · 16 likes
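The measurement in the quoted analysis can be reproduced with a forward pre-hook, sketched below. Both `model.norm` as the final LayerNorm and a tensor-valued forward are assumptions (they hold for timm/DINO-style ViTs); adapt to the actual codebase.

```python
import torch

@torch.no_grad()
def channel_ablation_effect(model, x, channel):
    """Relative output change when one residual-stream channel is zeroed
    right before the final LayerNorm (assumed to be `model.norm`)."""
    ref = model(x)  # assumes the forward returns a feature tensor

    def zero_channel(module, args):
        (h,) = args
        h = h.clone()
        h[..., channel] = 0.0  # ablate the chosen channel pre-LN
        return (h,)

    handle = model.norm.register_forward_pre_hook(zero_channel)
    ablated = model(x)
    handle.remove()
    return ((ablated - ref).norm() / ref.norm()).item()

# e.g. compare the reported outlier channel against a random control:
# channel_ablation_effect(model, img, 416) vs channel_ablation_effect(model, img, 7)
```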
There are still many more interesting aspects to the paper (it's massive!), so please have a read https://t.co/oAq01hmKqT
0 replies · 0 reposts · 4 likes
3) Web ≥ satellite model
This is a surprising one! Our main web model outperforms the satellite model on some geospatial tasks 🤯 Goes to show the power of massive datasets for generalization. Would be interesting to see if this holds for other domains as well, e.g. microscopy!
1 reply · 0 reposts · 1 like
2) Minimal performance loss for distilled models
We compress the big 7B model into more practical versions like the 840M H+ and the 300M L, with minimal loss despite an 8-23x reduction in params! The L especially shines on dense tasks relative to its size. Best of both worlds!
1 reply · 0 reposts · 1 like
This is a result of high-resolution adaptation (Sec. 5.1)! Before it, we saw performance drop at higher resolutions on dense tasks. After it, we get better results at higher resolutions, as it should be.
1 reply · 0 reposts · 1 like
1) Scaling to extreme resolutions
Even though the model is trained on inputs of at most 768px, it can handle WAY more than that. Features don’t degrade; they become crisper! Tested up to 4K. This is a property that emerges for the larger models (≥L), see Fig. 17.
1 reply · 0 reposts · 2 likes
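A quick way to sanity-check this claim yourself, reusing the hypothetical hub identifiers from the loading sketch further up (position-embedding interpolation is handled internally in the DINOv2 codebase and is assumed here to carry over):

```python
import torch

model = torch.hub.load("facebookresearch/dinov3", "dinov3_vitl16")  # hypothetical name
model.eval()

for res in (768, 1536, 3072):
    x = torch.randn(1, 3, res, res)
    with torch.no_grad():
        tokens = model.forward_features(x)["x_norm_patchtokens"]  # assumed API
    print(res, tokens.shape)  # the patch grid grows with resolution
```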
The DINOv3 paper is now available on arXiv: https://t.co/oAq01hmKqT Have you looked at the paper yet? Here are three observations that might not be immediately obvious from a first read 👇
[Quoted tweet: Max's DINOv3 announcement, shown above]
1 reply · 0 reposts · 9 likes
This figure from the impressive DINOv3 paper is fun to think about. Pretend it's 2018 and you're deciding what research to focus on. Self-supervised is at <40% and supervised at >80%. Would you bet on SSL ever catching up? Some people were believers even then. Have faith!
[Quoted tweet: Max's DINOv3 announcement, shown above]
7 replies · 16 reposts · 144 likes
hey we heard you liked dinov2 so we got you more of the same shit dinov3 is like dinov2 in the sense that it's much better than the things before rumor has it that plugging dinov3 on your benchmark is a low hanging sota but be quiet im not supposed to tell
[Quoted tweet: Max's DINOv3 announcement, shown above]
7 replies · 12 reposts · 199 likes
Proud to have contributed to the ground-breaking DINOv3 by reaching SOTA on COCO Object Detection, for the first time with a frozen SSL backbone and a lightweight head! For me, the debate is closed: SSL is the way!
[Quoted tweet: @AIatMeta's DINOv3 announcement, reproduced below]
4 replies · 7 reposts · 60 likes
Introducing DINOv3: a state-of-the-art computer vision model trained with self-supervised learning (SSL) that produces powerful, high-resolution image features. For the first time, a single frozen vision backbone outperforms specialized solutions on multiple long-standing dense…
346 replies · 784 reposts · 5K likes