
Sophia Sirko-Galouchenko
@sophia_sirko
Followers: 62 · Following: 28 · Media: 5 · Statuses: 14
PhD student in visual representation learning at https://t.co/WkgoTYyOly and Sorbonne Université (MLIA)
Paris, France
Joined April 2015
1/n 🚀 New paper out - accepted at @ICCVConference! Introducing DIP: unsupervised post-training that enhances dense features in pretrained ViTs for dense in-context scene understanding. Below: low-shot in-context semantic segmentation examples. DIP features outperform DINOv2!
2 replies · 26 reposts · 120 likes
PhD graduation season in the team continues! Today Corentin Sautier is defending his PhD on "Learning Actionable LiDAR Representations without Annotations". Good luck! 🚀
Another great event for @valeoai: the PhD defense of Corentin Sautier. His thesis, «Learning Actionable LiDAR Representations w/o Annotations», covers the papers BEVContrast (self-supervised LiDAR feature learning), SLidR and ScaLR (distillation), and UNIT and Alpine (solving tasks without labels).
2 replies · 2 reposts · 15 likes
It’s PhD graduation season in the team! Today, @Bjoern_Michele is defending his PhD on "Domain Adaptation for 3D Data". Best of luck! 🚀
1 reply · 5 reposts · 20 likes
Can open-data models beat DINOv2? Today we release Franca, a fully open-source vision foundation model. Franca with a ViT-G backbone matches (and often beats) proprietary models like SigLIPv2, CLIP, and DINOv2 on various benchmarks, setting a new standard for open-source research 🧵
13 replies · 57 reposts · 274 likes
1/ New & old work on self-supervised representation learning (SSL) with ViTs: MOCA ☕ - Predicting Masked Online Codebook Assignments w/ @SpyrosGidaris @oriane_simeoni @AVobecky @quobbe N. Komodakis, P. Pérez #TMLR #ICLR2025 Grab a ☕ and brace for a story & a 🧵
1 reply · 14 reposts · 48 likes
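For readers new to the idea, here is a minimal PyTorch sketch of a masked codebook-assignment objective: a teacher view soft-assigns each patch to the codes of an online codebook, and the student must predict those assignments for the patches it saw masked. This is a generic illustration of the concept, not MOCA's actual implementation; all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def masked_assignment_loss(student_logits, teacher_feats, codebook, mask, tau=0.1):
    """Generic masked codebook-assignment objective (sketch, hypothetical names).

    student_logits: (N, K) student predictions over K codes, from the masked view
    teacher_feats:  (N, D) teacher patch features, from the unmasked view
    codebook:       (K, D) online codebook entries
    mask:           (N,) bool, True where the patch was masked for the student
    """
    with torch.no_grad():
        sims = F.normalize(teacher_feats, dim=-1) @ F.normalize(codebook, dim=-1).T
        targets = (sims / tau).softmax(dim=-1)   # soft assignment of each patch to codes
    logp = student_logits.log_softmax(dim=-1)
    # Cross-entropy between teacher assignments and student predictions,
    # computed on masked patches only.
    return -(targets[mask] * logp[mask]).sum(dim=-1).mean()
```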
New paper out - accepted at @ICCVConference! We introduce MoSiC, a self-supervised learning framework that learns temporally consistent representations from video using motion cues. Key idea: leverage long-range point tracks to enforce dense feature coherence across time. 🧵
2 replies · 24 reposts · 129 likes
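A minimal PyTorch sketch of track-based dense consistency, assuming precomputed point tracks: features sampled along the same track are pulled together across frames with a contrastive loss. This illustrates the general idea only; it is not the MoSiC objective, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def track_consistency_loss(feat_maps, tracks, tau=0.07):
    """Pull features of the same tracked point together across frames.

    feat_maps: (T, D, H, W) dense feature maps for T video frames
    tracks:    (T, N, 2) coordinates of N long-range point tracks, in the
               [-1, 1] convention expected by F.grid_sample
    """
    T, D, H, W = feat_maps.shape
    grid = tracks.unsqueeze(2)                                # (T, N, 1, 2)
    f = F.grid_sample(feat_maps, grid, align_corners=False)   # (T, D, N, 1)
    f = F.normalize(f.squeeze(-1).permute(0, 2, 1), dim=-1)   # (T, N, D)
    targets = torch.arange(tracks.shape[1])
    loss = 0.0
    for t in range(1, T):
        # (N, N) similarities: the same track in frame 0 and frame t is the positive.
        logits = f[t] @ f[0].T / tau
        loss = loss + F.cross_entropy(logits, targets)
    return loss / (T - 1)
```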
Work done in collaboration with @SpyrosGidaris @AVobecky @abursuc @thomenicolas1. Paper: https://t.co/5JYX9NuWrd GitHub: https://t.co/irZI8BMYF4
#ICCV2025
github.com · sirkosophia/DIP: Official implementation of DIP: Unsupervised Dense In-Context Post-training of Visual Representations
0 replies · 1 repost · 9 likes
6/n Benefits 💪
- Post-trains in < 9h on a single A100 GPU.
- Improves across 6 segmentation benchmarks.
- Boosts performance on in-context depth prediction.
- Plug-and-play with different ViTs: DINOv2, CLIP, MAE.
- Robust in low-shot and domain-shift settings.
1 reply · 0 reposts · 6 likes
5/n Why is DIP unsupervised? DIP doesn't require manually annotated segmentation masks. Instead, it leverages Stable Diffusion (via DiffCut) alongside DINOv2R features to automatically construct in-context pseudo-tasks for post-training.
1 reply · 1 repost · 4 likes
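To make the pseudo-task idea concrete, here is a hypothetical sketch: given unsupervised segments per image (e.g., from DiffCut) and one embedding per segment (e.g., pooled DINOv2R features), clustering the segment embeddings yields pseudo-class ids that are consistent across images, from which support/query episodes can be sampled. None of these names come from the paper.

```python
import torch
import torch.nn.functional as F

def assign_pseudo_classes(seg_embeds, num_pseudo_classes=8, iters=10):
    """Cluster segment embeddings with cosine k-means so that visually similar
    segments across images share a pseudo-class id.

    seg_embeds: (S, D), one row per unsupervised segment (e.g. pooled DINOv2R
                features inside a DiffCut mask)
    """
    x = F.normalize(seg_embeds, dim=-1)
    centroids = x[torch.randperm(len(x))[:num_pseudo_classes]].clone()
    for _ in range(iters):
        assign = (x @ centroids.T).argmax(dim=-1)      # nearest centroid per segment
        for c in range(num_pseudo_classes):
            if (assign == c).any():
                centroids[c] = F.normalize(x[assign == c].mean(0), dim=0)
    return assign

# Hypothetical pipeline around it:
#   masks  = diffcut(image)                 # unsupervised segments (Stable Diffusion)
#   embeds = pool(dinov2r_feats, masks)     # one embedding per segment
#   ids    = assign_pseudo_classes(embeds)  # shared pseudo-labels across images
#   -> sample support/query images over these ids: one pseudo-task, zero human labels
```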
4/n Meet Dense In-context Post-training (DIP)! 🔄
- Meta-learning inspired: adopts episodic training principles.
- Task-aligned: explicitly mimics downstream dense in-context tasks during post-training.
- Purpose-built: optimizes the model for dense in-context performance.
1 reply · 0 reposts · 5 likes
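A minimal sketch of such an episodic loop, under the retrieval formulation described in 2/n: each episode classifies query patches from support patches by soft nearest-neighbor label propagation, and the cross-entropy loss is backpropagated into the ViT. Hypothetical names throughout; this is not the released DIP code.

```python
import torch
import torch.nn.functional as F

def episode_loss(vit, support_imgs, support_labels, query_imgs, query_labels,
                 num_classes, tau=0.1):
    """One episode: classify query patches from support patches by soft
    nearest-neighbor label propagation, then backprop into the ViT.

    vit(imgs) is assumed to return (B, P, D) patch features;
    support_labels / query_labels are flattened per-patch pseudo-class ids.
    """
    s = F.normalize(vit(support_imgs).flatten(0, 1), dim=-1)       # (Ns, D)
    q = F.normalize(vit(query_imgs).flatten(0, 1), dim=-1)         # (Nq, D)
    attn = ((q @ s.T) / tau).softmax(dim=-1)                       # soft NN over support
    probs = attn @ F.one_hot(support_labels, num_classes).float()  # (Nq, C)
    return F.nll_loss(probs.clamp_min(1e-8).log(), query_labels)

# Hypothetical outer loop: every iteration is a freshly sampled pseudo-task.
#   for step in range(num_steps):
#       episode = sample_pseudo_task()
#       loss = episode_loss(vit, *episode, num_classes=C)
#       loss.backward(); opt.step(); opt.zero_grad()
```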
3/n Most unsupervised (post-)training methods for dense in-context scene understanding rely on self-distillation frameworks with (somewhat) complicated objectives and network components. Hard to interpret, tricky to tune. Is there a simpler alternative? 👀
1 reply · 0 reposts · 5 likes
2/n What is dense in-context scene understanding? Dense prediction tasks are formulated as nearest-neighbor retrieval problems, using patch feature similarities between the query and the labeled prompt images (introduced in @ibalazevic et al.'s HummingBird; figure below from their work).
1 reply · 0 reposts · 5 likes
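In code, this retrieval formulation takes only a few lines. A minimal sketch (hypothetical tensor names; not the HummingBird implementation): each query patch is labeled by a similarity-weighted vote over its nearest prompt patches.

```python
import torch
import torch.nn.functional as F

def in_context_segmentation(query_feats, prompt_feats, prompt_labels,
                            k=10, num_classes=21):
    """Label query patches by nearest-neighbor retrieval over prompt patches.

    query_feats:   (Nq, D) patch features of the query image
    prompt_feats:  (Np, D) patch features of the labeled prompt image(s)
    prompt_labels: (Np,)   per-patch class indices of the prompt(s)
    """
    q = F.normalize(query_feats, dim=-1)
    p = F.normalize(prompt_feats, dim=-1)
    sim = q @ p.T                                  # (Nq, Np) cosine similarities
    topk_sim, topk_idx = sim.topk(k, dim=-1)       # k nearest prompt patches per query patch
    onehot = F.one_hot(prompt_labels[topk_idx], num_classes).float()  # (Nq, k, C)
    weights = topk_sim.softmax(dim=-1).unsqueeze(-1)                  # similarity-weighted vote
    scores = (weights * onehot).sum(dim=1)         # (Nq, C) class scores
    return scores.argmax(dim=-1)                   # predicted class per query patch
```

No decoder is trained here: segmentation quality depends entirely on how discriminative the dense features are, which is exactly what DIP post-trains for.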
Want to give audio abilities to your VLM without compromising its vision performance? Want to align your audio encoder with a pretrained image encoder without suffering from the modality gap? Check out our #NeurIPS2024 paper with @michelolzam @Steph_lat and Slim Essid.
1 reply · 3 reposts · 19 likes
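As a baseline picture of what "aligning an audio encoder with a pretrained image encoder" means, here is a generic contrastive-alignment sketch with the image tower frozen (which is what preserves vision performance). This is the standard recipe in which the modality gap arises, not the paper's method; all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def audio_image_alignment_loss(audio_encoder, image_encoder, audios, images, tau=0.07):
    """Contrastively align audio embeddings with a frozen image encoder.

    audios / images: a batch of B matched audio-image pairs.
    """
    with torch.no_grad():                                  # frozen image tower
        img = F.normalize(image_encoder(images), dim=-1)   # (B, D)
    aud = F.normalize(audio_encoder(audios), dim=-1)       # (B, D)
    logits = aud @ img.T / tau                 # (B, B): matched pairs on the diagonal
    targets = torch.arange(len(images), device=logits.device)
    # Symmetric InfoNCE over audio->image and image->audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```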
The preprint of our work (with @salah_zaiem and @AlgayresR) on sample-dependent ASR model selection is available on arXiv! We propose to train a decision module that, given an audio sample, selects the smallest model sufficient for a good transcription.
1 reply · 4 reposts · 11 likes
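Schematically, the decision-module idea could look like the sketch below: a lightweight head scores each candidate ASR model's suitability for the input audio, and the smallest model that clears a threshold is used. The names, architecture, and selection rule are all hypothetical, not the paper's.

```python
import torch
import torch.nn as nn

class ModelSelector(nn.Module):
    """Lightweight decision module: scores, per audio sample, how suitable each
    candidate ASR model is (a sketch, not the paper's architecture)."""
    def __init__(self, feat_dim, num_models):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                  nn.Linear(256, num_models))

    def forward(self, audio_feats):            # (B, feat_dim) pooled audio features
        return self.head(audio_feats)          # (B, num_models) suitability scores

def transcribe(audio_feats, audio, selector, asr_models, threshold=0.5):
    """asr_models is assumed sorted from smallest to largest; pick the first model
    whose predicted suitability clears the threshold, else fall back to the largest."""
    scores = selector(audio_feats.unsqueeze(0)).sigmoid().squeeze(0)  # (num_models,)
    for model, score in zip(asr_models, scores):
        if score >= threshold:
            return model(audio)
    return asr_models[-1](audio)
```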