Théophane Vallaeys
@webalorn
Followers 105 · Following 196 · Media 14 · Statuses 56
PhD student @MetaAI (FAIR Paris) and Sorbonne University | Graduated from ENS | Into generative image modeling
Paris, France
Joined May 2023
🎆 Can we achieve high compression rates for images in autoencoders without compromising quality or decoding speed? ⚡️ We introduce SSDD (Single-Step Diffusion Decoder), achieving improvements on both fronts and setting a new state of the art in image reconstruction. 👇 1/N
5 replies · 34 reposts · 169 likes
Today we’re excited to unveil a new generation of Segment Anything Models: 1️⃣ SAM 3 enables detecting, segmenting, and tracking objects across images and videos, now with short text phrases and exemplar prompts. 🔗 Learn more about SAM 3: https://t.co/tIwymSSD89 2️⃣ SAM 3D…
107 replies · 589 reposts · 4K likes
🚨New Paper @AIatMeta 🚨 You want to train a massively multilingual model, but languages keep interfering and you can’t boost performance? Using a dense model is suboptimal when mixing many languages, so what can you do? You can use our new architecture, Mixture of Languages (sketched below)! 🧵1/n
3 replies · 11 reposts · 22 likes
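The thread doesn't spell out the architecture, so here is only a hedged sketch of what language-conditioned expert routing can look like; the class name, the per-language FFN experts, and the shared expert are all our assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class LanguageRoutedFFN(nn.Module):
    """Toy mixture layer: tokens are routed to their language's FFN expert."""

    def __init__(self, d_model: int, d_ff: int, n_languages: int):
        super().__init__()
        # One FFN expert per language (assumption: routing by language ID).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_languages)
        )
        # A shared expert so languages can still transfer to each other (assumption).
        self.shared = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); lang_id: (batch,) integer language indices.
        out = torch.empty_like(x)
        for lid in lang_id.unique().tolist():
            mask = lang_id == lid
            out[mask] = self.experts[lid](x[mask])  # language-specific path
        return out + self.shared(x)                 # blended with the shared path
```

Hard routing by language ID keeps one language's gradients out of another language's expert, which is one plausible way to reduce the interference the tweet describes.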
Text-to-Image models don't need 3 training stages anymore! 🤯 Our new MIRO method integrates human alignment directly into pretraining. 19x faster convergence ⚡ 370x less compute than FLUX-dev 📉 Train once, align to many rewards. The era of multi-stage training is over!
1 reply · 14 reposts · 31 likes
This work also showcases the use of diffusion decoders to reconstruct images from semantic embeddings, as these decoders are able to "fill in" the details. This is an application of our SSDD model:
↳ Quoting the SSDD announcement thread above (1/N).
0 replies · 0 reposts · 1 like
Our new work, led by Xiangyi Chen, shows how we can extract downsampled embeddings from semantic encoders. They outperform other types of encoders for both generation and understanding (see the pooling sketch below)!
Why add REPA when you can be explicit and use the VLM representation to generate? 🤔 We found the semantic encoder already has the right priors. Train it to sample in its native latent space + lightweight pixel decoder = unified vision model. But naively using the semantic…
1 reply · 2 reposts · 6 likes
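As a reading aid only: a minimal sketch of one way to spatially downsample a semantic encoder's patch tokens. The average-pooling choice and the shapes are our assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

def downsample_patch_tokens(feats: torch.Tensor, grid: int, factor: int) -> torch.Tensor:
    # feats: (batch, grid*grid, dim) patch tokens from a semantic (ViT-style) encoder.
    b, n, d = feats.shape
    assert n == grid * grid, "expects a square patch grid"
    x = feats.transpose(1, 2).reshape(b, d, grid, grid)  # back to a 2D feature map
    x = F.avg_pool2d(x, kernel_size=factor)              # (b, d, grid/f, grid/f)
    return x.flatten(2).transpose(1, 2)                  # fewer, coarser tokens
```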
Transfusion combines autoregressive modeling with diffusion to train a single transformer, but what if we combine Flow with Flow? 🤔 🌊OneFlow🌊 is the first non-autoregressive model to generate text and images concurrently using a single transformer, unifying Edit Flow (text) with Flow…
7 replies · 83 reposts · 412 likes
⚙️ Last but not least, we make our training code available, enabling its use in downstream applications. Arxiv: https://t.co/98VFhYn57r Code: https://t.co/HbgGaefBKe ✔️ N/N
github.com
Official implementation for SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization. - facebookresearch/SSDD
0 replies · 1 repost · 9 likes
We highlight that diffusion decoders exhibit diversity in their reconstructions, concentrated on meaningful details of the image. Single-step distillation doesn’t hinder this diversity, keeping model behavior intact. 👇 12/N
1 reply · 0 reposts · 6 likes
High spatial downsampling matters: reducing the number of tokens improves latent generation speed. ↕️ Diffusion decoders generalize better across spatial downsampling factors, keeping similar reconstruction quality when the total compression rate is constant (worked numbers below). 👇 11/N
1 reply · 0 reposts · 7 likes
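To make the speed argument concrete (illustrative numbers, not the paper's): the token count falls quadratically with the downsampling factor f, and self-attention cost falls roughly with the square of the token count.

```python
# For a 256x256 image, each doubling of f quarters the token count.
for f in (8, 16, 32):                  # spatial downsampling factor
    tokens = (256 // f) ** 2           # 1024, 256, 64 tokens
    print(f"f={f}: {tokens} tokens, relative attention cost ~{tokens ** 2:,}")
```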
🤔 We analyze the sampling dynamics: more steps ≠ higher quality. 🔎 This behavior comes from the perceptual loss, which increases diversity until the model overshoots the distribution. 👉 We treat sampling as behavior selection, which we distill into a single-step model (see the sketch below). 👇 10/N
1 reply · 0 reposts · 6 likes
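A hedged sketch of what such single-step distillation can look like: the student matches, in one pass, what the multi-step teacher produces from the same noise and latent. The function names and the plain regression objective are our assumptions, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher_sample, z, noise, optimizer):
    # teacher_sample: runs the teacher's full multi-step sampling loop.
    with torch.no_grad():
        target = teacher_sample(noise, z)   # multi-step teacher decode
    pred = student(noise, z)                # single-step student decode
    loss = F.mse_loss(pred, target)         # match the teacher's selected behavior
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```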
To enable seamless switching between decoders, we show that we can train a single shared encoder per compression rate using a simple data augmentation. 👉 This creates a single shared latent space, where a single latent diffusion model can be paired with any decoder (see the sketch below). 👇 9/N
1 reply · 0 reposts · 6 likes
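What the shared latent space buys in practice, sketched with hypothetical model handles: sample once, then pick the decoder at deploy time.

```python
def decode_with_tradeoff(latent_dm, decoder_fast, decoder_best, n=4):
    # One sampler, one shared latent space; the decoder is a deploy-time choice.
    latents = latent_dm.sample(n)
    return decoder_fast(latents), decoder_best(latents)
```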
📊 We scale our model family from S (⚡️faster than KL-VAE) to H (achieving higher gains in high-compression settings), enabling downstream applications to choose their trade-off between decoding speed, quality, and compression rate. 👇 8/N
1 reply · 0 reposts · 5 likes
👉 Applying our decoder to image generation, we show improvements in image quality across all settings. 🔥 Additionally, the modeling capabilities of SSDD can be used to reach higher compression rates, preserving image quality while greatly reducing generation latency. 👇 7/N
1 reply · 0 reposts · 6 likes
❌ SSDD’s training method is GAN-free: we show that, unlike for existing deterministic or diffusion decoders, an adversarial loss does not bring any perceptible quality improvement. 👉 This enables easier scaling and even more stable training. 👇 6/N
1 reply · 0 reposts · 6 likes
🔥 We show that using a Flow-Matching loss (for latent modeling), LPIPS (for perceptual alignment), and REPA (for internal feature alignment), SSDD reconstructs high-quality images (a sketch of the combined objective follows). 👇 5/N
1 reply · 0 reposts · 7 likes
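A hedged sketch of how the three signals named above could combine into one objective, under common conventions: a rectified-flow interpolant for the Flow-Matching term, with `lpips_fn` and `repa_fn` as stand-ins for the perceptual and feature-alignment losses. The weights and the exact formulation are our assumptions.

```python
import torch
import torch.nn.functional as F

def decoder_loss(model, x, z, lpips_fn, repa_fn, w_lpips=1.0, w_repa=0.5):
    # x: target images (b, c, h, w); z: encoder latents conditioning the decoder.
    noise = torch.randn_like(x)
    t = torch.rand(x.size(0), 1, 1, 1, device=x.device)
    x_t = (1 - t) * noise + t * x            # rectified-flow interpolant
    v_pred, feats = model(x_t, t, z)         # predicted velocity + inner features
    loss_fm = F.mse_loss(v_pred, x - noise)  # flow-matching velocity target
    x_hat = x_t + (1 - t) * v_pred           # one-step estimate of the clean image
    loss_lpips = lpips_fn(x_hat, x)          # LPIPS: perceptual alignment
    loss_repa = repa_fn(feats, x)            # REPA: align features to a pretrained encoder
    return loss_fm + w_lpips * loss_lpips + w_repa * loss_repa
```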
🤔 Existing diffusion decoders are based on U-Nets, which lack modeling capacity. 👉 To bridge the gap between pixel-space and latent modeling, we adapt the U-ViT (from Simpler Diffusion), achieving superior reconstruction performance at a similar parameter count (a minimal U-ViT illustration follows). 👇 4/N
2 replies · 0 reposts · 9 likes
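For readers unfamiliar with U-ViT, its defining trait is long skip connections from shallow to deep transformer blocks; below is a generic, minimal illustration of that pattern, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TinyUViT(nn.Module):
    """Generic U-ViT-style backbone: transformer blocks with long skips."""

    def __init__(self, dim: int = 256, depth: int = 6, heads: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)
        )
        # One fusion layer per long skip: deep input = fuse(deep input, shallow output).
        self.skips = nn.ModuleList(nn.Linear(dim * 2, dim) for _ in range(depth // 2))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        saved = []
        half = len(self.blocks) // 2
        for blk in self.blocks[:half]:           # "down" half: save activations
            tokens = blk(tokens)
            saved.append(tokens)
        for blk, fuse in zip(self.blocks[half:], self.skips):
            tokens = fuse(torch.cat([tokens, saved.pop()], dim=-1))
            tokens = blk(tokens)                 # "up" half: fuse the long skips
        return tokens
```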
Main takeaway: 👉 We train an iterative diffusion decoder for state-of-the-art image reconstruction. ✨ We can then distill it into a ⚡️fast single-step decoder, preserving reconstruction quality and diversity. 👇 3/N
1 reply · 0 reposts · 7 likes
🔎 High compression rates are used to train fast latent diffusion models. 🤔 But the reconstruction decoder becomes the bottleneck, suppressing image details. 👉 We propose to alleviate this by explicitly modeling the distribution of the missing information (sketched below). 👇 2/N
1 reply · 0 reposts · 9 likes
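In code terms, the proposal makes decoding a sampling problem: drawing from p(image | latent) instead of computing a single deterministic output. A minimal Euler sampler over a conditional velocity field (the `velocity` network and the conventions are placeholders, not the paper's code):

```python
import torch

def sample_reconstruction(velocity, z, img_shape, steps=8):
    x = torch.randn(img_shape)                        # start from pure noise
    for i in range(steps):
        t = torch.full((img_shape[0], 1, 1, 1), i / steps)
        x = x + velocity(x, t, z) / steps             # Euler step toward the image
    return x                                          # one plausible reconstruction
```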