Michael Tschannen Profile
Michael Tschannen

@mtschannen

Followers: 3K
Following: 673
Media: 23
Statuses: 245

Research Scientist @GoogleDeepMind. Representation learning for multimodal understanding and generation. Personal account.

Zurich, Switzerland
Joined June 2012
@mtschannen
Michael Tschannen
8 months
Have you ever wondered how to train an autoregressive generative transformer on text and raw pixels, without a pretrained visual tokenizer (e.g. VQ-VAE)? We have been pondering this over the summer and developed a new model: JetFormer 🌊🤖. A thread 👇 1/
15 replies · 142 reposts · 839 likes
@mtschannen
Michael Tschannen
2 months
📢 We just released the code for JetFormer at [link]. Enjoy!
@mtschannen
Michael Tschannen
3 months
RT @MarioLucic_: Massive advancements in video understanding with Gemini 2.5! ✨ Unlock new capabilities to process hours of video, summariz…
0 replies · 2 reposts · 0 likes
@mtschannen
Michael Tschannen
3 months
We are presenting JetFormer at ICLR this morning, poster #190. Stop by if you’re interested in unified multimodal architectures!
@mtschannen
Michael Tschannen
4 months
Bert is reading a free-form interpretation of
0 replies · 1 repost · 9 likes
@mtschannen
Michael Tschannen
4 months
4o native image generation is confirmed to be some sort of autoregressive model. Maybe this is a good moment for AR skeptics to catch up on the recent literature on multimodal AR models.
7 replies · 35 reposts · 268 likes
@mtschannen
Michael Tschannen
4 months
RT @GoogleDeepMind: Think you know Gemini? 🤔 Think again. Meet Gemini 2.5: our most intelligent model 💡 The first release is Pro Experimen…
0 replies · 520 reposts · 0 likes
@mtschannen
Michael Tschannen
5 months
RT @osanseviero: I’m so happy to announce Gemma 3 is out! 🚀 🌏 Understands over 140 languages. 👀 Multimodal with image and video input. 🤯 LMAren…
0 replies · 444 reposts · 0 likes
@mtschannen
Michael Tschannen
5 months
RT @andrefaraujo: Excited to release a super capable family of image-text models from our TIPS #ICLR2025 paper! We…
Link: github.com (google-deepmind/tips)
0 replies · 6 reposts · 0 likes
@mtschannen
Michael Tschannen
5 months
RT @giffmana: SigLIP2 + Aya = really good multilingual VLMs!
0 replies · 25 reposts · 0 likes
@mtschannen
Michael Tschannen
5 months
RT @visheratin: SigLIP2 is indeed a better encoder than SigLIP! Over the last two weekends, I trained a new SOTA multilingual model - mexma…
0 replies · 20 reposts · 0 likes
@mtschannen
Michael Tschannen
5 months
RT @borisdayma: Caught up on the new SigLip 2 paper 🤓. Cool things I learnt from it:
- finally using patch size of 16 with size multiple o…
0 replies · 8 reposts · 0 likes
@mtschannen
Michael Tschannen
5 months
HF model collection for transformers:
HF model collection for OpenCLIP and timm:
And of course big_vision checkpoints:
Link: github.com (google-research/big_vision: official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more)
0 replies · 1 repost · 6 likes
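For readers who want to try the checkpoints pointed to above, here is a minimal sketch of loading a SigLIP 2 model through the Hugging Face transformers library and scoring image-text pairs. The checkpoint name google/siglip2-base-patch16-224 and the file cat.jpg are illustrative assumptions, not taken from the tweets.

# Minimal sketch: zero-shot image-text scoring with a SigLIP 2 checkpoint
# loaded via Hugging Face transformers (checkpoint name is an assumption).
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-base-patch16-224"  # assumed checkpoint name
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("cat.jpg")  # any local image
texts = ["a photo of a cat", "a photo of a dog"]

# SigLIP text towers expect max_length padding.
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# SigLIP scores pairs with a sigmoid over the image-text logits,
# giving independent match probabilities rather than a softmax distribution.
probs = torch.sigmoid(outputs.logits_per_image)
print(probs)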
@mtschannen
Michael Tschannen
5 months
Paper:
HF blog post from @ariG23498 et al. with a gentle intro to the training recipe and a demo:
Thread with results overview from @XiaohuaZhai:
Link: huggingface.co
@XiaohuaZhai
Xiaohua Zhai
5 months
Introducing SigLIP 2: now trained with additional captioning and self-supervised losses! Stronger everywhere:
- multilingual
- cls. / ret.
- localization
- OCR
- captioning / VQA
Try it out, backward compatible! Models: Paper:
1 reply · 1 repost · 9 likes
@mtschannen
Michael Tschannen
5 months
📢2⃣ Yesterday we released SigLIP 2! TL;DR: Improved high-level semantics, localization, dense features, and multilingual capabilities as a drop-in replacement for v1. Bonus: variants supporting native aspect ratio and variable sequence length. A thread with interesting resources 👇
5 replies · 34 reposts · 169 likes
@mtschannen
Michael Tschannen
5 months
RT @_akhaliq: Google just dropped SigLIP 2. Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and D…
0 replies · 68 reposts · 0 likes
@mtschannen
Michael Tschannen
5 months
RT @ariG23498: Did @Google just release a better version of SigLIP? SigLIP 2 is out on Hugging Face! A new family of multilingual vision-…
0 replies · 20 reposts · 0 likes
@mtschannen
Michael Tschannen
5 months
RT @mervenoyann: SigLIP 2 is the most powerful image-text encoder. You can use it to do:
> image-to-image search
> text-to-image search
> im…
0 replies · 79 reposts · 0 likes
@mtschannen
Michael Tschannen
5 months
RT @iScienceLuvr: This is exciting. SigLIP was one of the best, highly performant image encoders used for a variety of applications, so…
0 replies · 24 reposts · 0 likes
@mtschannen
Michael Tschannen
5 months
RT @arankomatsuzaki: Google presents: SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization,…
0 replies · 67 reposts · 0 likes
@mtschannen
Michael Tschannen
5 months
RT @googleaidevs: PaliGemma 2 mix is an upgraded vision-language model that supports image captioning, OCR, image Q&A, object detection, an…
0 replies · 110 reposts · 0 likes