
Enrico Fini
@DonkeyShot21
Followers: 1K · Following: 785 · Media: 30 · Statuses: 229
Member of Technical Staff @MicrosoftAI | Previously RS @Apple MLR, Intern @MetaAI & @amazon | AIMv2, solo-learn, continual pre-training
Zurich, Switzerland
Joined July 2011
We release AIMv2, the second iteration of the AIM family of large autoregressive vision encoders. This time we bring multimodality into the game 🔥 Paper: https://t.co/YpU6T8Pr9p Repo: https://t.co/g1LO5rE5Y0 Model Gallery: https://t.co/j3jZ8TEtf5
6 replies · 36 reposts · 169 likes
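For readers who want to try the encoders, a minimal sketch of feature extraction with the transformers library follows. The checkpoint ID and the trust_remote_code flag are assumptions about how the checkpoints are published on the Hub, not details taken from the tweet.

```python
# Minimal sketch: extracting image features with an AIMv2 checkpoint from the
# Hugging Face Hub. The checkpoint ID and the need for trust_remote_code are
# assumptions about how the models are published, not details from the tweet.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

CKPT = "apple/aimv2-large-patch14-224"  # assumed checkpoint ID

processor = AutoImageProcessor.from_pretrained(CKPT, trust_remote_code=True)
model = AutoModel.from_pretrained(CKPT, trust_remote_code=True)
model.eval()

image = Image.open("cat.jpg").convert("RGB")  # any local image file
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Patch-level features, e.g. for a downstream VQA or retrieval head
# (assuming a standard BaseModelOutput is returned).
print(outputs.last_hidden_state.shape)  # (1, num_patches, hidden_dim)
```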
Super excited to share l3m 🚀, a library for training large multimodal models, which we used to build AIM and AIMv2. Massive thanks to @alaa_nouby @DonkeyShot21 Michal Klein @MustafaShukor1 @jmsusskind and many others.
1 reply · 16 reposts · 53 likes
Microsoft AI is hiring early-in-career talent. If you’ve published at top conferences such as ICLR, NeurIPS, etc., are working on pre-training, post-training, or multimodal, and want to build the world’s most advanced frontier models, DM me and let’s chat! #hiring #microsoftai
2 replies · 1 repost · 11 likes
Introducing MAI-Voice-1 - the most expressive, natural voice generation model I've ever used (might be a bit biased) - super efficient, generating a minute of audio in <1 second on a single GPU - live now in Copilot Daily + Podcasts. Try it in Copilot Labs too:
copilot.microsoft.com
Explore Copilot Labs - Microsoft's hub for experimental AI. Try bold AI experiments, co-create with the community, and help shape the future of Copilot
10 replies · 11 reposts · 173 likes
🚨Text Leaderboard Update: A new model provider, @MicrosoftAI, has broken into the Top 15 this week! 💠MAI-1-preview by @MicrosoftAI debuts at #13. Congrats to the Microsoft AI team! As the Text Arena is one of the most competitive races, breaking into the Top 15 is no small feat.
Introducing MAI-1-preview - our first foundation model trained end-to-end in house - in public testing on LMArena - we’re excited to be actively spinning the flywheel to deliver improved models
19 replies · 44 reposts · 309 likes
Our work on scaling laws for multimodal models and MoEs got an Oral at ICCV. Check it out!
We release a large-scale study to answer the following: - Is late fusion inherently better than early fusion for multimodal models? - How do native multimodal models scale compared to LLMs? - Can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities? 🧵
2 replies · 21 reposts · 141 likes
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed; we can then extrapolate to large-scale ones. These laws allow… 1/n 🧵
6 replies · 49 reposts · 267 likes
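To make the fit-small-then-extrapolate workflow concrete, here is a toy sketch that fits a generic saturating power law to hypothetical small-scale runs. The paper's actual laws are richer (they also condition on the data-mixture weights), so treat this only as an illustration of the mechanics.

```python
# Toy illustration of the fit-small / extrapolate-large workflow behind
# scaling laws. A generic power law L(C) = E + a * C**(-alpha) stands in for
# the paper's richer form; all data points below are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def power_law(compute, E, a, alpha):
    return E + a * compute ** (-alpha)

# Hypothetical (compute, loss) pairs from small-scale runs.
compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19])
loss = np.array([3.10, 2.91, 2.74, 2.60, 2.49])

params, _ = curve_fit(power_law, compute, loss, p0=(2.0, 50.0, 0.1), maxfev=10_000)
E, a, alpha = params
print(f"irreducible loss E={E:.3f}, a={a:.3f}, alpha={alpha:.3f}")

# Extrapolate to a compute budget two orders of magnitude beyond the fits.
print(f"predicted loss at 1e21 FLOPs: {power_law(1e21, *params):.3f}")
```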
Career update: today I joined Microsoft AI in Zurich 🇨🇭 as a Member of Technical Staff. I’m going to miss my friends and colleagues at Apple MLR, but I’m excited for this new opportunity. LFG 🚀
6 replies · 2 reposts · 215 likes
Apple just broke the scaling laws for image models. Imagine creating Ghibli art, but 10x faster.
21 replies · 58 reposts · 863 likes
Apple just dropped Scaling Laws for Native Multimodal Models
10 replies · 54 reposts · 274 likes
Training and scaling large multimodal models from scratch? This is the thread for you. In this new paper, we provide an extensive study with hundreds of runs, fitting scaling laws for early/late fusion models, MoEs, and exploring different data mixtures. Tons of cool findings.
We release a large-scale study to answer the following: - Is late fusion inherently better than early fusion for multimodal models? - How do native multimodal models scale compared to LLMs? - Can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities? 🧵
3 replies · 11 reposts · 94 likes
Scaling Laws for Native Multimodal Models - Early fusion exhibits stronger performance at lower parameter counts, is more efficient to train, and is easier to deploy compared with late fusion. - Incorporating MoEs allows for models that learn modality-specific weights, significantly…
4 replies · 78 reposts · 461 likes
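A minimal sketch of the two designs under comparison may help: early fusion feeds raw patch embeddings and text tokens to a single trunk, while late fusion routes patches through a separate vision encoder first. Dimensions and module choices below are illustrative, not the paper's configurations.

```python
# Minimal sketch contrasting early and late fusion for multimodal models.
# All sizes are illustrative toy values.
import torch
import torch.nn as nn

D, VOCAB, PATCH_DIM = 256, 32_000, 768

class EarlyFusion(nn.Module):
    """One transformer consumes image patches and text tokens in a single sequence."""
    def __init__(self):
        super().__init__()
        self.text_embed = nn.Embedding(VOCAB, D)
        self.patch_embed = nn.Linear(PATCH_DIM, D)  # raw patches -> tokens, no encoder
        layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, text_ids, patches):
        seq = torch.cat([self.patch_embed(patches), self.text_embed(text_ids)], dim=1)
        return self.trunk(seq)

class LateFusion(nn.Module):
    """A separate vision encoder processes patches before they reach the trunk."""
    def __init__(self):
        super().__init__()
        self.text_embed = nn.Embedding(VOCAB, D)
        vision_layer = nn.TransformerEncoderLayer(PATCH_DIM, nhead=8, batch_first=True)
        self.vision_encoder = nn.TransformerEncoder(vision_layer, num_layers=2)
        self.projector = nn.Linear(PATCH_DIM, D)  # adapter into the trunk's space
        layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, text_ids, patches):
        vision_tokens = self.projector(self.vision_encoder(patches))
        seq = torch.cat([vision_tokens, self.text_embed(text_ids)], dim=1)
        return self.trunk(seq)

text_ids = torch.randint(0, VOCAB, (1, 16))
patches = torch.randn(1, 49, PATCH_DIM)
print(EarlyFusion()(text_ids, patches).shape)  # (1, 65, 256)
print(LateFusion()(text_ids, patches).shape)   # (1, 65, 256)
```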
Excited to share that we have recently released the source code for FlexTok, bringing a fresh perspective to tokenization. Code on GitHub: https://t.co/ApWNbE2ZO6. Project Page: https://t.co/MlDKYAfSLz
#FlexTok #Tokenization #MachineLearning #MLResearch #OpenSource #AI
0 replies · 7 reposts · 37 likes
We are having a "Where is Molmo? Where is Qwen?" moment in computer vision
Can visual SSL match CLIP on VQA? Yes! We show with controlled experiments that visual SSL can be competitive even on OCR/Chart VQA, as demonstrated by our new Web-SSL model family (1B-7B params) which is trained purely on web images – without any language supervision.
0 replies · 2 reposts · 30 likes
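For context, Web-SSL sits in the self-distillation family of language-free visual SSL methods; a schematic DINO-style objective is sketched below. The temperatures and the centering step are illustrative, and this is not the Web-SSL training recipe.

```python
# Schematic DINO-style self-distillation loss, the family of language-free
# visual SSL objectives that Web-SSL builds on. Temperatures and centering
# are illustrative; this is not the Web-SSL recipe.
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    # Teacher targets: centered and sharpened, no gradient flows through them.
    teacher_probs = F.softmax((teacher_out - center) / t_t, dim=-1).detach()
    student_logp = F.log_softmax(student_out / t_s, dim=-1)
    # Cross-entropy between teacher and student distributions over projections.
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

student_out = torch.randn(8, 256, requires_grad=True)  # projections of view 1
teacher_out = torch.randn(8, 256)                      # projections of view 2
center = teacher_out.mean(dim=0, keepdim=True)         # a running average in practice
print(dino_loss(student_out, teacher_out, center))
```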
FlexTok is a pretty novel dynamic-length image tokenizer. I will be speedrunning training one today at 8:30 AM EST (roughly 3 hours from now) at https://t.co/XNJ9147oCB
13 replies · 34 reposts · 425 likes
Happy to see the Ovis2 multimodal LLMs leveraging our AIMv2 encoders and achieving impressive results, congrats to the team at @AI_AlibabaInt!
Ovis2-34B has achieved remarkable results on the multimodal leaderboards! 🏆 #1 in open-source MLLMs - Multimodal Reasoning (47.9) 📊 #2 in open-source MLLMs - Academic (76.5) Thanks to everyone who contributed to this achievement🎉 #AI #MachineLearning #MLLM
0 replies · 3 reposts · 27 likes
Check out what the team has been cooking! 🍳🔥 Awesome work led by @roman__bachmann @JRAllardice @dmizrahi_
Have you ever been bothered by the constraints of fixed-sized 2D-grid tokenizers? We present FlexTok, a flexible-length 1D tokenizer that enables autoregressive models to describe images in a coarse-to-fine manner. https://t.co/17oJKymhPl
https://t.co/5vSqDxjwFN 🧵 1/n
0 replies · 0 reposts · 8 likes
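A toy stand-in for the coarse-to-fine idea: an ordered token bottleneck trained with nested-dropout-style truncation, so that any prefix of the token sequence decodes to a reconstruction, finer with more tokens. This is a conceptual sketch, not the FlexTok architecture.

```python
# Toy sketch of a flexible-length 1D tokenizer: an ordered register-token
# bottleneck where any prefix of the tokens decodes to an image, finer with
# more tokens. Shapes and modules are illustrative, not FlexTok's.
import torch
import torch.nn as nn

class OrderedTokenBottleneck(nn.Module):
    def __init__(self, num_tokens=32, dim=64, img_dim=3 * 32 * 32):
        super().__init__()
        self.num_tokens, self.dim = num_tokens, dim
        self.encoder = nn.Linear(img_dim, num_tokens * dim)
        self.decoder = nn.Linear(num_tokens * dim, img_dim)

    def forward(self, images, keep=None):
        tokens = self.encoder(images).view(-1, self.num_tokens, self.dim)
        # Nested-dropout-style truncation: keep only the first `keep` tokens,
        # zeroing the rest, so early tokens must carry the coarse content.
        if keep is None:
            keep = torch.randint(1, self.num_tokens + 1, (1,)).item()
        mask = torch.zeros_like(tokens)
        mask[:, :keep] = 1.0
        return self.decoder((tokens * mask).flatten(1))

model = OrderedTokenBottleneck()
x = torch.randn(4, 3 * 32 * 32)
coarse = model(x, keep=4)   # 4 tokens: coarse reconstruction
fine = model(x, keep=32)    # all tokens: finest reconstruction
print(coarse.shape, fine.shape)
```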
Reading "Distilling Knowledge in a Neural Network" left me fascinated and wondering: "If I want a small, capable model, should I distill from a more powerful model, or train from scratch?" Our distillation scaling law shows, well, it's complicated... 🧵 https://t.co/b1uuyJwzRF
arxiv.org
We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks...
12 replies · 150 reposts · 1K likes
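For reference, the temperature-scaled objective from the Hinton et al. paper that prompted the question; the distillation scaling law studies how student performance behaves under this kind of setup. The temperature and loss weighting below are illustrative choices.

```python
# Minimal sketch of the temperature-scaled distillation loss from Hinton et
# al.'s "Distilling the Knowledge in a Neural Network". T and alpha are
# illustrative hyperparameters, not values from the scaling-law paper.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL between temperature-softened teacher and student
    # distributions, scaled by T**2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 100, requires_grad=True)
teacher_logits = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```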