Ferjad Naeem
@ferjadnaeem
Followers: 944 · Following: 1K · Media: 11 · Statuses: 363
Research Scientist @Google
Zürich, Switzerland
Joined May 2010
First Gemini release with a piece of my work inside 😄, alongside countless other amazing people 🚀🚀🚀
Our first release is Gemini 3 Pro, which is rolling out globally starting today. It significantly outperforms 2.5 Pro across the board:
🥇 Tops LMArena and WebDev @arena leaderboards
🧠 PhD-level reasoning on Humanity’s Last Exam
📋 Leads long-horizon planning on Vending-Bench 2
1 reply · 0 reposts · 6 likes
🚀 Excited to share our new work RefAM: Attention Magnets for Zero-Shot Referral Segmentation, a training-free approach that turns diffusion model attentions into segmentations. By @anna_kukleva_, me, Alessio Tonioni, @ferjadnaeem, @fedassa, @janericlenssen, Bernt Schiele
1 reply · 3 reposts · 7 likes
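For context on the general recipe (a rough, training-free sketch only, not RefAM's actual pipeline): one common way to turn a diffusion model's cross-attention over image patches into a segmentation mask is to reshape the attention weights onto the patch grid, upsample, normalize, and threshold. The grid size, normalization, and threshold below are assumptions.

```python
import torch
import torch.nn.functional as F

def attention_to_mask(attn, image_size, threshold=0.5):
    """Turn a cross-attention map into a binary segmentation mask.

    attn: (num_patches,) attention weights of one text token over image
          patches, assumed to come from a square patch grid.
    image_size: (H, W) of the target mask.
    threshold: cutoff applied after min-max normalization (hypothetical value).
    """
    grid = int(attn.numel() ** 0.5)                    # e.g. a 64x64 patch grid
    attn_2d = attn.reshape(1, 1, grid, grid)           # add batch/channel dims
    attn_up = F.interpolate(attn_2d, size=image_size,  # upsample to pixel grid
                            mode="bilinear", align_corners=False)
    attn_up = (attn_up - attn_up.min()) / (attn_up.max() - attn_up.min() + 1e-8)
    return attn_up.squeeze() > threshold               # boolean mask (H, W)

# Random weights stand in for real diffusion attention here.
mask = attention_to_mask(torch.rand(64 * 64), image_size=(512, 512))
print(mask.shape, mask.float().mean())
```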
🎺Meet VIST3A — Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator. ➡️ Paper: https://t.co/sFqbbUiGOO ➡️ Website: https://t.co/QWMLwXyVcB Collaboration between ETH & Google with Hyojun Go, @DNarnhofer, Goutam Bhat, @fedassa, and Konrad Schindler.
Want to leverage the power of SOTA 3D models like VGGT & Video LDMs for 3D generation? Now you can! 🚀 Introducing VIST3A — we stitch pretrained video generators to 3D foundation models and align them via reward finetuning. 📄 https://t.co/MctMyuDev4 🌐
2 replies · 11 reposts · 88 likes
Oguzhan is an amazing mentor to work with; apply if you are on the internship market.
🚨 Research Internship opportunity at Apple. We’re looking for interns to push the limits of multimodal AI agents! 📍 Santa Clara Valley 🇺🇸 & Zurich 🇨🇭 🗓️ Start: ASAP. Send CV + representative work to mint-agent-internship@group.apple.com Also apply:
0 replies · 0 reposts · 7 likes
A big congratulations to the whole Gemini team on pushing this amazing family of models out 😄 Our tech report is out now: https://t.co/5FfTM1LEdN Feels a bit unreal to share the contributors list with all the amazing colleagues
Hot Gemini updates off the press. 🚀 Anyone can now use 2.5 Flash and Pro to build and scale production-ready AI applications. 🙌 We’re also launching 2.5 Flash-Lite in preview: the fastest model in the 2.5 family to respond to requests, with the lowest cost too. 🧵
0 replies · 1 repost · 11 likes
Active Data Curation Effectively Distills Large-Scale Multimodal Models
- compute per-sample loss with a large batch
- only backprop (probabilistically) through samples with high loss; intuition: these are the samples where there is “something to learn”
- if both teacher and …
0 replies · 4 reposts · 13 likes
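A minimal sketch of the loss-based selection summarized above, assuming a generic classifier with a cross-entropy loss rather than the paper's actual contrastive distillation setup; `keep_frac` is a hypothetical hyperparameter.

```python
import torch
import torch.nn.functional as F

def high_loss_backprop_step(model, optimizer, images, labels, keep_frac=0.25):
    """One training step that backprops only through high-loss samples."""
    with torch.no_grad():                                  # cheap scoring pass
        logits = model(images)
        losses = F.cross_entropy(logits, labels, reduction="none")

    k = max(1, int(keep_frac * len(losses)))
    probs = losses / losses.sum()                          # higher loss -> more likely to be kept
    idx = torch.multinomial(probs, k, replacement=False)   # probabilistic selection

    optimizer.zero_grad()
    selected_logits = model(images[idx])                   # gradient pass on the subset only
    loss = F.cross_entropy(selected_logits, labels[idx])
    loss.backward()
    optimizer.step()
    return loss.item()
```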
Stop by this amazing work from Vishaal and the team today at CVPR
Our ACID paper shows how you can use active data curation as an effective way to pretrain super-strong smol and efficient VL-encoders. Poster #361 in the Poster Hall from 10:30 AM - 12:30 PM on Saturday, 14th June https://t.co/LiKMgruXmP
0 replies · 1 repost · 9 likes
timm's got a new vision transformer (NaFlexVit), and it's flexible! I've been plugging away at this for a bit, integrating ideas from FlexiViT, NaViT, and NaFlex, and I'm finally ready to merge for initial exploration. The model supports:
* variable aspect/size images of NaFlex (see …
5 replies · 38 reposts · 234 likes
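To poke at these variants once they land in a timm release, one option (searching the model registry rather than hard-coding a model id, since the exact registered names aren't guaranteed here) is:

```python
import timm

# Discover NaFlex-style ViT variants shipped with the installed timm version.
candidates = timm.list_models("*naflex*")
print(candidates)

if candidates:
    # Instantiate the first match without pretrained weights just to inspect it.
    model = timm.create_model(candidates[0], pretrained=False)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{type(model).__name__}: {n_params / 1e6:.1f}M parameters")
```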
At #GoogleIO, we shared how decades of AI research have now become reality. From a total reimagining of Search to Agent Mode, Veo 3 and more, Gemini season will be the most exciting era of AI yet. Some highlights 🧵
269 replies · 2K reposts · 14K likes
📢 We just released the code for JetFormer at https://t.co/Wgiz3tK9S8 Enjoy!
Have you ever wondered how to train an autoregressive generative transformer on text and raw pixels, without a pretrained visual tokenizer (e.g. VQ-VAE)? We have been pondering this during summer and developed a new model: JetFormer 🌊🤖 https://t.co/ngvPzZvUYW A thread 👇 1/
5 replies · 61 reposts · 310 likes
We are presenting JetFormer at ICLR this morning, poster #190. Stop by if you’re interested in unified multimodal architectures!
6 replies · 31 reposts · 226 likes
Google's global PhD Fellowship program will open for applications this week! (on Apr 10th) This supports PhD students in computer science and related fields, also connecting to a Google mentor. Learn more and apply at: https://t.co/ynVQDf5xLi (deadline: May 15th, 2025)
0 replies · 1 repost · 5 likes
Check out the strongest open-source dense prediction models from our colleagues!
📢📢 We released checkpoints and Pytorch/Jax code for TIPS: https://t.co/0JUIRML8gr Paper updated with distilled models, and more: https://t.co/zebYMD0VFz
#ICLR2025
0 replies · 0 reposts · 5 likes
The majority of features in this layer of Siglip-2 are multimodal. I'd expected some multimodality but was surprised that two-thirds of the neurons I tested bind together their visual and linguistic features. This neuron fires for images of mustaches and for the word "mustache"
8 replies · 25 reposts · 295 likes
Fully supportive of this. The Machine Learning / Computer Vision review process is broken with irresponsible reviewers. Glad to see there is some accountability.
#CVPR2025 Area Chairs (ACs) identified a number of highly irresponsible reviewers: those who either abandoned the review process entirely or submitted egregiously low-quality reviews, including some generated by large language models (LLMs). 1/2
0 replies · 0 reposts · 5 likes
Delighted to share that ACED has been accepted at CVPR 2025! Check out our work to learn how to distill the strongest smol-sized image-text contrastive models.
Check out our latest work that explores data curation as a paradigm to learn compute-efficient image-text contrastive models. Had a blast collaborating across Google, DeepMind, Tübingen, and Cambridge on this work.
0 replies · 0 reposts · 16 likes
📢2⃣ Yesterday we released SigLIP 2! TL;DR: Improved high-level semantics, localization, dense features, and multilingual capabilities via drop-in replacement for v1. Bonus: Variants supporting native aspect and variable sequence length. A thread with interesting resources👇
5 replies · 34 reposts · 171 likes
Excited to share what we have been up to in image-text embedding models. SigLIP 2 is the most powerful encoder for most open-vocabulary computer vision and MMLLM tasks. Checkpoints are open-sourced and we look forward to what the community achieves with these.
Introducing SigLIP2: now trained with additional captioning and self-supervised losses! Stronger everywhere:
- multilingual
- cls. / ret.
- localization
- ocr
- captioning / vqa
Try it out, backward compatible! Models: https://t.co/3hOdqcy9QD Paper: https://t.co/Tp4D8Syld8
2 replies · 4 reposts · 48 likes
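A minimal zero-shot classification sketch using the Hugging Face transformers API; the checkpoint id below is an assumption, so check the hub for the exact SigLIP 2 model names.

```python
import torch
import requests
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Checkpoint name is an assumption; verify the released SigLIP 2 ids on the hub.
ckpt = "google/siglip2-base-patch16-224"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of two cats", "a photo of a dog"]

# SigLIP-style models are trained with short, padded text prompts.
inputs = processor(text=texts, images=image, padding="max_length",
                   max_length=64, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Sigmoid (not softmax) matches the SigLIP training objective.
probs = torch.sigmoid(outputs.logits_per_image)
print(dict(zip(texts, probs[0].tolist())))
```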
Excited to see our paper "Tokenformer: Rethinking transformer scaling with tokenized model parameters" accepted as a spotlight at #ICLR2025! Hope our idea of tokenizing everything can inspire the future of AI. Paper: https://t.co/0ofQpsudSH Code: https://t.co/D3rhZzqMwD
3 replies · 45 reposts · 270 likes
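The core idea, as I understand it, is to replace fixed weight matrices with attention between input tokens and learnable parameter tokens, so scaling up amounts to appending more parameter tokens. A simplified sketch with illustrative dimensions and plain softmax normalization, not the paper's exact Pattention layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParameterAttention(nn.Module):
    """A linear projection replaced by attention over learnable
    key/value parameter tokens (Tokenformer-style, simplified)."""

    def __init__(self, dim_in, dim_out, num_param_tokens=64):
        super().__init__()
        self.param_keys = nn.Parameter(torch.randn(num_param_tokens, dim_in) * 0.02)
        self.param_values = nn.Parameter(torch.randn(num_param_tokens, dim_out) * 0.02)

    def forward(self, x):                        # x: (batch, seq, dim_in)
        scores = x @ self.param_keys.t()         # attend to parameter tokens
        weights = F.softmax(scores / x.shape[-1] ** 0.5, dim=-1)
        return weights @ self.param_values       # (batch, seq, dim_out)

# Scaling then means adding parameter tokens rather than reshaping weight matrices.
layer = ParameterAttention(dim_in=256, dim_out=512, num_param_tokens=64)
out = layer(torch.randn(2, 10, 256))
print(out.shape)  # torch.Size([2, 10, 512])
```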
Ever thought of training multimodal models with 100 billion 🚀 unique examples? Check out WebLI-100B! The study reveals exciting insights on long-tail tasks, including multilingual and cultural diversity benchmarks. Paper: https://t.co/npNrvPGY53
4 replies · 18 reposts · 159 likes