Yinfei Yang
@yinfeiy
Followers
468
Following
129
Media
2
Statuses
65
We release Pico-Banana-400K, a large-scale, high-quality image editing dataset distilled from Nano-Banana across 35 editing types. Data link: https://t.co/mi06ddf3mN Paper link: https://t.co/AaZM02xcJr It includes 258K single-turn image editing examples, 72K multi-turn…
8
119
769
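For readers who want to poke at a dataset like the one announced above, a minimal loading sketch follows; the Hugging Face repo id, config name, and field layout are assumptions for illustration, not the official release format (follow the data link above for the real instructions).
```python
# Hypothetical sketch: iterating over an image-editing dataset with
# single-turn and multi-turn subsets. The repo id, config name, and
# field names are illustrative assumptions, not the official schema.
from datasets import load_dataset

# Assumed repo/config names purely for illustration.
single_turn = load_dataset("apple/pico-banana-400k", "single_turn", split="train")

for example in single_turn.select(range(3)):
    # An edit example is expected to pair a source image, an edit
    # instruction, and the edited target image.
    print(example.keys())
```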
Our AToken is released, please check it out!
Our code and weights are publicly available. We provide detailed examples; try it out at https://t.co/0jE0CrSTLY
0
0
1
Small GUI agents can be important for on-device deployment, but how far can such small agents go? Instead of scaling up, in Ferret-UI Lite we focus on scaling down and present our lessons from building small on-device GUI agents. arXiv: https://t.co/kOdfIp2Hde
1
8
19
Excited to share Manzano from the AFM team: a simple, scalable unified multimodal model for understanding and generation. Manzano shows minimal task conflict, promising scaling behavior, and state-of-the-art results among unified models. Paper link: https://t.co/HpziryrvSc
1
8
16
Vision tokenizers are stuck in 2020 while language models revolutionized AI. Language: one tokenizer for everything. Vision: fragmented across modalities & tasks. Introducing AToken: the first unified visual tokenizer for images, videos & 3D that does BOTH reconstruction AND…
Apple presents AToken: A unified visual tokenizer • First tokenizer unifying images, videos & 3D • Shared 4D latent space (preserves both reconstruction & semantics) • Strong across gen & understanding tasks (ImageNet 82.2%, MSRVTT 32.6%, 3D acc 90.9%)
6
72
376
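As a rough mental model of what "one tokenizer for images, videos & 3D with a shared 4D latent space" could look like at the interface level, here is an illustrative sketch; the class, shapes, and stand-in layers are assumptions and do not reflect the actual AToken architecture.
```python
# Illustrative interface sketch of a unified visual tokenizer with a shared
# 4D latent layout (time, height, width, channels). Class/method names,
# shapes, and the placeholder layers are assumptions, not AToken itself.
import torch
import torch.nn as nn

class UnifiedVisualTokenizer(nn.Module):
    def __init__(self, latent_dim: int = 16):
        super().__init__()
        # Placeholder encoder/decoder; a real model would use a shared
        # backbone over patchified inputs from each modality.
        self.encoder = nn.LazyLinear(latent_dim)
        self.decoder = nn.LazyLinear(3)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, height, width, channels). Images use time=1,
        # videos use time=T, and 3D assets can be rendered/voxelized into
        # the same 4D layout so all modalities share one latent space.
        return self.encoder(x)

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        # Reconstruction head; a semantic head could read the same latents
        # for understanding tasks (the "reconstruction AND semantics" point
        # highlighted in the posts above).
        return self.decoder(z)

tok = UnifiedVisualTokenizer()
image = torch.randn(1, 1, 256, 256, 3)   # single frame => image
video = torch.randn(1, 8, 256, 256, 3)   # 8 frames => video
print(tok.encode(image).shape, tok.encode(video).shape)
```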
Check out our new work on exploring 3D Spatial Understanding with Multimodal LLMs! CA-VQA: a fine-tuning dataset and benchmark w/ various input signals and spatial tasks. MM-Spatial: a generalist MLLM excelling at spatial reasoning. https://t.co/2sByAvJJ0j (1/n)
3
4
11
DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation
4
38
195
Want to develop a cutting-edge video generation model towards Sora? Please dive into Apple's latest recipe and studies for scalable video generation models. In this work, we aim to provide a transparent and detailed recipe for model architecture, training…
9
46
131
Check out our latest text-image conditioned video gen model STIV. Congrats @zy27962986 and @WeiLiu19963978 on the great work.
Want to develop a cutting-edge video generation model towards Sora? Please dive into Apple's latest recipe and studies for scalable video generation models. In this work, we aim to provide a transparent and detailed recipe for model architecture, training…
0
0
9
[p1] Improve Visual Language Model Chain-of-thought Reasoning. Paper link: https://t.co/eUnlisUsv5 Project page (to be updated upon approval on release): https://t.co/LpAYt6k8yQ Content: 1. We distill 193K CoT examples 2. Train with SFT 3. DPO to further improve performance
3
38
215
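The three-step recipe above (distill CoT data, SFT, then DPO) has roughly the following shape in code; the model and dataset ids are hypothetical, and TRL's trainer arguments vary across versions, so treat this as a sketch of the pipeline rather than the paper's training setup.
```python
# Hedged sketch of the two-stage recipe described above: SFT on distilled
# CoT data, then DPO to further improve reasoning. Dataset/model ids and
# trainer arguments are illustrative; TRL's API differs between versions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import SFTTrainer, DPOTrainer

model_name = "my-org/vlm-base"  # hypothetical base model id
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stage 1: supervised fine-tuning on distilled chain-of-thought traces.
cot_sft = load_dataset("my-org/cot-sft-193k", split="train")       # hypothetical
sft_trainer = SFTTrainer(model=model, train_dataset=cot_sft)
sft_trainer.train()

# Stage 2: DPO on preference pairs (chosen vs. rejected CoT answers),
# i.e. a dataset with prompt / chosen / rejected columns.
cot_prefs = load_dataset("my-org/cot-preferences", split="train")  # hypothetical
dpo_trainer = DPOTrainer(model=sft_trainer.model, train_dataset=cot_prefs)
dpo_trainer.train()
```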
CLIP is the default choice for most multimodal LLM research. But we know CLIP is not perfect: it is good at high-level semantics, but not at capturing fine-grained info. We present CLOC, our next-generation image encoder with enhanced localization capabilities, and…
15
154
930
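To make the "enhanced localization" idea concrete, here is a hedged sketch of adding a region-level contrastive term alongside the usual global image-text contrastive loss; the loss form and weighting are illustrative, not the CLOC objective.
```python
# Hedged sketch of the general idea behind a localization-enhanced CLIP-style
# encoder: keep the global image-text contrastive loss, and additionally align
# pooled region features with region-level text. Concept illustration only.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # Symmetric InfoNCE over matched pairs (row i of `a` matches row i of `b`).
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Global alignment: image embeddings vs. caption embeddings.
img_emb, txt_emb = torch.randn(8, 512), torch.randn(8, 512)
loss_global = info_nce(img_emb, txt_emb)

# Region alignment: features pooled inside boxes vs. region phrase embeddings.
region_emb, phrase_emb = torch.randn(32, 512), torch.randn(32, 512)
loss_region = info_nce(region_emb, phrase_emb)

loss = loss_global + loss_region   # the relative weighting is a design choice
```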
Excited to share the MM1.5 work from our team. Thanks to the team for the great work!
Thrilled to share MM1.5! MM1.5 is a significant upgrade of MM1. With one single set of weights, MM1.5 excels at (1) reading your charts, tables, and other text-rich images, (2) understanding visual prompts like points and boxes and providing grounded outputs, and (3) multi-image reasoning.
0
10
30
Ferret-v2 is here. Check out the latest work from our group: a new design of referring and grounding MLLM with any-resolution input. Significantly improved performance over the original Ferret and other baselines. Work led by @HaotianZhang4AI @XyouH @zhegan4
Besides Ferret-UI, we also upgrade Ferret to Ferret-v2 for natural images. Several design choices were made along the way: (1) SPHINX-like any-res for referring and grounding; (2) a CLIP encoder for the global low-res image, a DINOv2 encoder for sub-images; (3) high-res dense alignment before the final SFT.
1
6
23
Introducing Ferret-v2, a significant upgrade to Ferret that enhances its detailed visual perception ability. With features like any-resolution referring & grounding, multi-granularity visual encoding, and a three-stage training paradigm, Ferret-v2 sets a new standard.
Besides Ferret-UI, we also upgrade Ferret to Ferret-v2 for natural images. Several design choices were made along the way: (1) SPHINX-like any-res for referring and grounding; (2) a CLIP encoder for the global low-res image, a DINOv2 encoder for sub-images; (3) high-res dense alignment before the final SFT.
2
5
22
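A minimal sketch of the multi-granularity encoding idea from the quoted design notes (a global low-res view through one encoder, high-res sub-image tiles through another, all visual tokens concatenated for the LLM); the stand-in encoders and tiling details are assumptions, not the Ferret-v2 code.
```python
# Sketch of multi-granularity any-resolution encoding: one encoder sees a
# resized global view, another sees high-res sub-image tiles, and the token
# sequences are concatenated. Encoders here are simple stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEncoder(nn.Module):
    """Stand-in vision encoder: patchify and linearly project to tokens."""
    def __init__(self, patch: int = 14, dim: int = 1024):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x).flatten(2).transpose(1, 2)   # (B, num_tokens, dim)

clip_like = PatchEncoder(dim=1024)   # global, semantic view
dino_like = PatchEncoder(dim=1024)   # fine-grained sub-image view

def encode_any_res(image: torch.Tensor, grid: int = 2) -> torch.Tensor:
    # image: (1, 3, H, W) at its native (high) resolution.
    global_view = F.interpolate(image, size=(336, 336), mode="bilinear")
    global_tokens = clip_like(global_view)

    # Split the high-res image into a grid of sub-images for the second encoder.
    tiles = [t for row in image.chunk(grid, dim=2) for t in row.chunk(grid, dim=3)]
    tile_tokens = [dino_like(F.interpolate(t, size=(336, 336), mode="bilinear")) for t in tiles]
    return torch.cat([global_tokens, *tile_tokens], dim=1)

tokens = encode_any_res(torch.randn(1, 3, 672, 672))
print(tokens.shape)   # global tokens followed by the tiles' tokens
```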
Apple presents Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models. While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it poses certain…
3
58
379
Apple presents Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with…
31
385
2K
Thrilled to share MM1! The MM1 series of models is competitive with Gemini 1 at each of their respective model sizes. Beyond just announcing a new series of models, we also share the ablation results that guided our research process.
12
90
393
VeCLIP led by @JeffLaiZF, @HaotianZhang4AI, and @bowen_zhang; MOFI led by Wentao Wu and Aleksei Timofeev. Models are trained using AXLearn (https://t.co/JbpCUIpjDH).
0
0
1
MOFI is designed as another foundation embedding model, good for image-to-image search, and achieves SOTA performance on the GPR1200 benchmark. It is learned from large-scale, noisy, entity-annotated images and will be presented at ICLR 2024.
0
0
3
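For context on the image-to-image search setting MOFI targets, a simple embedding-based retrieval sketch follows; the random embeddings below are placeholders for whatever the encoder would actually produce.
```python
# Simple illustration of image-to-image search with an embedding model:
# embed a gallery of images, then rank the gallery by cosine similarity
# to a query embedding. Placeholder embeddings stand in for the encoder.
import torch
import torch.nn.functional as F

def rank_gallery(query_emb: torch.Tensor, gallery_embs: torch.Tensor, top_k: int = 5):
    # Normalize so that the dot product equals cosine similarity.
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_embs, dim=-1)
    scores = g @ q                      # (num_gallery,)
    top = torch.topk(scores, k=top_k)
    return top.indices.tolist(), top.values.tolist()

gallery = torch.randn(1000, 768)        # stand-in gallery embeddings
query = torch.randn(768)                # stand-in query embedding
indices, scores = rank_gallery(query, gallery)
print(indices, scores)
```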
It is complementary to other well-curated image-text datasets, and achieves SoTA-level performance on both text-image retrieval and ImageNet classification (83.07%) benchmarks when combining both, providing yet another choice for downstream tasks.
0
0
3