Yinfei Yang
@yinfeiy
Followers
468
Following
129
Media
2
Statuses
65
We release Pico-Banana-400K, a large-scale, high-quality image editing dataset distilled from Nano-Banana across 35 editing types. Data link: https://t.co/mi06ddf3mN Paper link: https://t.co/AaZM02xcJr It includes 258K single-turn image editing examples, 72K multi-turn…
8
119
769
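For readers who want to poke at a dataset like the one announced above, a minimal loading sketch follows; the Hugging Face repo id, config name, and field layout are assumptions for illustration, not the official release format (follow the data link above for the real instructions).
```python
# Hypothetical sketch: iterating over an image-editing dataset with
# single-turn and multi-turn subsets. The repo id, config name, and
# field names are illustrative assumptions, not the official schema.
from datasets import load_dataset

# Assumed repo/config names purely for illustration.
single_turn = load_dataset("apple/pico-banana-400k", "single_turn", split="train")

for example in single_turn.select(range(3)):
    # An edit example is expected to pair a source image, an edit
    # instruction, and the edited target image.
    print(example.keys())
```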
Our AToken is released, please check it out!
Our code and weights are publicly available. We provide detailed examples; try it out at https://t.co/0jE0CrSTLY
0
0
1
Small GUI agents can be important for on-device deployment, but how far can such small agents go? Instead of scaling up, in Ferret-UI Lite we focus on scaling down and present our lessons from building small on-device GUI agents. arXiv: https://t.co/kOdfIp2Hde
1
8
19
Excited to share Manzano from the AFM team: a simple, scalable unified multimodal model for understanding and generation. Manzano shows minimal task conflict, promising scaling behavior, and state-of-the-art results among unified models. Paper link: https://t.co/HpziryrvSc
1
8
16
Vision tokenizers are stuck in 2020 while language models revolutionized AI. Language: one tokenizer for everything. Vision: fragmented across modalities & tasks. Introducing AToken: the first unified visual tokenizer for images, videos & 3D that does BOTH reconstruction AND…
Apple presents AToken: A unified visual tokenizer • First tokenizer unifying images, videos & 3D • Shared 4D latent space (preserves both reconstruction & semantics) • Strong across gen & understanding tasks (ImageNet 82.2%, MSRVTT 32.6%, 3D acc 90.9%)
6
72
376
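As a rough mental model of what "one tokenizer for images, videos & 3D with a shared 4D latent space" could look like at the interface level, here is an illustrative sketch; the class, shapes, and stand-in layers are assumptions and do not reflect the actual AToken architecture.
```python
# Illustrative interface sketch of a unified visual tokenizer with a shared
# 4D latent layout (time, height, width, channels). Class/method names,
# shapes, and the placeholder layers are assumptions, not AToken itself.
import torch
import torch.nn as nn

class UnifiedVisualTokenizer(nn.Module):
    def __init__(self, latent_dim: int = 16):
        super().__init__()
        # Placeholder encoder/decoder; a real model would use a shared
        # backbone over patchified inputs from each modality.
        self.encoder = nn.LazyLinear(latent_dim)
        self.decoder = nn.LazyLinear(3)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, height, width, channels). Images use time=1,
        # videos use time=T, and 3D assets can be rendered/voxelized into
        # the same 4D layout so all modalities share one latent space.
        return self.encoder(x)

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        # Reconstruction head; a semantic head could read the same latents
        # for understanding tasks (the "reconstruction AND semantics" point
        # highlighted in the posts above).
        return self.decoder(z)

tok = UnifiedVisualTokenizer()
image = torch.randn(1, 1, 256, 256, 3)   # single frame => image
video = torch.randn(1, 8, 256, 256, 3)   # 8 frames => video
print(tok.encode(image).shape, tok.encode(video).shape)
```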
Check out our new work on exploring 3D Spatial Understanding with Multimodal LLMs! CA-VQA: a fine-tuning dataset and benchmark w/ various input signals and spatial tasks. MM-Spatial: a generalist MLLM excelling at spatial reasoning. https://t.co/2sByAvJJ0j (1/n)
3
4
11
DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation
4
38
195
Want to develop a cutting-edge video generation model towards Sora? Please dive into Apple's latest recipe and studies for scalable video generation models. In this work, we aim to provide a transparent and detailed recipe for model architecture, training…
9
46
131
Check out our latest text-image conditioned video gen model STIV. Congrats @zy27962986 and @WeiLiu19963978 on the great work.
Want to develop a cutting-edge video generation model towards Sora? Please dive into Apple's latest recipe and studies for scalable video generation models. In this work, we aim to provide a transparent and detailed recipe for model architecture, training…
0
0
9
[p1] Improve Visual Language Model Chain-of-thought Reasoning. Paper link: https://t.co/eUnlisUsv5 Project page (to be updated upon approval on release): https://t.co/LpAYt6k8yQ Content: 1. We distill 193K CoT examples 2. Train with SFT 3. DPO to further improve performance
3
38
215
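The three-step recipe above (distill CoT data, SFT, then DPO) has roughly the following shape in code; the model and dataset ids are hypothetical, and TRL's trainer arguments vary across versions, so treat this as a sketch of the pipeline rather than the paper's training setup.
```python
# Hedged sketch of the two-stage recipe described above: SFT on distilled
# CoT data, then DPO to further improve reasoning. Dataset/model ids and
# trainer arguments are illustrative; TRL's API differs between versions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import SFTTrainer, DPOTrainer

model_name = "my-org/vlm-base"  # hypothetical base model id
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stage 1: supervised fine-tuning on distilled chain-of-thought traces.
cot_sft = load_dataset("my-org/cot-sft-193k", split="train")       # hypothetical
sft_trainer = SFTTrainer(model=model, train_dataset=cot_sft)
sft_trainer.train()

# Stage 2: DPO on preference pairs (chosen vs. rejected CoT answers),
# i.e. a dataset with prompt / chosen / rejected columns.
cot_prefs = load_dataset("my-org/cot-preferences", split="train")  # hypothetical
dpo_trainer = DPOTrainer(model=sft_trainer.model, train_dataset=cot_prefs)
dpo_trainer.train()
```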
CLIP is the default choice for most multimodal LLM research. But we know CLIP is not perfect: it is good at high-level semantics, but not at capturing fine-grained info. We present CLOC, our next-generation image encoder with enhanced localization capabilities, and…
15
154
930
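To make the "enhanced localization" idea concrete, here is a hedged sketch of adding a region-level contrastive term alongside the usual global image-text contrastive loss; the loss form and weighting are illustrative, not the CLOC objective.
```python
# Hedged sketch of the general idea behind a localization-enhanced CLIP-style
# encoder: keep the global image-text contrastive loss, and additionally align
# pooled region features with region-level text. Concept illustration only.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # Symmetric InfoNCE over matched pairs (row i of `a` matches row i of `b`).
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Global alignment: image embeddings vs. caption embeddings.
img_emb, txt_emb = torch.randn(8, 512), torch.randn(8, 512)
loss_global = info_nce(img_emb, txt_emb)

# Region alignment: features pooled inside boxes vs. region phrase embeddings.
region_emb, phrase_emb = torch.randn(32, 512), torch.randn(32, 512)
loss_region = info_nce(region_emb, phrase_emb)

loss = loss_global + loss_region   # the relative weighting is a design choice
```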
Excited to share the MM1.5 work from our team. Thanks to the team for the great work!
Thrilled to share MM1.5! MM1.5 is a significant upgrade of MM1. With one single set of weights, MM1.5 excels at (1) reading your charts, tables, and other text-rich images, (2) understanding visual prompts like points and boxes and providing grounded outputs, and (3) multi-image reasoning.
0
10
30
Ferret-v2 is here. Check out the latest work from our group: a new design of referring and grounding MLLM with any-resolution input. Significantly improved performance over the original Ferret and other baselines. Work led by @HaotianZhang4AI @XyouH @zhegan4
Besides Ferret-UI, we also upgrade Ferret to Ferret-v2 for natural images. Several design choices were made along the way: (1) SPHINX-like any-res for referring and grounding; (2) a CLIP encoder for the global low-res image, a DINOv2 encoder for sub-images; (3) high-res dense alignment before the final SFT.
1
6
23
Introducing Ferret-v2, a significant upgrade to Ferret that enhances its detailed visual perception ability. With features like any-resolution referring & grounding, multi-granularity visual encoding, and a three-stage training paradigm, Ferret-v2 sets a new standard.
Besides Ferret-UI, we also upgrade Ferret to Ferret-v2 for natural images. Several design choices were made along the way: (1) SPHINX-like any-res for referring and grounding; (2) a CLIP encoder for the global low-res image, a DINOv2 encoder for sub-images; (3) high-res dense alignment before the final SFT.
2
5
22
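A minimal sketch of the multi-granularity encoding idea from the quoted design notes (a global low-res view through one encoder, high-res sub-image tiles through another, all visual tokens concatenated for the LLM); the stand-in encoders and tiling details are assumptions, not the Ferret-v2 code.
```python
# Sketch of multi-granularity any-resolution encoding: one encoder sees a
# resized global view, another sees high-res sub-image tiles, and the token
# sequences are concatenated. Encoders here are simple stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEncoder(nn.Module):
    """Stand-in vision encoder: patchify and linearly project to tokens."""
    def __init__(self, patch: int = 14, dim: int = 1024):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x).flatten(2).transpose(1, 2)   # (B, num_tokens, dim)

clip_like = PatchEncoder(dim=1024)   # global, semantic view
dino_like = PatchEncoder(dim=1024)   # fine-grained sub-image view

def encode_any_res(image: torch.Tensor, grid: int = 2) -> torch.Tensor:
    # image: (1, 3, H, W) at its native (high) resolution.
    global_view = F.interpolate(image, size=(336, 336), mode="bilinear")
    global_tokens = clip_like(global_view)

    # Split the high-res image into a grid of sub-images for the second encoder.
    tiles = [t for row in image.chunk(grid, dim=2) for t in row.chunk(grid, dim=3)]
    tile_tokens = [dino_like(F.interpolate(t, size=(336, 336), mode="bilinear")) for t in tiles]
    return torch.cat([global_tokens, *tile_tokens], dim=1)

tokens = encode_any_res(torch.randn(1, 3, 672, 672))
print(tokens.shape)   # global tokens followed by the tiles' tokens
```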
Apple presents Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models. While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it poses certain…
3
58
379
Apple presents Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with…
31
385
2K
Thrilled to share MM1! The MM1 series of models is competitive with Gemini 1 at each of their respective model sizes. Beyond just announcing a new series of models, we also share the ablation results that guided our research process.
12
90
393
VeCLIP led by @JeffLaiZF, @HaotianZhang4AI, and @bowen_zhang; MOFI led by Wentao Wu and Aleksei Timofeev. Models are trained using AXLearn (https://t.co/JbpCUIpjDH).
0
0
1
MOFI is designed as another foundation embedding model, good for image-to-image search, and achieves SOTA performance on the GPR1200 benchmark. It is learned from large-scale, noisy, entity-annotated images and will be presented at ICLR 2024.
0
0
3
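For context on the image-to-image search setting MOFI targets, a simple embedding-based retrieval sketch follows; the random embeddings below are placeholders for whatever the encoder would actually produce.
```python
# Simple illustration of image-to-image search with an embedding model:
# embed a gallery of images, then rank the gallery by cosine similarity
# to a query embedding. Placeholder embeddings stand in for the encoder.
import torch
import torch.nn.functional as F

def rank_gallery(query_emb: torch.Tensor, gallery_embs: torch.Tensor, top_k: int = 5):
    # Normalize so that the dot product equals cosine similarity.
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_embs, dim=-1)
    scores = g @ q                      # (num_gallery,)
    top = torch.topk(scores, k=top_k)
    return top.indices.tolist(), top.values.tolist()

gallery = torch.randn(1000, 768)        # stand-in gallery embeddings
query = torch.randn(768)                # stand-in query embedding
indices, scores = rank_gallery(query, gallery)
print(indices, scores)
```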
It is complementary to other well-curated image-text datasets, and achieves SoTA-level performance on both text-image retrieval and ImageNet classification (83.07%) benchmarks when combining both, providing yet another choice for downstream tasks.
0
0
3