Yinfei Yang Profile
Yinfei Yang

@yinfeiy

Followers: 468 · Following: 129 · Media: 2 · Statuses: 65

Joined August 2010
@zhegan4
Zhe Gan
27 days
🎁🎁 We release Pico-Banana-400K, a large-scale, high-quality image editing dataset distilled from Nano-Banana across 35 editing types. 🔗 Data link: https://t.co/mi06ddf3mN 🔗 Paper link: https://t.co/AaZM02xcJr It includes 258K single-turn image editing examples and 72K multi-turn …
8
119
769
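The announcement above gives the split (258K single-turn, 72K multi-turn, 35 editing types) but no schema, so here is a purely illustrative sketch of how such editing records could be organized in Python; every class and field name is hypothetical and not taken from the actual release at the data link.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EditTurn:
    # One editing instruction applied to the current image (hypothetical schema).
    instruction: str   # e.g. "make the sky overcast"
    edit_type: str     # one of the 35 editing types mentioned in the announcement
    source_image: str  # path or URL of the input image
    edited_image: str  # path or URL of the edited result

@dataclass
class EditSession:
    # A multi-turn example chains several turns; a single-turn example has one.
    session_id: str
    turns: List[EditTurn] = field(default_factory=list)

# Single-turn example (258K of these, per the announcement).
single = EditSession("s0", [EditTurn("add a red scarf", "object_addition", "in.png", "out.png")])
# A multi-turn example (72K, per the announcement) would append further EditTurn objects
# whose source_image is the previous turn's edited_image.
```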
@yinfeiy
Yinfei Yang
26 days
Our AToken is released, please check it out!
@jiasenlu
Jiasen Lu
27 days
🚀 Our code and weights are publicly available. We provide detailed examples; try it out at https://t.co/0jE0CrSTLY
0
0
1
@zhegan4
Zhe Gan
2 months
🤔 Small GUI agents can be important for on-device deployment, but how far can such small agents go? 💡 Instead of scaling up, in Ferret-UI Lite, we focused on scaling down, and present our lessons from building small on-device GUI agents. 🔗 arXiv: https://t.co/kOdfIp2Hde
1
8
19
@lyttonhao
lyttonhao
2 months
Excited to share Manzano from the AFM team: a simple, scalable unified multimodal model for understanding and generation. Manzano shows minimal task conflict, promising scaling behavior, and state-of-the-art results among unified models. Paper link: https://t.co/HpziryrvSc
1
8
16
@jiasenlu
Jiasen Lu
2 months
Vision tokenizers are stuck in 2020 🤔 while language models revolutionized AI 🚀
Language: One tokenizer for everything
Vision: Fragmented across modalities & tasks
Introducing AToken: The first unified visual tokenizer for images, videos & 3D that does BOTH reconstruction AND …
@arankomatsuzaki
Aran Komatsuzaki
2 months
Apple presents AToken: A unified visual tokenizer
• First tokenizer unifying images, videos & 3D
• Shared 4D latent space (preserves both reconstruction & semantics)
• Strong across gen & understanding tasks (ImageNet 82.2%, MSRVTT 32.6%, 3D acc 90.9%)
6
72
376
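The two posts above describe AToken as one tokenizer that maps images, videos, and 3D assets into a shared 4D latent space serving both reconstruction and semantic tasks. The snippet below is only a conceptual mock-up of that interface: the class, the method names, the placeholder linear layers, and the assumption that 3D inputs are packed into the same (T, H, W, C) layout are all mine, not the released code at the link above.

```python
import torch

class UnifiedVisualTokenizer(torch.nn.Module):
    """Conceptual sketch of a single tokenizer shared across visual modalities.

    Inputs are packed as (B, T, H, W, C): an image is T=1, a video is T=frames,
    and a 3D asset is assumed here to be rendered/voxelized into the same layout.
    """

    def __init__(self, latent_dim: int = 256):
        super().__init__()
        # Placeholder networks standing in for the real transformer encoder/decoder.
        self.encoder = torch.nn.LazyLinear(latent_dim)
        self.decoder = torch.nn.LazyLinear(3)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # (B, T, H, W, C) -> shared 4D latent grid (B, T, H, W, latent_dim).
        return self.encoder(x)

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        # Reconstruction head; a probe on z could serve semantic/understanding tasks.
        return self.decoder(z)

tok = UnifiedVisualTokenizer()
image = torch.randn(1, 1, 224, 224, 3)   # image as a T=1 clip
video = torch.randn(1, 8, 224, 224, 3)   # 8-frame clip
z_img, z_vid = tok.encode(image), tok.encode(video)
recon = tok.decode(z_img)                # same latent supports reconstruction
```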
@edaxberger
Erik Daxberger
8 months
Check out our new work on exploring 3D Spatial Understanding with Multimodal LLMs! 🚀
📤 CA-VQA: A fine-tuning dataset and benchmark w/ various input signals and spatial tasks.
🤖 MM-Spatial: A generalist MLLM excelling at spatial reasoning.
🔗 https://t.co/2sByAvJJ0j 🧵 (1/n)
3
4
11
@gm8xx8
𝚐π”ͺ𝟾𝚑𝚑𝟾
8 months
DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation
4
38
195
@zy27962986
Zongyu Lin
11 months
🚀🚀🚀 Want to develop a cutting-edge video generation model towards Sora? Please dive into Apple’s latest recipe and studies for scalable video generation models 🔥🔥🔥. In this work, we aim to provide a transparent and detailed recipe 📖 for model architecture, training …
9
46
131
@yinfeiy
Yinfei Yang
11 months
🚀🚀🚀 Check out our latest text- and image-conditioned video generation model, STIV. Congrats to @zy27962986 and @WeiLiu19963978 on the great work.
@zy27962986
Zongyu Lin
11 months
🚀🚀🚀 Want to develop a cutting-edge video generation model towards Sora? Please dive into Apple’s latest recipe and studies for scalable video generation models 🔥🔥🔥. In this work, we aim to provide a transparent and detailed recipe 📖 for model architecture, training …
0
0
9
@RuohongZhang
Ruohong Zhang
1 year
[p1] Improve Visual Language Model Chain-of-thought Reasoning
Paper link: https://t.co/eUnlisUsv5
Project page (to be updated upon approval on release): https://t.co/LpAYt6k8yQ
Content:
1. We distill 193K CoT data
2. Train with SFT
3. DPO to further improve performance
3
38
215
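Step 3 in the recipe above is Direct Preference Optimization applied on top of the SFT model. The function below is a minimal, generic sketch of the standard DPO objective over per-response summed log-probabilities; it assumes those log-probs were already computed with the policy and a frozen SFT reference model, and it is not the paper's actual training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: prefer the chosen CoT response over the rejected one.

    Each argument is a (batch,) tensor of summed token log-probabilities for the
    full response under either the policy or the frozen SFT reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(margin): pushes the policy to widen the chosen/rejected margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy check with random log-probs for a batch of 4 preference pairs.
b = 4
loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
print(loss.item())
```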
@zhegan4
Zhe Gan
1 year
💡 CLIP is the default choice for most multimodal LLM research. But we know CLIP is not perfect: it is good at high-level semantics, but not at capturing fine-grained info. 🤩🤩 We present CLOC ⏰, our next-generation image encoder with enhanced localization capabilities, and …
15
154
930
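The thread above motivates CLOC as adding region-level localization signal on top of a CLIP-style encoder. Purely as a generic illustration of region-text contrastive training (not CLOC's actual objective or architecture), region embeddings can be pooled from the image feature map with boxes and matched against region-caption embeddings:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def region_text_contrastive(feat_map, boxes, region_text_emb, temperature=0.07):
    """Generic region-text contrastive loss (illustrative, not CLOC's exact recipe).

    feat_map:        (1, C, H, W) image feature map from a vision encoder
    boxes:           (K, 4) region boxes in feature-map coordinates (x1, y1, x2, y2)
    region_text_emb: (K, C) text embeddings of the K region captions
    """
    # Pool one embedding per box from the feature map.
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)       # prepend batch index
    region_emb = roi_align(feat_map, rois, output_size=1).flatten(1)   # (K, C)

    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(region_text_emb, dim=-1)
    logits = region_emb @ text_emb.t() / temperature                   # (K, K)
    targets = torch.arange(len(boxes))
    # Symmetric InfoNCE over matched region/caption pairs.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

feat_map = torch.randn(1, 512, 24, 24)
boxes = torch.tensor([[0., 0., 8., 8.], [10., 10., 20., 20.]])
loss = region_text_contrastive(feat_map, boxes, torch.randn(2, 512))
```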
@yinfeiy
Yinfei Yang
1 year
Excited to share the MM1.5 work from our team. Thanks to the team for the great work 🚀🚀🚀
@zhegan4
Zhe Gan
1 year
🚀🚀 Thrilled to share MM1.5! MM1.5 is a significant upgrade of MM1. With one single set of weights, MM1.5 excels at (1) reading your charts, tables, and other text-rich images, (2) understanding visual prompts like points and boxes and providing grounded outputs, and (3) multi-image reasoning.
0
10
30
@yinfeiy
Yinfei Yang
2 years
🚀🚀 Ferret-v2 is here. Check out the latest work from our group: a new design for a referring and grounding MLLM with any-resolution input, significantly improving performance over the original Ferret and other baselines. Work led by @HaotianZhang4AI @XyouH @zhegan4
@zhegan4
Zhe Gan
2 years
🌟 Besides Ferret-UI, we also upgrade Ferret to Ferret-v2 for natural images. Several design choices were made along the way: (1) SPHINX-like any-res for referring and grounding, (2) a CLIP encoder for the global low-res image and a DINOv2 encoder for sub-images, (3) high-res dense alignment before the final SFT.
1
6
23
@HaotianZhang4AI
Haotian Zhang
2 years
🚀🚀🚀 Introducing Ferret-v2, a significant upgrade to Ferret that enhances its detailed visual perception ability. With features like any-resolution referring & grounding, multi-granularity visual encoding, and a three-stage training paradigm, Ferret-v2 sets a new standard.
@zhegan4
Zhe Gan
2 years
🌟 Besides Ferret-UI, we also upgrade Ferret to Ferret-v2 for natural images. Several design choices were made along the way: (1) SPHINX-like any-res for referring and grounding, (2) a CLIP encoder for the global low-res image and a DINOv2 encoder for sub-images, (3) high-res dense alignment before the final SFT.
2
5
22
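The design notes quoted above (any-resolution input, a CLIP encoder for the global low-res image, a DINOv2 encoder for sub-image tiles) can be roughly sketched as the two-path encoding below. The tiling scheme, the encoder stand-ins, and the 336-pixel tile size are placeholders of mine, not the released Ferret-v2 implementation.

```python
import torch
import torch.nn.functional as F

def split_into_tiles(image: torch.Tensor, tile: int = 336) -> torch.Tensor:
    """Cut a high-res image (C, H, W) into non-overlapping tiles (N, C, tile, tile)."""
    c, h, w = image.shape
    image = image[:, : h // tile * tile, : w // tile * tile]   # drop the ragged border
    tiles = image.unfold(1, tile, tile).unfold(2, tile, tile)  # (C, nh, nw, tile, tile)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, tile, tile)

def encode_any_res(image, global_encoder, tile_encoder, low_res=336):
    """Two-path any-resolution encoding (illustrative sketch only).

    global_encoder: CLIP-like model run on a low-res resize of the whole image.
    tile_encoder:   DINOv2-like model run on each high-res sub-image tile.
    """
    low = F.interpolate(image[None], size=(low_res, low_res))[0]
    global_tokens = global_encoder(low[None])                  # global context features
    tile_tokens = tile_encoder(split_into_tiles(image))        # fine-grained tile features
    return global_tokens, tile_tokens

# Placeholder "encoders" so the sketch runs end to end.
clip_like = lambda x: F.adaptive_avg_pool2d(x, 1).flatten(1)
dino_like = lambda x: F.adaptive_avg_pool2d(x, 1).flatten(1)
g, t = encode_any_res(torch.randn(3, 1008, 672), clip_like, dino_like)
```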
@_akhaliq
AK
2 years
Apple presents Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models. While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it poses certain …
3
58
379
@_akhaliq
AK
2 years
Apple presents Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with …
31
385
2K
@mckbrando
Brandon McKinzie
2 years
Thrilled to share MM1! The MM1 series of models is competitive with Gemini 1 at each of the respective model sizes. Beyond just announcing a new series of models, we also share the ablation results that guided our research process (🧵).
12
90
393
@yinfeiy
Yinfei Yang
2 years
VeCLIP led by @JeffLaiZF, @HaotianZhang4AI, and @bowen_zhang.
MOFI led by Wentao Wu and Aleksei Timofeev.
Models are trained using AXLearn (https://t.co/JbpCUIpjDH).
0
0
1
@yinfeiy
Yinfei Yang
2 years
MOFI is designed as another foundation embedding model, good for image-to-image search, and achieves SOTA performance on the GPR1200 benchmark. It is learned from large-scale, noisy, entity-annotated images and will be presented at ICLR 2024.
0
0
3
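MOFI is described above as a foundation embedding model for image-to-image search; the helper below is a generic nearest-neighbour lookup over L2-normalized embeddings, assuming an embedding model (MOFI or any stand-in) has already produced the vectors offline.

```python
import torch
import torch.nn.functional as F

def topk_neighbors(query_emb: torch.Tensor, index_emb: torch.Tensor, k: int = 5):
    """Return indices of the k most similar gallery images by cosine similarity.

    query_emb: (D,) embedding of the query image
    index_emb: (N, D) embeddings of the gallery, computed offline by the embedding model
    """
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(index_emb, dim=-1)
    scores = g @ q                      # (N,) cosine similarities
    return scores.topk(k).indices

gallery = torch.randn(10_000, 768)      # toy gallery embeddings
query = torch.randn(768)
print(topk_neighbors(query, gallery, k=5))
```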
@yinfeiy
Yinfei Yang
2 years
It is complementary to other well-curated image-text datasets, and when combining both, it achieves SoTA-level performance on both text-image retrieval and ImageNet classification (83.07%) benchmarks, providing yet another choice for downstream tasks.
0
0
3