Yinfei Yang Profile

Yinfei Yang (@yinfeiy)
Followers: 440 · Following: 119 · Media: 2 · Statuses: 60
Joined August 2010
@yinfeiy
Yinfei Yang
5 months
RT @edaxberger: Check out our new work on exploring 3D Spatial Understanding with Multimodal LLMs!🚀. 📀CA-VQA: A fine-tuning dataset and ben….
0
4
0
@yinfeiy
Yinfei Yang
5 months
RT @gm8xx8: DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation. 
0
38
0
@yinfeiy
Yinfei Yang
8 months
RT @zy27962986: 🚀🚀🚀Want to develop a cutting-edge video generation model towards Sora? Please dive into Apple’s latest recipe and studies f….
0
46
0
@yinfeiy
Yinfei Yang
8 months
🚀🚀🚀 Check out our latest text- and image-conditioned video generation model STIV. Congrats to @zy27962986 and @WeiLiu19963978 on the great work.
@zy27962986
Johnson Lin
8 months
🚀🚀🚀 Want to develop a cutting-edge video generation model towards Sora? Please dive into Apple’s latest recipe and studies for scalable video generation models 🔥🔥🔥. In this work, we aim at providing a transparent and detailed recipe 📖 for model architecture, training…
0
1
8
@yinfeiy
Yinfei Yang
10 months
RT @RuohongZhang: [p1] Improve Visual Language Model Chain-of-thought Reasoning. paper link: project page (to be u….
0
38
0
@yinfeiy
Yinfei Yang
11 months
RT @zhegan4: 💡CLIP is the default choice for most multimodal LLM research. But, we know CLIP is not perfect. It is good at high-level seman….
0
156
0
@yinfeiy
Yinfei Yang
11 months
Excited to share the MM1.5 work from our team. Thanks to the team for the great work 🚀🚀🚀.
@zhegan4
Zhe Gan
11 months
🚀🚀 Thrilled to share MM1.5! MM1.5 is a significant upgrade of MM1. With one single set of weights, MM1.5 excels at (1) reading your charts, tables, and other text-rich images, (2) understanding visual prompts like points and boxes and providing grounded outputs, and (3) multi-image reasoning.
0
10
30
@yinfeiy
Yinfei Yang
1 year
🚀🚀 Ferret-v2 is here. Check out the latest work from our group: a new design of a referring and grounding MLLM with any-resolution input, with significantly improved performance over the original Ferret and other baselines. Work led by @HaotianZhang4AI, @XyouH, and @zhegan4.
@zhegan4
Zhe Gan
1 year
🌟 Besides Ferret-UI, we also upgrade Ferret to Ferret-v2 for natural images. Several design choices were made along the way: (1) SPHINX-like any-resolution processing for referring and grounding, (2) a CLIP encoder for the global low-res image and a DINOv2 encoder for the sub-images, (3) high-res dense alignment before the final SFT.
1
6
23
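For intuition, the dual-encoder, any-resolution design described in the quoted tweet above can be pictured roughly as in the sketch below. This is a hedged illustration under stated assumptions, not the released Ferret-v2 code; AnyResVisualTokens, the encoder objects, and their out_dim attributes are hypothetical stand-ins.

# Illustrative sketch only (not the Ferret-v2 implementation): combine a global
# low-resolution CLIP view with DINOv2 features of high-resolution sub-image crops.
import torch
import torch.nn as nn

class AnyResVisualTokens(nn.Module):
    def __init__(self, clip_encoder, dino_encoder, llm_dim=4096):
        super().__init__()
        # clip_encoder / dino_encoder are assumed to return patch tokens of shape
        # (batch, num_tokens, out_dim); out_dim is a hypothetical attribute here.
        self.clip_encoder = clip_encoder
        self.dino_encoder = dino_encoder
        self.clip_proj = nn.Linear(clip_encoder.out_dim, llm_dim)
        self.dino_proj = nn.Linear(dino_encoder.out_dim, llm_dim)

    def forward(self, global_image, sub_images):
        # Global view: the whole image resized to the CLIP input resolution.
        global_tokens = self.clip_proj(self.clip_encoder(global_image))
        # Local views: the high-resolution image split into a grid of crops.
        local_tokens = [self.dino_proj(self.dino_encoder(crop)) for crop in sub_images]
        # Concatenate all visual tokens before passing them to the LLM.
        return torch.cat([global_tokens, *local_tokens], dim=1)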
@yinfeiy
Yinfei Yang
1 year
RT @HaotianZhang4AI: 🚀🚀🚀 Introducing Ferret-v2, a significant upgrade to Ferret that enhances its detailed visual perception ability. With….
0
5
0
@yinfeiy
Yinfei Yang
1 year
RT @_akhaliq: Apple presents Ferret-v2. An Improved Baseline for Referring and Grounding with Large Language Models. While Ferret seamlessl….
0
58
0
@yinfeiy
Yinfei Yang
1 year
RT @_akhaliq: Apple presents Ferret-UI. Grounded Mobile UI Understanding with Multimodal LLMs. Recent advancements in multimodal large lang….
0
389
0
@yinfeiy
Yinfei Yang
1 year
RT @mckbrando: Thrilled to share MM1!. The MM1 series of models are competitive with Gemini 1 at each of their respective model sizes. Beyo….
0
90
0
@yinfeiy
Yinfei Yang
1 year
VeCLIP was led by @JeffLaiZF, @HaotianZhang4AI, and @bowen_zhang. MOFI was led by Wentao Wu and Aleksei Timofeev. Models are trained using AXLearn.
0
0
1
@yinfeiy
Yinfei Yang
1 year
MOFI is designed as another foundation embedding model, well suited for image-to-image search, and achieves SOTA performance on the GPR1200 benchmark. It is learned from large-scale noisy entity-annotated images. It will be presented at ICLR 2024.
0
0
3
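To illustrate the image-to-image search use case mentioned above, a minimal retrieval loop with a frozen embedding model could look like the sketch below. This is an assumption-laden example, not MOFI code; embed is a hypothetical callable that maps an image to a single vector.

# Hedged sketch: nearest-neighbor image search with cosine similarity.
import numpy as np

def nearest_images(query_image, gallery_images, embed, top_k=5):
    q = np.asarray(embed(query_image), dtype=np.float32)            # (d,) query embedding
    gallery = np.stack([embed(im) for im in gallery_images])        # (N, d) gallery embeddings
    q = q / (np.linalg.norm(q) + 1e-8)
    gallery = gallery / (np.linalg.norm(gallery, axis=1, keepdims=True) + 1e-8)
    scores = gallery @ q                                            # cosine similarities
    return np.argsort(-scores)[:top_k]                              # indices of the closest images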
@yinfeiy
Yinfei Yang
1 year
It is complementary to other well-curated image-text datasets and achieves SoTA-level performance on both text-image retrieval and ImageNet classification (83.07%) benchmarks when the two are combined, providing yet another choice for downstream tasks.
0
0
3
@yinfeiy
Yinfei Yang
1 year
VeCLIP introduces detailed, visually enriched captions from LLaVA and Vicuna, incorporating the meta information from each image's original caption.
1
0
3
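As a rough picture of such a visual-enriched recaptioning step: an MLLM (e.g., LLaVA) describes the image, and an LLM (e.g., Vicuna) fuses that description with the original alt-text. The function names and prompt below are hypothetical placeholders, not the VeCLIP release API.

# Hedged sketch of a visual-enriched recaptioning step; describe_image and
# rewrite_text are hypothetical callables wrapping an MLLM and an LLM.
def enrich_caption(image, original_alt_text, describe_image, rewrite_text):
    visual_caption = describe_image(image)  # detailed description from the MLLM
    prompt = (
        "Rewrite the two texts into one fluent caption, keeping named entities "
        "and meta information from the original.\n"
        f"Original: {original_alt_text}\nVisual description: {visual_caption}"
    )
    return rewrite_text(prompt)  # fused, visually enriched caption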
@yinfeiy
Yinfei Yang
1 year
🚀🚀 Excited to share that we have released two new model families for image and text embeddings. 1. VeCLIP (GitHub: , arXiv: ). 2. MOFI (GitHub: , arXiv: ).
4
22
81
@yinfeiy
Yinfei Yang
1 year
RT @zhegan4: 🚀🚀 Excited to release code & ckpt for our new image encoders. 1. VeCLIP: 83.1% 0-shot on ImgNet with….
0
34
0
@yinfeiy
Yinfei Yang
2 years
It is finally here! Thanks @WilliamWangNLP. Check out our work on leveraging multimodal LLMs for image editing, also with Tsu-Jui Fu, @wenzehu, @Phyyysalis, and @zhegan4.
@WilliamWangNLP
William Wang
2 years
🤩 Apple open-sources MGIE! Now one can take random pictures with an iPhone and edit them with language! Guiding Instruction-based Image Editing via Multimodal Large Language Models, #ICLR2024 spotlight: . Apple repo: . Gradio: .
0
3
10
@yinfeiy
Yinfei Yang
2 years
RT @zhegan4: 🎁🎁 Ferret is a multimodal LLM that is able to refer and ground, and is now open-sourced. Find out our code and checkpoints bel….
[Link card: github.com/apple/ml-ferret]
0
20
0