
Yinfei Yang
@yinfeiy
Followers: 440 · Following: 119 · Media: 2 · Statuses: 60 · Joined August 2010
RT @edaxberger: Check out our new work on exploring 3D Spatial Understanding with Multimodal LLMs! 🚀 📀 CA-VQA: A fine-tuning dataset and ben…
RT @zy27962986: 🚀🚀🚀Want to develop a cutting-edge video generation model towards Sora? Please dive into Apple’s latest recipe and studies f….
🚀🚀🚀 Check out our latest text- and image-conditioned video generation model, STIV. Congrats to @zy27962986 and @WeiLiu19963978 on the great work.
🚀🚀🚀Want to develop a cutting-edge video generation model towards Sora? Please dive into Apple’s latest recipe and studies for scalable video generation models🔥🔥🔥. In this work, we aim at providing a transparent and detailed recipe 📖 for model architecture, training
RT @RuohongZhang: [p1] Improve Visual Language Model Chain-of-thought Reasoning. paper link: project page (to be u….
Excited to share the MM1.5 work from our team. Thanks to the team for the great work 🚀🚀🚀.
🚀🚀 Thrilled to share MM1.5! MM1.5 is a significant upgrade of MM1. With one single set of weights, MM1.5 excels at (1) reading your charts, tables, and other text-rich images, (2) understanding visual prompts like points and boxes and providing grounded outputs, and (3) multi-image reasoning.
🚀🚀 Ferret-v2 is here. Check out the latest work from our group: a new design for a referring-and-grounding MLLM with any-resolution input, with significantly improved performance over the original Ferret and other baselines. Work led by @HaotianZhang4AI @XyouH @zhegan4.
🌟 Besides Ferret-UI, we also upgrade Ferret to Ferret-v2 for natural images. Several design choices were made along the way: (1) SPHINX-like any-res for referring and grounding; (2) a CLIP encoder for the global low-res image and a DINOv2 encoder for the sub-images; (3) high-res dense alignment before the final SFT.
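For readers curious how the dual-encoder, any-resolution design described above can be wired together, here is a minimal PyTorch sketch. It is not Apple's Ferret-v2 code: the class names (`AnyResVisionTower`, `ToyPatchEncoder`), the grid size, and all dimensions are hypothetical stand-ins, tiny patch embedders replace the real CLIP and DINOv2 backbones, and the high-res dense alignment and SFT stages are omitted entirely.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyPatchEncoder(nn.Module):
    """Tiny ViT-style patch embedder with mean pooling; a stand-in for CLIP / DINOv2."""

    def __init__(self, dim: int):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, 3, 224, 224)
        feats = self.patchify(x)                          # (N, dim, 14, 14)
        return feats.flatten(2).mean(-1)                  # (N, dim)


class AnyResVisionTower(nn.Module):
    """One encoder for the global low-res view, another for the high-res sub-images."""

    def __init__(self, global_dim=1024, local_dim=768, out_dim=4096, grid=2):
        super().__init__()
        self.grid = grid
        self.global_encoder = ToyPatchEncoder(global_dim)   # CLIP stand-in
        self.local_encoder = ToyPatchEncoder(local_dim)     # DINOv2 stand-in
        # Project both streams into the LLM embedding space.
        self.global_proj = nn.Linear(global_dim, out_dim)
        self.local_proj = nn.Linear(local_dim, out_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) with H = W = 224 * grid, e.g. 448x448 for a 2x2 grid.
        b = image.shape[0]

        # Global low-res view: downsample the full image, encode with the CLIP stand-in.
        global_view = F.interpolate(image, size=(224, 224), mode="bilinear", align_corners=False)
        g = self.global_proj(self.global_encoder(global_view))            # (B, out_dim)

        # Any-res tiles: split the high-res image into grid x grid sub-images
        # and encode them with the DINOv2 stand-in.
        tiles = image.unfold(2, 224, 224).unfold(3, 224, 224)             # (B, 3, g, g, 224, 224)
        tiles = tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, 3, 224, 224)  # (B*g*g, 3, 224, 224)
        l = self.local_proj(self.local_encoder(tiles)).reshape(b, -1, g.shape[-1])

        # Concatenate the global token with the tile tokens as visual tokens for the LLM.
        return torch.cat([g.unsqueeze(1), l], dim=1)                      # (B, 1 + grid^2, out_dim)


if __name__ == "__main__":
    tower = AnyResVisionTower()
    visual_tokens = tower(torch.randn(2, 3, 448, 448))
    print(visual_tokens.shape)  # torch.Size([2, 5, 4096])
```

In the real system the two encoders are pretrained CLIP and DINOv2 backbones and the tiling grid adapts to the input resolution; the sketch fixes a 2x2 grid only to keep the example short.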
RT @HaotianZhang4AI: 🚀🚀🚀 Introducing Ferret-v2, a significant upgrade to Ferret that enhances its detailed visual perception ability. With….
RT @mckbrando: Thrilled to share MM1! The MM1 series of models are competitive with Gemini 1 at each of their respective model sizes. Beyo…
VeCLIP is led by @JeffLaiZF, @HaotianZhang4AI, and @bowen_zhang; MOFI is led by Wentao Wu and Aleksei Timofeev. Models are trained using AXLearn.
It is finally there! Thanks @WilliamWangNLP. Check out our work on leveraging multimodal LLMs for image editing, also with Tsu-Jui Fu, @wenzehu, @Phyyysalis, and @zhegan4.
🤩 Apple open-sources MGIE! Now one can take random pictures with an iPhone and edit them with language! Guiding Instruction-based Image Editing via Multimodal Large Language Models, #ICLR2024 spotlight. Apple repo. Gradio.
RT @zhegan4: 🎁🎁 Ferret is a multimodal LLM that is able to refer and ground, and is now open-sourced. Find out our code and checkpoints bel….
github.com/apple/ml-ferret