
Chun-Hsiao (Daniel) Yeh
@danielyehhh
Followers: 130 · Following: 423 · Media: 8 · Statuses: 22
Research Intern @FAIR, Meta | PhD student @UCBerkeley
Berkeley, CA
Joined November 2016
❗️❗️ Can MLLMs understand scenes from multiple camera viewpoints — like humans? 🧭 We introduce All-Angles Bench — 2,100+ QA pairs on multi-view scenes. 📊 We evaluate 27 top MLLMs, including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o. 🌐 Project: https://t.co/yT9aHD3fwm
2
27
79
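For a concrete sense of how a multi-view question from the benchmark can be posed to one of the evaluated MLLMs, here is a minimal Python sketch that sends several camera views plus a question to GPT-4o through the OpenAI API. The image file names, prompt wording, and answer format are illustrative assumptions, not the official evaluation harness.

```python
# Sketch only: hypothetical file names and prompt; not the official All-Angles Bench harness.
import base64
from openai import OpenAI

client = OpenAI()

def to_data_url(path: str) -> str:
    """Encode a local image as a base64 data URL for the chat API."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

views = ["scene01_view1.jpg", "scene01_view2.jpg", "scene01_view3.jpg"]  # assumed files
question = (
    "The same scene is shown from three camera viewpoints. "
    "How many people appear in the scene in total? Answer with a single number."
)

# One text part followed by one image part per camera view.
content = [{"type": "text", "text": question}] + [
    {"type": "image_url", "image_url": {"url": to_data_url(v)}} for v in views
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```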
Our latest book on the mathematical principles of deep learning and intelligence has been released publicly at: https://t.co/ihPBCkI3x5 It also comes with a customized Chatbot that helps readers study and a Chinese version translated mainly by AI. This is an open-source project.
15
278
1K
Imagine a Van Gogh-style teapot turning into glass with one simple slider🎨 Introducing MARBLE, material edits by simply changing CLIP embedding! 🔗 https://t.co/VOHGwUGFVZ 👏 Internship project with @prafull7, @markb_boss , @jampani_varun at @StabilityAI
1
5
25
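As a rough illustration of the idea in the tweet above (material edits driven by a single slider on a CLIP embedding), the sketch below nudges a CLIP image embedding along a precomputed "glass" direction. The direction vector, its file name, and the downstream embedding-conditioned generator are assumptions for illustration, not the released MARBLE code.

```python
# Illustrative sketch, not the MARBLE implementation: edit a CLIP image embedding
# along an assumed material direction and scale it with a slider-like weight.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("teapot.png")                      # assumed input image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    img_emb = model.get_image_features(**inputs)      # (1, 768) CLIP image embedding

# Hypothetical precomputed direction in CLIP space correlated with "glassiness"
# (e.g., mean embedding of glass objects minus mean of non-glass objects).
glass_direction = torch.load("glass_direction.pt")    # assumed (768,) tensor

alpha = 0.7                                           # the "slider" strength
edited_emb = img_emb + alpha * glass_direction        # move the embedding toward glass

# edited_emb would then condition an image generator that accepts CLIP image
# embeddings (e.g., an IP-Adapter-style pipeline); that stage is omitted here.
```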
It’s been 6 years since I did my summer AI research at @YiMaTweets’s lab. Always had a great time hanging out with lab mates. Congrats to @simon_zhai and @HaozhiQ on becoming doctors and joining @GoogleDeepMind and @AIatMeta 💜
18
3
81
🚀 Glad to see our All-Angles Bench (https://t.co/2GeMZmS31b) being adopted to evaluate 3D spatial understanding in Seed-1.5-VL-thinking, alongside OpenAI o1 and Gemini 2.5 Pro!
github.com
Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs - Chenyu-Wang567/All-Angles-Bench
Introducing Seed-1.5-VL-thinking, the model achieves SOTA on 38 out of 60 VLM benchmarks🥳🥳🥳 https://t.co/MOWaHM8leh
0
8
23
Introducing Seed-1.5-VL-thinking, the model achieves SOTA on 38 out of 60 VLM benchmarks🥳🥳🥳 https://t.co/MOWaHM8leh
8
83
462
It seems multimodal large models still have a long way to go before they can truly understand space and scenes.
❗️❗️ Can MLLMs understand scenes from multiple camera viewpoints — like humans? 🧭 We introduce All-Angles Bench — 2,100+ QA pairs on multi-view scenes. 📊 We evaluate 27 top MLLMs, including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o. 🌐 Project: https://t.co/yT9aHD3fwm
2
12
52
[n/n] Huge thanks to my amazing collaborators (@Chenyuu_Wang, @TongPetersb, @ChengTim0708, @TianzheC, @simon_zhai, @Yubei_Chen, @hi_ice_boy, @YiMaTweets) 🔗 ArXiv: https://t.co/PYI4kSFklv 💻 Code: https://t.co/fipQeSjxZE 🤗 Hugging Face Benchmark:
huggingface.co
0
0
3
[7/n] 📷 While GPT-4o and Gemini-2.0-Flash handle single-view scene reconstruction reasonably well, they falter in aligning multi-view perspectives. 🧭 Poor camera pose estimation → flawed directional reasoning → weak multi-view consistency.
1
1
2
[6/n] 🧠 We test CoT methods on GPT-4o, Ovis2, and InternVL2.5 under full & partial views. 📈 CoT helps GPT-4o in partial-view counting, but shows little gain on strong models like InternVL2.5. ⚠️ Takeaway: Prompting isn’t enough—multi-view reasoning needs specialized training.
1
0
0
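To make the comparison in the post above concrete, here are generic stand-in prompt variants for the direct vs. chain-of-thought setting on a partial-view counting question; the exact prompts used in the paper are not shown in the thread.

```python
# Stand-in prompts only; the paper's exact CoT templates are not quoted in the thread.
direct_prompt = (
    "You see two partial views of the same scene. "
    "How many chairs are visible in total? Answer with a single number."
)

cot_prompt = (
    "You see two partial views of the same scene. "
    "First list the chairs visible in each view, note which chairs appear in both views, "
    "then count the union. Think step by step, and end with 'Answer: <number>'."
)
```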
[5/n] 🔍 We analyze MLLM consistency using paired QAs: ✅ CC = both correct, ❌ WW = both wrong, ⚠️ IC = inconsistent. 1️⃣ GPT-4o shows ~70% IC on relative distance—highly unstable! 2️⃣ All models >40% IC on relative direction → struggles w/ orientation.
1
0
0
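The CC/WW/IC breakdown described above is straightforward to compute once each model's correctness on both questions of a pair is known; a small sketch (with made-up outcomes) is below.

```python
# Sketch of the paired-question consistency breakdown: CC (both correct),
# WW (both wrong), IC (inconsistent: exactly one correct).
from collections import Counter

def consistency_breakdown(pairs):
    """pairs: list of (correct_on_q1, correct_on_q2) booleans for paired QAs."""
    counts = Counter()
    for ok1, ok2 in pairs:
        if ok1 and ok2:
            counts["CC"] += 1
        elif not ok1 and not ok2:
            counts["WW"] += 1
        else:
            counts["IC"] += 1
    total = sum(counts.values())
    return {k: counts[k] / total for k in ("CC", "WW", "IC")}

# Made-up outcomes for four question pairs:
pairs = [(True, True), (True, False), (False, False), (False, True)]
print(consistency_breakdown(pairs))  # {'CC': 0.25, 'WW': 0.25, 'IC': 0.5}
```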
[4/n] 🤔 From evaluating 27 MLLMs, we have two findings: Finding 1️⃣: Tasks that are simple for humans, like coarse camera pose estimation, pose challenges for MLLMs. Finding 2️⃣: Certain open-source MLLMs surpass closed-source ones on orientation-sensitive tasks.
1
0
0
[3/n] 🧠 How we built All-Angles Bench: (1) Curated 90 diverse multi-view scenes & 6 task types (2) Generated questions via MLLMs + refined w/ human annotation (3) Created cross-view question pairs to test consistency & visual grounding
1
0
0
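To illustrate step (3) above, here is one plausible shape for a cross-view question pair: the same underlying query grounded in two different view orderings so that answer consistency can be checked later. The field names and wording are assumptions, not the released benchmark schema.

```python
# Hypothetical cross-view question pair; field names are assumptions, not the released schema.
pair = {
    "scene_id": "scene01",
    "task": "relative_direction",
    "question_a": {
        "views": ["view1.jpg", "view2.jpg"],
        "question": "From the camera in the first image, is the red chair left or right of the table?",
        "answer": "left",
    },
    "question_b": {
        "views": ["view2.jpg", "view1.jpg"],  # same scene, views presented in the other order
        "question": "From the camera in the second image, is the red chair left or right of the table?",
        "answer": "left",
    },
}
```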
[2/n] 🧠 All-Angles Bench comprises six challenging tasks: counting, attribute identification, relative distance, relative direction, manipulation, and camera pose estimation. These question types are designed to investigate several major aspects of 3D scene understanding.
1
0
2
Surprising that diffusion models already have these capabilities without the need for further training!! Congrats @ChengTim0708
Today, with my collaborators @prafull7 (MIT CSAIL), @jampani_varun (@StabilityAI), and my supervisors Niki Trigoni and Andrew Markham, we share with you ZeST, a zero-shot, training-free method for image-to-image material transfer! Project Page: https://t.co/0fsl32S07t 1/8
1
0
1
Thanks, @_akhaliq, for sharing our work! 🙏 Huge props to @ChengTim0708, @hyhsiehlouis, @chuanenlin, @HTKung236938, @YiMaTweets, and @Yubei_Chen for making it all happen 🙌 With 🏞️ Gen4Gen, you can easily compose your own images into realistic scenes, complete with rich text details!
Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition. Recent text-to-image diffusion models are able to learn and synthesize images containing novel, personalized concepts (e.g., their own pets or specific items) with just a few examples for training. This
0
6
36
Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition. Recent text-to-image diffusion models are able to learn and synthesize images containing novel, personalized concepts (e.g., their own pets or specific items) with just a few examples for training. This
6
65
238
Here's my take on the Sora technical report, with a good dose of speculation that could be totally off. First of all, really appreciate the team for sharing helpful insights and design decisions – Sora is incredible and is set to transform the video generation community. What we
42
532
3K
Our groundbreaking work enables personalized search, allowing you to easily find specific moments in videos where your personal instances appear! Our poster is in the morning session tomorrow (tag: THU-AM-252) on Thursday, June 22nd. #CVPR2023 @FabianCabaH
Meta-Personalizing Vision-Language Models to Find Named Instances in Video. Paper page: https://t.co/whF6qauh7g Large-scale vision-language models (VLM) have shown impressive results for language-guided search applications. While these models allow category-level queries, they
0
0
0
Meta-Personalizing Vision-Language Models to Find Named Instances in Video. Paper page: https://t.co/whF6qauh7g Large-scale vision-language models (VLM) have shown impressive results for language-guided search applications. While these models allow category-level queries, they
1
15
96