
Chun-Hsiao (Daniel) Yeh
@danielyehhh
Followers: 130 · Following: 423 · Media: 8 · Statuses: 22
Research Intern @FAIR, Meta | PhD student @UCBerkeley
Berkeley, CA
Joined November 2016
❗️❗️ Can MLLMs understand scenes from multiple camera viewpoints — like humans? 🧭 We introduce All-Angles Bench — 2,100+ QA pairs on multi-view scenes. 📊 We evaluate 27 top MLLMs, including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o. 🌐 Project: https://t.co/yT9aHD3fwm
2
27
79
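For a concrete sense of how a multi-view question from the benchmark can be posed to one of the evaluated MLLMs, here is a minimal Python sketch that sends several camera views plus a question to GPT-4o through the OpenAI API. The image file names, prompt wording, and answer format are illustrative assumptions, not the official evaluation harness.

```python
# Sketch only: hypothetical file names and prompt; not the official All-Angles Bench harness.
import base64
from openai import OpenAI

client = OpenAI()

def to_data_url(path: str) -> str:
    """Encode a local image as a base64 data URL for the chat API."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

views = ["scene01_view1.jpg", "scene01_view2.jpg", "scene01_view3.jpg"]  # assumed files
question = (
    "The same scene is shown from three camera viewpoints. "
    "How many people appear in the scene in total? Answer with a single number."
)

# One text part followed by one image part per camera view.
content = [{"type": "text", "text": question}] + [
    {"type": "image_url", "image_url": {"url": to_data_url(v)}} for v in views
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```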
Our latest book on the mathematical principles of deep learning and intelligence has been released publicly at: https://t.co/ihPBCkI3x5 It also comes with a customized Chatbot that helps readers study and a Chinese version translated mainly by AI. This is an open-source project.
15
278
1K
Imagine a Van Gogh-style teapot turning into glass with one simple slider🎨 Introducing MARBLE, material edits by simply changing CLIP embedding! 🔗 https://t.co/VOHGwUGFVZ 👏 Internship project with @prafull7, @markb_boss , @jampani_varun at @StabilityAI
1
5
25
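As a rough illustration of the idea in the tweet above (material edits driven by a single slider on a CLIP embedding), the sketch below nudges a CLIP image embedding along a precomputed "glass" direction. The direction vector, its file name, and the downstream embedding-conditioned generator are assumptions for illustration, not the released MARBLE code.

```python
# Illustrative sketch, not the MARBLE implementation: edit a CLIP image embedding
# along an assumed material direction and scale it with a slider-like weight.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("teapot.png")                      # assumed input image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    img_emb = model.get_image_features(**inputs)      # (1, 768) CLIP image embedding

# Hypothetical precomputed direction in CLIP space correlated with "glassiness"
# (e.g., mean embedding of glass objects minus mean of non-glass objects).
glass_direction = torch.load("glass_direction.pt")    # assumed (768,) tensor

alpha = 0.7                                           # the "slider" strength
edited_emb = img_emb + alpha * glass_direction        # move the embedding toward glass

# edited_emb would then condition an image generator that accepts CLIP image
# embeddings (e.g., an IP-Adapter-style pipeline); that stage is omitted here.
```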
It’s been 6 years since I did my summer AI research at @YiMaTweets’s lab. Always had a great time hanging out with lab mates. Congrats to @simon_zhai and @HaozhiQ on becoming doctors and joining @GoogleDeepMind and @AIatMeta 💜
18
3
81
🚀 Glad to see our All-Angles Bench (https://t.co/2GeMZmS31b) being adopted to evaluate 3D spatial understanding in Seed-1.5-VL-thinking, alongside OpenAI o1 and Gemini 2.5 Pro!
github.com
Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs - Chenyu-Wang567/All-Angles-Bench
Introducing Seed-1.5-VL-thinking, the model achieves SOTA on 38 out of 60 VLM benchmarks🥳🥳🥳 https://t.co/MOWaHM8leh
0
8
23
Introducing Seed-1.5-VL-thinking, the model achieves SOTA on 38 out of 60 VLM benchmarks🥳🥳🥳 https://t.co/MOWaHM8leh
8
83
462
It seems multimodal large models still have a long way to go before they can truly understand space and scenes.
❗️❗️ Can MLLMs understand scenes from multiple camera viewpoints — like humans? 🧭 We introduce All-Angles Bench — 2,100+ QA pairs on multi-view scenes. 📊 We evaluate 27 top MLLMs, including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o. 🌐 Project: https://t.co/yT9aHD3fwm
2
12
52
[n/n] Huge thanks to my amazing collaborators (@Chenyuu_Wang, @TongPetersb, @ChengTim0708, @TianzheC, @simon_zhai, @Yubei_Chen, @hi_ice_boy, @YiMaTweets) 🔗 ArXiv: https://t.co/PYI4kSFklv 💻 Code: https://t.co/fipQeSjxZE 🤗 Hugging Face Benchmark:
huggingface.co
0
0
3
[7/n] 📷 While GPT-4o and Gemini-2.0-Flash handle single-view scene reconstruction reasonably well, they falter in aligning multi-view perspectives. 🧭 Poor camera pose estimation → flawed directional reasoning → weak multi-view consistency.
1
1
2
[6/n] 🧠 We test CoT methods on GPT-4o, Ovis2, and InternVL2.5 under full & partial views. 📈 CoT helps GPT-4o in partial-view counting, but shows little gain on strong models like InternVL2.5. ⚠️ Takeaway: Prompting isn’t enough—multi-view reasoning needs specialized training.
1
0
0
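To make the comparison in the post above concrete, here are generic stand-in prompt variants for the direct vs. chain-of-thought setting on a partial-view counting question; the exact prompts used in the paper are not shown in the thread.

```python
# Stand-in prompts only; the paper's exact CoT templates are not quoted in the thread.
direct_prompt = (
    "You see two partial views of the same scene. "
    "How many chairs are visible in total? Answer with a single number."
)

cot_prompt = (
    "You see two partial views of the same scene. "
    "First list the chairs visible in each view, note which chairs appear in both views, "
    "then count the union. Think step by step, and end with 'Answer: <number>'."
)
```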
[5/n] 🔍 We analyze MLLM consistency using paired QAs: ✅ CC = both correct, ❌ WW = both wrong, ⚠️ IC = inconsistent. 1️⃣ GPT-4o shows ~70% IC on relative distance—highly unstable! 2️⃣ All models >40% IC on relative direction → struggles w/ orientation.
1
0
0
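The CC/WW/IC breakdown described above is straightforward to compute once each model's correctness on both questions of a pair is known; a small sketch (with made-up outcomes) is below.

```python
# Sketch of the paired-question consistency breakdown: CC (both correct),
# WW (both wrong), IC (inconsistent: exactly one correct).
from collections import Counter

def consistency_breakdown(pairs):
    """pairs: list of (correct_on_q1, correct_on_q2) booleans for paired QAs."""
    counts = Counter()
    for ok1, ok2 in pairs:
        if ok1 and ok2:
            counts["CC"] += 1
        elif not ok1 and not ok2:
            counts["WW"] += 1
        else:
            counts["IC"] += 1
    total = sum(counts.values())
    return {k: counts[k] / total for k in ("CC", "WW", "IC")}

# Made-up outcomes for four question pairs:
pairs = [(True, True), (True, False), (False, False), (False, True)]
print(consistency_breakdown(pairs))  # {'CC': 0.25, 'WW': 0.25, 'IC': 0.5}
```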
[4/n] 🤔 From evaluating 27 MLLMs, we have two findings: Finding 1️⃣: Tasks that are simple for humans, like coarse camera pose estimation, pose challenges for MLLMs. Finding 2️⃣: Certain open-source MLLMs surpass closed-source ones on orientation-sensitive tasks.
1
0
0
[3/n] 🧠 How we built All-Angles Bench: (1) Curated 90 diverse multi-view scenes & 6 task types (2) Generated questions via MLLMs + refined w/ human annotation (3) Created cross-view question pairs to test consistency & visual grounding
1
0
0
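To illustrate step (3) above, here is one plausible shape for a cross-view question pair: the same underlying query grounded in two different view orderings so that answer consistency can be checked later. The field names and wording are assumptions, not the released benchmark schema.

```python
# Hypothetical cross-view question pair; field names are assumptions, not the released schema.
pair = {
    "scene_id": "scene01",
    "task": "relative_direction",
    "question_a": {
        "views": ["view1.jpg", "view2.jpg"],
        "question": "From the camera in the first image, is the red chair left or right of the table?",
        "answer": "left",
    },
    "question_b": {
        "views": ["view2.jpg", "view1.jpg"],  # same scene, views presented in the other order
        "question": "From the camera in the second image, is the red chair left or right of the table?",
        "answer": "left",
    },
}
```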
[2/n] 🧠 All-Angles Bench comprises six challenging tasks: counting, attribute identification, relative distance, relative direction, manipulation, and camera pose estimation. These question types are designed to investigate several major aspects of 3D scene understanding.
1
0
2
Surprising that diffusion models already have these capabilities without the need for further training!! Congrats @ChengTim0708
Today, with my collaborators @prafull7 (MIT CSAIL), @jampani_varun (@StabilityAI), and my supervisors Niki Trigoni and Andrew Markham, we share with you ZeST, a zero-shot, training-free method for image-to-image material transfer! Project Page: https://t.co/0fsl32S07t 1/8
1
0
1
Thanks, @_akhaliq, for sharing our work! 🙏 Huge props to @ChengTim0708, @hyhsiehlouis, @chuanenlin, @HTKung236938, @YiMaTweets, and @Yubei_Chen for making it all happen 🙌 With 🏞️ Gen4Gen, you can easily compose your own images into realistic scenes, complete with rich text details!
Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition. Recent text-to-image diffusion models are able to learn and synthesize images containing novel, personalized concepts (e.g., their own pets or specific items) with just a few examples for training. This
0
6
36
Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition. Recent text-to-image diffusion models are able to learn and synthesize images containing novel, personalized concepts (e.g., their own pets or specific items) with just a few examples for training. This
6
65
238
Here's my take on the Sora technical report, with a good dose of speculation that could be totally off. First of all, really appreciate the team for sharing helpful insights and design decisions – Sora is incredible and is set to transform the video generation community. What we
42
532
3K
Our groundbreaking work enables personalized search, allowing you to easily find specific moments in videos where your personal instances appear! Our poster is in the morning session tomorrow (tag: THU-AM-252) on Thursday, June 22nd. #CVPR2023 @FabianCabaH
Meta-Personalizing Vision-Language Models to Find Named Instances in Video. Paper page: https://t.co/whF6qauh7g Large-scale vision-language models (VLM) have shown impressive results for language-guided search applications. While these models allow category-level queries, they
0
0
0
Meta-Personalizing Vision-Language Models to Find Named Instances in Video. Paper page: https://t.co/whF6qauh7g Large-scale vision-language models (VLM) have shown impressive results for language-guided search applications. While these models allow category-level queries, they
1
15
96