Mark Endo
@mark_endo1
Followers: 150 · Following: 235 · Media: 26 · Statuses: 56
Computer Science PhD student @Stanford | AI + Health
Stanford, CA
Joined August 2016
Work done together with the fantastic @yeung_levy at @StanfordAILab! Read our paper here: https://t.co/Xb2MBLtdSl Project website: https://t.co/L4Arg3vSBd Code:
Our final two-stage approach, Extract+Think, demonstrates extreme parameter and data efficiency, improving over LLaVA-OneVision while using 95% fewer visual training samples.
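For concreteness, a minimal sketch of what a two-stage Extract+Think inference loop could look like; `query_vlm`, the prompt wording, and the stage split are illustrative assumptions, not the paper's exact templates or code.

```python
# Minimal sketch of a two-stage Extract+Think style inference loop.
# `query_vlm` is a hypothetical stand-in for a call to a small multimodal
# model; the prompts are illustrative, not the paper's templates.

def query_vlm(image, prompt: str) -> str:
    """Placeholder for a call to a small multimodal model."""
    return "<model output>"

def extract_then_think(image, question: str) -> str:
    # Stage 1 (Extract): pull out the instruction-relevant visual details.
    details = query_vlm(
        image,
        f"List the visual details in the image that are relevant to: {question}",
    )
    # Stage 2 (Think): reason step by step over the extracted details.
    return query_vlm(
        image,
        f"Visual details: {details}\n"
        f"Question: {question}\n"
        "Think step by step over the details above, then give the final answer.",
    )
```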
We then enhance reasoning by applying step-by-step thinking over the extracted visual details, substantially improving performance without requiring any additional supervision on visual data.
To address the critical perception bottleneck, we propose visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks. We find that this approach dramatically reduces the perception bottleneck.
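To give a rough sense of what extraction tuning data could look like, here is a hypothetical sample format where the training target is the instruction-relevant details themselves; the field names and example content are assumptions, not the released data format.

```python
# Hypothetical formatting of a visual extraction tuning sample: the target is
# the set of instruction-relevant visual details rather than the final answer.
# Field names and the example content are placeholders for illustration.

def to_extraction_sample(image_path: str, instruction: str, relevant_details: str) -> dict:
    return {
        "image": image_path,
        "prompt": (
            f"{instruction}\n"
            "Extract the visual details relevant to this instruction."
        ),
        "target": relevant_details,  # supervision: the relevant details themselves
    }

sample = to_extraction_sample(
    "example_chart.png",                                  # placeholder image
    "Which year has the highest value in the chart?",     # placeholder instruction
    "Bars for 2019-2023; the 2022 bar is the tallest.",   # placeholder details
)
```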
Next, we ask whether this behavior stems only from the expected drop in visual reasoning or also from a more fundamental impairment of perception. A decoupled analysis shows that while downscaling impairs reasoning as expected, it also strongly degrades perception across numerous tasks.
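One way to picture such a decoupled analysis (an illustrative protocol, not necessarily the paper's): score the model once end to end from pixels and once with ground-truth visual facts supplied as text, so the cost attributable to perception can be separated from reasoning.

```python
# Illustrative perception/reasoning decoupling (an assumed protocol, not the
# paper's exact setup): compare accuracy from pixels vs. accuracy when the
# ground-truth visual facts are provided as text.

def accuracy(predictions, answers):
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

def decoupled_report(preds_from_pixels, preds_from_oracle_text, answers):
    end_to_end = accuracy(preds_from_pixels, answers)            # perception + reasoning
    reasoning_only = accuracy(preds_from_oracle_text, answers)   # reasoning with perfect perception
    return {
        "end_to_end": end_to_end,
        "reasoning_given_oracle_perception": reasoning_only,
        "perception_gap": reasoning_only - end_to_end,           # cost attributable to perception
    }
```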
First, we examine how LLM downscaling impacts performance across diverse visual instruction tuning tasks. We find a striking pattern: rather than primarily affecting tasks that rely heavily on the base LLM (e.g., general, knowledge), downscaling is most detrimental to vision-centric capabilities.
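A small helper for this kind of breakdown, assuming per-benchmark scores and a benchmark-to-category mapping; the names and numbers below are placeholders, not results from the paper.

```python
# Group per-benchmark score changes under LLM downscaling by task category.
# Benchmark names and scores are placeholders, not results from the paper.
from collections import defaultdict

def delta_by_category(scores_large, scores_small, categories):
    deltas = defaultdict(list)
    for bench, cat in categories.items():
        deltas[cat].append(scores_small[bench] - scores_large[bench])
    return {cat: sum(v) / len(v) for cat, v in deltas.items()}

categories = {"bench_a": "vision-centric", "bench_b": "knowledge"}   # placeholder mapping
print(delta_by_category({"bench_a": 60.0, "bench_b": 70.0},          # larger LLM
                        {"bench_a": 45.0, "bench_b": 68.0},          # smaller LLM
                        categories))
```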
TLDR: Vision-centric capabilities degrade under LLM downscaling, driven by both a reasoning decline and a central perception bottleneck. We tackle this with a two-stage framework, featuring a new visual extraction tuning method alongside step-by-step visual reasoning. Findings⬇️
Thinking about using small multimodal models? Want a clearer understanding of what breaks when downscaling model size, and why? ✨Introducing our new work on Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models 🧵👇
There’s growing excitement around VLMs and their potential to transform surgery 🏥, but where exactly are we on the path to AI-assisted surgical procedures? In our latest work, we systematically evaluated leading VLMs across major surgical tasks where AI is gaining traction...🧵
🚨Large video-language models like LLaVA-Video can do single-video tasks. But can they compare videos? Imagine you’re learning a sports skill like kicking: can an AI tell how your kick differs from an expert video? 🚀 Introducing "Video Action Differencing" (VidDiff), ICLR 2025 🧵
Work done together with the amazing @XiaohanWang96 and @yeung_levy at @StanfordAILab! Read our paper here: https://t.co/Yv1TVRn97u Project website:
Strikingly, our approach achieves this performance improvement while only retaining 3.3% of visual tokens for the second half of LLM layers.
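For scale, a back-of-envelope on the 3.3% figure, assuming a 24x24 visual token grid (576 tokens, common in LLaVA-style models; an assumption here, not a number from the thread):

```python
# Back-of-envelope: with an assumed 24x24 visual token grid (576 tokens),
# retaining 3.3% leaves only ~19 visual tokens for the later LLM layers.
num_visual_tokens = 24 * 24
kept = round(num_visual_tokens * 0.033)
print(num_visual_tokens, "->", kept)  # 576 -> 19
```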
We find that FEATHER results in substantial performance gains compared to the original acceleration approach, improving localization performance by more than 5x with comparable computational savings.
Guided by our insights, we propose FEATHER: a simple approach that resolves the identified issue with early-layer pruning, incorporates uniform sampling, and employs two-stage pruning to balance the increased speedup of early-layer pruning with enhanced token selection at a later layer.
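A rough sketch of the two-stage idea under stated assumptions (not the official FEATHER code): mix criterion-ranked tokens with uniformly sampled ones at an early layer, then prune more aggressively at a later layer where the criterion is more reliable.

```python
import torch

# Rough two-stage pruning sketch (not the official FEATHER implementation):
# keep a mix of criterion-ranked and uniformly sampled visual tokens early,
# then re-prune with recomputed scores at a later layer.

def prune_indices(scores: torch.Tensor, keep_ratio: float, uniform_frac: float) -> torch.Tensor:
    n = scores.numel()
    k = max(1, int(n * keep_ratio))
    k_uniform = max(1, int(k * uniform_frac))
    uniform_idx = torch.linspace(0, n - 1, steps=k_uniform).long()  # evenly spaced tokens
    k_top = max(0, k - k_uniform)
    top_idx = scores.topk(k_top).indices if k_top > 0 else torch.empty(0, dtype=torch.long)
    return torch.unique(torch.cat([uniform_idx, top_idx]))  # duplicates collapse; count may be < k

# Stage 1: light pruning at an early layer for speedup.
scores_early = torch.rand(576)                     # placeholder criterion scores
stage1 = prune_indices(scores_early, keep_ratio=0.5, uniform_frac=0.5)
# Stage 2: stronger pruning at a later layer, where the criterion is more reliable.
scores_late = torch.rand(stage1.numel())           # scores recomputed at the later layer
stage2 = stage1[prune_indices(scores_late, keep_ratio=0.1, uniform_frac=0.5)]
```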
We then propose and evaluate alternative pruning criteria. 💡Insights: (1) removing the tendency to select bottom tokens improves early pruning, (2) later pruning still boosts performance, and (3) adding uniform token sampling enhances early-layer pruning.
3️⃣Explaining performance on other tasks. The majority of evaluated benchmarks do not require fine-grained visual grounding, as they can often be answered using only visual tokens located toward the bottom of the image even when pruning before the language model.
2️⃣Interpreting poor vision-centric task performance. The pruning criterion, when applied after shallow layers, predominantly selects visual tokens from the bottom part of the image.
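A quick way to quantify that bias, assuming the kept-token indices come from an H x W grid in row-major order (an assumption about the layout):

```python
# Fraction of kept visual tokens falling in the bottom half of an HxW grid,
# assuming row-major token ordering.

def bottom_half_fraction(kept_indices, grid_h: int, grid_w: int) -> float:
    rows = [i // grid_w for i in kept_indices]
    return sum(r >= grid_h // 2 for r in rows) / len(rows)

# A value near 1.0 for a 24x24 grid would indicate the bottom-of-image bias.
print(bottom_half_fraction([500, 510, 560, 570], grid_h=24, grid_w=24))  # 1.0
```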
1️⃣Early token pruning falters in vision-centric tasks. Pruning visual tokens after shallow LLM layers results in a large performance decline on localization benchmarks and a moderate decrease on TextVQA, whereas performance remains relatively unchanged on other evaluated tasks.
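For context, a rough sketch of the kind of shallow-layer pruning criterion being revisited here: rank visual tokens by the attention they receive from the last token at an early layer and keep only the top fraction. The shapes, token layout, and criterion details are assumptions for illustration, not the exact accelerated method evaluated in the paper.

```python
import torch

# Sketch of an attention-based early-pruning criterion (illustrative only):
# after a shallow LLM layer, rank visual tokens by the attention they receive
# from the last token and keep the top fraction.

def prune_after_layer(attn: torch.Tensor, visual_slice: slice, keep_ratio: float) -> torch.Tensor:
    """attn: [num_heads, seq_len, seq_len] attention weights at the chosen layer."""
    last_token_attn = attn.mean(0)[-1]            # head-averaged attention from the last token
    vis_scores = last_token_attn[visual_slice]    # restrict to the visual token span
    k = max(1, int(vis_scores.numel() * keep_ratio))
    keep = vis_scores.topk(k).indices + visual_slice.start
    return keep.sort().values                     # kept visual-token positions in the sequence

# Placeholder usage: 576 visual tokens occupying sequence positions 35-610.
attn = torch.rand(32, 700, 700).softmax(-1)
kept = prune_after_layer(attn, slice(35, 611), keep_ratio=0.25)
```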
TLDR: We demonstrate that strong performance across many tasks is not due to an exceptional ability to compress visual information, but rather to the benchmarks’ limited ability to assess fine-grained visual capabilities. Let’s dive into the findings! 🚀
How does visual token pruning after early VLM layers maintain strong performance with reduced visual information? The answer may not be what you expect. ✨Introducing our new paper Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration 🧵👇