Mark Endo Profile
Mark Endo

@mark_endo1

Followers 150 · Following 235 · Media 26 · Statuses 56

Computer Science PhD student @Stanford | AI + Health

Stanford, CA
Joined August 2016
@mark_endo1
Mark Endo
2 days
Work done together with the fantastic @yeung_levy at @StanfordAILab! Read our paper here: https://t.co/Xb2MBLtdSl Project website: https://t.co/L4Arg3vSBd Code:
0
0
1
@mark_endo1
Mark Endo
2 days
Our final two-stage approach, Extract+Think, demonstrates extreme parameter and data efficiency, improving over LLaVA-OneVision while using 95% fewer visual training samples.
1
0
0
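A minimal sketch of what an Extract+Think-style two-stage pipeline could look like, based only on the tweets above: first extract instruction-relevant visual details from the image, then reason step by step over those details as text. The `generate` callable and the prompt wording are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of an Extract+Think style two-stage pipeline.
# `generate` is a hypothetical callable wrapping a small multimodal
# model's generation API; prompt wording is illustrative only.

def extract_then_think(model, image, instruction, generate):
    # Stage 1: extraction -- surface the visual details relevant to the
    # instruction (the image is only consumed in this stage).
    extraction_prompt = (
        "List the visual details in the image that are relevant to "
        f"this instruction: {instruction}"
    )
    visual_details = generate(model, image=image, prompt=extraction_prompt)

    # Stage 2: step-by-step reasoning over the extracted details as text;
    # no additional visual supervision is needed here.
    reasoning_prompt = (
        f"Visual details: {visual_details}\n"
        f"Instruction: {instruction}\n"
        "Think step by step, then give the final answer."
    )
    return generate(model, image=None, prompt=reasoning_prompt)
```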
@mark_endo1
Mark Endo
2 days
We then enhance reasoning by applying step-by-step thinking over the extracted visual details, substantially improving performance without requiring any additional supervision on visual data.
1
0
0
@mark_endo1
Mark Endo
2 days
To address the critical perception bottleneck, we propose visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks. We find that this approach dramatically reduces the perception bottleneck.
1
0
0
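For concreteness, here is a sketch of the kind of training sample visual extraction tuning (as described in the tweet above) might use, where the supervision target is the instruction-relevant visual details rather than the final answer. The field names and prompt wording are placeholder assumptions, not the paper's data schema.

```python
# Illustrative shape of a visual-extraction-tuning sample, assuming a
# standard (prompt -> target) instruction-tuning format. Field names and
# prompt wording are placeholders, not the paper's schema.

def make_extraction_sample(image_path, instruction, relevant_details):
    """Supervise the model to output instruction-relevant visual details
    (rather than a final answer), so one recipe applies across tasks."""
    return {
        "image": image_path,
        "prompt": (
            "Extract the visual details relevant to this instruction: "
            + instruction
        ),
        "target": relevant_details,  # e.g. objects, text, attributes, layout
    }
```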
@mark_endo1
Mark Endo
2 days
Next, we ask whether this behavior stems only from the expected decline in visual reasoning or also from a more fundamental impairment of perception. A decoupled analysis shows that while downscaling impairs reasoning as expected, it also strongly degrades perception across numerous tasks.
1
0
0
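The tweet above reports a decoupled analysis but does not spell out the protocol. The sketch below is one assumed way such a decoupling could be instrumented: perception scored by asking the model to describe the relevant visual content from the image, reasoning scored by answering from ground-truth details given as text. Every name and prompt here is hypothetical.

```python
# One assumed way to decouple perception from reasoning when comparing
# LLM sizes; `generate` and `match` (a scoring helper) are hypothetical.

def perception_score(model, image, instruction, reference_details,
                     generate, match):
    """Perception: can the model surface the relevant visual content?"""
    described = generate(
        model, image=image,
        prompt=f"Describe what in the image is relevant to: {instruction}")
    return match(described, reference_details)

def reasoning_score(model, instruction, reference_details, answer,
                    generate, match):
    """Reasoning: given ground-truth visual details as text, can the model
    reach the correct answer without needing to perceive the image?"""
    predicted = generate(
        model, image=None,
        prompt=f"Details: {reference_details}\nQuestion: {instruction}")
    return match(predicted, answer)
```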
@mark_endo1
Mark Endo
2 days
First, we examine how LLM downscaling impacts performance across diverse visual instruction tuning tasks. We find a striking pattern: instead of affecting tasks that rely heavily on the base LLM (e.g., general, knowledge), it is most detrimental to vision-centric capabilities.
1
0
0
@mark_endo1
Mark Endo
2 days
TLDR: Vision-centric capabilities degrade under LLM downscaling, driven by both a reasoning decline and a central perception bottleneck. We tackle this with a two-stage framework, featuring a new visual extraction tuning method alongside step-by-step visual reasoning. Findings⬇️
1
0
0
@mark_endo1
Mark Endo
2 days
Thinking about using small multimodal models? Want a clearer understanding of what breaks when downscaling model size, and why? ✨Introducing our new work on Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models 🧵👇
1
7
28
@AkliluJosiah2
Josiah Aklilu
8 months
There’s growing excitement around VLMs and their potential to transform surgery🏥—but where exactly are we on the path to AI-assisted surgical procedures? In our latest work, we systematically evaluated leading VLMs across major surgical tasks where AI is gaining traction..🧵
2
6
30
@jmhb0
James Burgess (at CVPR)
9 months
🚨Large video-language models LLaVA-Video can do single-video tasks. But can they compare videos? Imagine you’re learning a sports skill like kicking: can an AI tell how your kick differs from an expert video? 🚀 Introducing "Video Action Differencing" (VidDiff), ICLR 2025 🧵
7
48
58
@mark_endo1
Mark Endo
11 months
Work done together with the amazing @XiaohanWang96 and @yeung_levy at @StanfordAILab! Read our paper here: https://t.co/Yv1TVRn97u Project website:
0
0
1
@mark_endo1
Mark Endo
11 months
Strikingly, our approach achieves this performance improvement while retaining only 3.3% of visual tokens for the second half of LLM layers.
1
0
1
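For a sense of scale, a quick back-of-envelope calculation of what 3.3% retention means in raw token counts, assuming a 576-token visual budget typical of LLaVA-1.5-style models (the actual budget of the model in the paper may differ):

```python
# Back-of-envelope: what 3.3% retention means in raw token counts,
# assuming a 576-token visual budget (LLaVA-1.5-style; illustrative only).

visual_tokens = 576
retained = round(visual_tokens * 0.033)  # ~19 tokens
print(f"{retained} of {visual_tokens} visual tokens kept "
      "in the second half of LLM layers")
```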
@mark_endo1
Mark Endo
11 months
We find that FEATHER results in substantial performance gains compared to the original acceleration approach, improving localization performance by more than 5x with comparable computational savings.
1
0
0
@mark_endo1
Mark Endo
11 months
Guided by our insights, we propose FEATHER: a simple approach that resolves the issue with early-layer pruning, incorporates uniform sampling, and employs two-stage pruning to balance the greater speedup of early-layer pruning against the more reliable token selection available at a later layer.
1
0
0
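A minimal sketch of how a FEATHER-style two-stage schedule with uniform sampling might be wired up, based only on the tweet above. The layer indices, keep ratios, and scoring rule are placeholder assumptions rather than the paper's settings.

```python
import torch

# Minimal sketch of a FEATHER-style two-stage pruning schedule with
# uniform sampling. Layer indices, keep ratios, and the scoring rule are
# placeholder assumptions, not the paper's settings.

def select_tokens(scores: torch.Tensor, keep: int, uniform_frac: float = 0.5):
    """Keep a mix of uniformly sampled visual tokens and top-scoring ones."""
    n = scores.shape[0]
    n_uniform = int(keep * uniform_frac)
    uniform_idx = torch.linspace(0, n - 1, steps=n_uniform).long()
    masked = scores.clone()
    masked[uniform_idx] = float("-inf")            # avoid double-selecting
    top_idx = masked.topk(keep - n_uniform).indices
    return torch.cat([uniform_idx, top_idx]).sort().values

def feather_schedule(num_visual_tokens: int):
    """Two-stage schedule: moderate pruning at an early layer for speedup,
    aggressive pruning at a later layer where token scores are reliable."""
    return [
        {"layer": 2,  "keep": num_visual_tokens // 2},            # early stage
        {"layer": 16, "keep": round(num_visual_tokens * 0.033)},  # late stage
    ]
```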
@mark_endo1
Mark Endo
11 months
We then propose and evaluate alternative pruning criteria. 💡Insights: (1) removing the tendency to select bottom tokens improves early pruning, (2) later pruning still boosts performance, and (3) adding uniform token sampling enhances early-layer pruning.
1
0
0
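As a small illustration of insight (1), here is a quick diagnostic (not from the paper) for the bottom-token bias: compare the mean row position of the kept visual tokens against the uniform-selection baseline. The 24x24 patch grid is an assumed example.

```python
import torch

# Quick diagnostic (not from the paper) for the bottom-token bias noted in
# insight (1): compare the mean row of kept visual tokens against the
# uniform-selection baseline. A 24x24 patch grid is an assumed example.

def row_bias(kept_indices: torch.Tensor, grid_h: int = 24, grid_w: int = 24):
    mean_row = (kept_indices // grid_w).float().mean().item()
    expected = (grid_h - 1) / 2      # uniform baseline
    return mean_row, expected        # mean_row >> expected => bottom bias

# Example: a selection concentrated in the last four rows of the grid
kept = torch.arange(24 * 20, 24 * 24)
print(row_bias(kept))                # (21.5, 11.5)
```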
@mark_endo1
Mark Endo
11 months
3️⃣Explaining performance on other tasks. Most evaluated benchmarks do not require fine-grained visual grounding: they can often be answered using only visual tokens located toward the bottom of the image, even when pruning happens before the language model.
1
0
0
@mark_endo1
Mark Endo
11 months
2️⃣Interpreting poor vision-centric task performance. When applied after shallow layers, the pruning criterion predominantly selects visual tokens from the bottom part of the image.
1
0
0
@mark_endo1
Mark Endo
11 months
1️⃣Early token pruning falters in vision-centric tasks. Pruning visual tokens after shallow LLM layers causes a large performance decline on localization benchmarks and a moderate decrease on TextVQA, whereas performance remains relatively unchanged on the other evaluated tasks.
1
0
0
@mark_endo1
Mark Endo
11 months
TLDR: We demonstrate that strong performance across many tasks is not due to an exceptional ability to compress visual information, but rather the benchmarks’ limited ability to assess fine-grained visual capabilities. Let’s dive into the findings! 🚀
1
0
1
@mark_endo1
Mark Endo
11 months
How does visual token pruning after early VLM layers maintain strong performance with reduced visual information? The answer may not be what you expect. ✨Introducing our new paper Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration 🧵👇
3
9
23