Mark Endo
@mark_endo1
Followers: 150 · Following: 235 · Media: 26 · Statuses: 56
Computer Science PhD student @Stanford | AI + Health
Stanford, CA
Joined August 2016
Work done together with the fantastic @yeung_levy at @StanfordAILab! Read our paper here: https://t.co/Xb2MBLtdSl Project website: https://t.co/L4Arg3vSBd Code:
Our final two-stage approach, Extract+Think, demonstrates extreme parameter and data efficiency, improving over LLaVA-OneVision while using 95% fewer visual training samples.
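For concreteness, a minimal sketch of what a two-stage Extract+Think inference loop could look like; `query_vlm`, the prompt wording, and the stage split are illustrative assumptions, not the paper's exact templates or code.

```python
# Minimal sketch of a two-stage Extract+Think style inference loop.
# `query_vlm` is a hypothetical stand-in for a call to a small multimodal
# model; the prompts are illustrative, not the paper's templates.

def query_vlm(image, prompt: str) -> str:
    """Placeholder for a call to a small multimodal model."""
    return "<model output>"

def extract_then_think(image, question: str) -> str:
    # Stage 1 (Extract): pull out the instruction-relevant visual details.
    details = query_vlm(
        image,
        f"List the visual details in the image that are relevant to: {question}",
    )
    # Stage 2 (Think): reason step by step over the extracted details.
    return query_vlm(
        image,
        f"Visual details: {details}\n"
        f"Question: {question}\n"
        "Think step by step over the details above, then give the final answer.",
    )
```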
We then enhance reasoning by applying step-by-step thinking over the extracted visual details, substantially improving performance without requiring any additional supervision on visual data.
To address the critical perception bottleneck, we propose visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks. We find that this approach dramatically reduces the perception bottleneck.
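To give a rough sense of what extraction tuning data could look like, here is a hypothetical sample format where the training target is the instruction-relevant details themselves; the field names and example content are assumptions, not the released data format.

```python
# Hypothetical formatting of a visual extraction tuning sample: the target is
# the set of instruction-relevant visual details rather than the final answer.
# Field names and the example content are placeholders for illustration.

def to_extraction_sample(image_path: str, instruction: str, relevant_details: str) -> dict:
    return {
        "image": image_path,
        "prompt": (
            f"{instruction}\n"
            "Extract the visual details relevant to this instruction."
        ),
        "target": relevant_details,  # supervision: the relevant details themselves
    }

sample = to_extraction_sample(
    "example_chart.png",                                  # placeholder image
    "Which year has the highest value in the chart?",     # placeholder instruction
    "Bars for 2019-2023; the 2022 bar is the tallest.",   # placeholder details
)
```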
Next, we ask whether this behavior stems only from the expected drop in visual reasoning or also from a more fundamental impairment of perception. A decoupled analysis shows that while downscaling impairs reasoning as expected, it also strongly degrades perception across numerous tasks.
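One way to picture such a decoupled analysis (an illustrative protocol, not necessarily the paper's): score the model once end to end from pixels and once with ground-truth visual facts supplied as text, so the cost attributable to perception can be separated from reasoning.

```python
# Illustrative perception/reasoning decoupling (an assumed protocol, not the
# paper's exact setup): compare accuracy from pixels vs. accuracy when the
# ground-truth visual facts are provided as text.

def accuracy(predictions, answers):
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

def decoupled_report(preds_from_pixels, preds_from_oracle_text, answers):
    end_to_end = accuracy(preds_from_pixels, answers)            # perception + reasoning
    reasoning_only = accuracy(preds_from_oracle_text, answers)   # reasoning with perfect perception
    return {
        "end_to_end": end_to_end,
        "reasoning_given_oracle_perception": reasoning_only,
        "perception_gap": reasoning_only - end_to_end,           # cost attributable to perception
    }
```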
First, we examine how LLM downscaling impacts performance across diverse visual instruction tuning tasks. We find a striking pattern: rather than primarily affecting tasks that rely heavily on the base LLM (e.g., general, knowledge), downscaling is most detrimental to vision-centric capabilities.
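A small helper for this kind of breakdown, assuming per-benchmark scores and a benchmark-to-category mapping; the names and numbers below are placeholders, not results from the paper.

```python
# Group per-benchmark score changes under LLM downscaling by task category.
# Benchmark names and scores are placeholders, not results from the paper.
from collections import defaultdict

def delta_by_category(scores_large, scores_small, categories):
    deltas = defaultdict(list)
    for bench, cat in categories.items():
        deltas[cat].append(scores_small[bench] - scores_large[bench])
    return {cat: sum(v) / len(v) for cat, v in deltas.items()}

categories = {"bench_a": "vision-centric", "bench_b": "knowledge"}   # placeholder mapping
print(delta_by_category({"bench_a": 60.0, "bench_b": 70.0},          # larger LLM
                        {"bench_a": 45.0, "bench_b": 68.0},          # smaller LLM
                        categories))
```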
TLDR: Vision-centric capabilities degrade under LLM downscaling, driven by both a reasoning decline and a central perception bottleneck. We tackle this with a two-stage framework, featuring a new visual extraction tuning method alongside step-by-step visual reasoning. Findings⬇️
Thinking about using small multimodal models? Want a clearer understanding of what breaks when downscaling model size, and why? ✨Introducing our new work on Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models 🧵👇
There’s growing excitement around VLMs and their potential to transform surgery 🏥, but where exactly are we on the path to AI-assisted surgical procedures? In our latest work, we systematically evaluated leading VLMs across major surgical tasks where AI is gaining traction...🧵
🚨Large video-language models like LLaVA-Video can do single-video tasks. But can they compare videos? Imagine you’re learning a sports skill like kicking: can an AI tell how your kick differs from an expert video? 🚀 Introducing "Video Action Differencing" (VidDiff), ICLR 2025 🧵
Work done together with the amazing @XiaohanWang96 and @yeung_levy at @StanfordAILab! Read our paper here: https://t.co/Yv1TVRn97u Project website:
Strikingly, our approach achieves this performance improvement while only retaining 3.3% of visual tokens for the second half of LLM layers.
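For scale, a back-of-envelope on the 3.3% figure, assuming a 24x24 visual token grid (576 tokens, common in LLaVA-style models; an assumption here, not a number from the thread):

```python
# Back-of-envelope: with an assumed 24x24 visual token grid (576 tokens),
# retaining 3.3% leaves only ~19 visual tokens for the later LLM layers.
num_visual_tokens = 24 * 24
kept = round(num_visual_tokens * 0.033)
print(num_visual_tokens, "->", kept)  # 576 -> 19
```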
We find that FEATHER results in substantial performance gains compared to the original acceleration approach, improving localization performance by more than 5x with comparable computational savings.
Guided by our insights, we propose FEATHER: a simple approach that resolves the identified issue with early-layer pruning, incorporates uniform sampling, and employs two-stage pruning to balance the increased speedup of early-layer pruning with enhanced token selection at a later layer.
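A rough sketch of the two-stage idea under stated assumptions (not the official FEATHER code): mix criterion-ranked tokens with uniformly sampled ones at an early layer, then prune more aggressively at a later layer where the criterion is more reliable.

```python
import torch

# Rough two-stage pruning sketch (not the official FEATHER implementation):
# keep a mix of criterion-ranked and uniformly sampled visual tokens early,
# then re-prune with recomputed scores at a later layer.

def prune_indices(scores: torch.Tensor, keep_ratio: float, uniform_frac: float) -> torch.Tensor:
    n = scores.numel()
    k = max(1, int(n * keep_ratio))
    k_uniform = max(1, int(k * uniform_frac))
    uniform_idx = torch.linspace(0, n - 1, steps=k_uniform).long()  # evenly spaced tokens
    k_top = max(0, k - k_uniform)
    top_idx = scores.topk(k_top).indices if k_top > 0 else torch.empty(0, dtype=torch.long)
    return torch.unique(torch.cat([uniform_idx, top_idx]))  # duplicates collapse; count may be < k

# Stage 1: light pruning at an early layer for speedup.
scores_early = torch.rand(576)                     # placeholder criterion scores
stage1 = prune_indices(scores_early, keep_ratio=0.5, uniform_frac=0.5)
# Stage 2: stronger pruning at a later layer, where the criterion is more reliable.
scores_late = torch.rand(stage1.numel())           # scores recomputed at the later layer
stage2 = stage1[prune_indices(scores_late, keep_ratio=0.1, uniform_frac=0.5)]
```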
We then propose and evaluate alternative pruning criteria. 💡Insights: (1) removing the tendency to select bottom tokens improves early pruning, (2) later pruning still boosts performance, and (3) adding uniform token sampling enhances early-layer pruning.
3️⃣Explaining performance on other tasks. The majority of evaluated benchmarks do not require fine-grained visual grounding, as they can often be answered using only visual tokens located toward the bottom of the image even when pruning before the language model.
2️⃣Interpreting poor vision-centric task performance. The pruning criterion, when applied after shallow layers, predominantly selects visual tokens from the bottom part of the image.
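A quick way to quantify that bias, assuming the kept-token indices come from an H x W grid in row-major order (an assumption about the layout):

```python
# Fraction of kept visual tokens falling in the bottom half of an HxW grid,
# assuming row-major token ordering.

def bottom_half_fraction(kept_indices, grid_h: int, grid_w: int) -> float:
    rows = [i // grid_w for i in kept_indices]
    return sum(r >= grid_h // 2 for r in rows) / len(rows)

# A value near 1.0 for a 24x24 grid would indicate the bottom-of-image bias.
print(bottom_half_fraction([500, 510, 560, 570], grid_h=24, grid_w=24))  # 1.0
```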
1️⃣Early token pruning falters in vision-centric tasks. Pruning visual tokens after shallow LLM layers results in a large performance decline on localization benchmarks and a moderate decrease on TextVQA, whereas performance remains relatively unchanged on other evaluated tasks.
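For context, a rough sketch of the kind of shallow-layer pruning criterion being revisited here: rank visual tokens by the attention they receive from the last token at an early layer and keep only the top fraction. The shapes, token layout, and criterion details are assumptions for illustration, not the exact accelerated method evaluated in the paper.

```python
import torch

# Sketch of an attention-based early-pruning criterion (illustrative only):
# after a shallow LLM layer, rank visual tokens by the attention they receive
# from the last token and keep the top fraction.

def prune_after_layer(attn: torch.Tensor, visual_slice: slice, keep_ratio: float) -> torch.Tensor:
    """attn: [num_heads, seq_len, seq_len] attention weights at the chosen layer."""
    last_token_attn = attn.mean(0)[-1]            # head-averaged attention from the last token
    vis_scores = last_token_attn[visual_slice]    # restrict to the visual token span
    k = max(1, int(vis_scores.numel() * keep_ratio))
    keep = vis_scores.topk(k).indices + visual_slice.start
    return keep.sort().values                     # kept visual-token positions in the sequence

# Placeholder usage: 576 visual tokens occupying sequence positions 35-610.
attn = torch.rand(32, 700, 700).softmax(-1)
kept = prune_after_layer(attn, slice(35, 611), keep_ratio=0.25)
```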
TLDR: We demonstrate that strong performance across many tasks is not due to an exceptional ability to compress visual information, but rather to the benchmarks’ limited ability to assess fine-grained visual capabilities. Let’s dive into the findings! 🚀
How does visual token pruning after early VLM layers maintain strong performance with reduced visual information? The answer may not be what you expect. ✨Introducing our new paper Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration 🧵👇