Siting Li Profile
Siting Li

@SitingLi627

Followers: 86
Following: 3
Media: 1
Statuses: 18

PhD student @uwcse

Joined October 2023
@SitingLi627
Siting Li
2 months
Excited to share that our paper "Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder" is accepted to #ACL2025! Preprint: Thank @SimonShaoleiDu and @PangWeiKoh so much for your support and guidance throughout the journey!
2
13
49
@SitingLi627
Siting Li
12 days
RT @thao_nguyen26: Web data, the "fossil fuel of AI", is being exhausted. What's next? 🤔 We propose Recycling the Web to break the data wall…
0
59
0
@SitingLi627
Siting Li
13 days
RT @avibose22: 🧠 Your LLM should model how you think, not reduce you to preassigned traits. 📢 Introducing LoRe: a low-rank reward modeling f…
0
26
0
@SitingLi627
Siting Li
23 days
RT @RulinShao: 🎉 Our Spurious Rewards is available on ArXiv! We added experiments on: - More prompts/steps/models/analysis - Spurious Prom…
0
40
0
@SitingLi627
Siting Li
27 days
RT @jcqln_h: LMs often output answers that sound right but aren't supported by input context. This is intrinsic hallucination: the generati…
0
18
0
@SitingLi627
Siting Li
1 month
RT @StellaLisy: 🤯 We cracked RLVR with… Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by: - Rando…
0
338
0
@SitingLi627
Siting Li
2 months
Inspiration for future research: (1) Effectively utilizing existing vision encoders offers benefits without pretraining new vision models. (2) Promptable image embeddings boost performance on fine-grained tasks. Feel free to contact me if you are interested in our work!
0
0
2
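A minimal sketch of what point (2) means by a promptable image embedding, contrasted with CLIP's prompt-agnostic one. The cross-attention step and all tensor shapes below are illustrative assumptions, not the paper's code.

```python
import torch

def clip_style_embedding(image_tokens: torch.Tensor) -> torch.Tensor:
    """Prompt-agnostic: one fixed vector per image (stand-in for CLIP's pooled [CLS])."""
    return image_tokens[:, 0]

def promptable_embedding(image_tokens: torch.Tensor, question_tokens: torch.Tensor) -> torch.Tensor:
    """Prompt-aware: the embedding changes with the question. A single cross-attention
    step is used here purely to illustrate question-conditioned pooling."""
    attn = torch.softmax(question_tokens @ image_tokens.transpose(1, 2), dim=-1)
    return (attn @ image_tokens).mean(dim=1)

# Toy shapes: 2 images with 1 [CLS] + 576 patch tokens, a 16-token question, dim 64.
img = torch.randn(2, 577, 64)
q = torch.randn(2, 16, 64)
print(clip_style_embedding(img).shape)     # torch.Size([2, 64])
print(promptable_embedding(img, q).shape)  # torch.Size([2, 64])
```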
@SitingLi627
Siting Li
2 months
Finding 7: The question in the prompt greatly helps the extraction and utilization of visual information. When we remove the question from the prompt while evaluating LLaVA-1.5-VLM2Vec, performance drops to CLIP's level.
1
0
2
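A toy illustration of the Finding 7 ablation. The prompt template below is my own placeholder, not the exact VLM2Vec instruction; the point is only that the question is part of the input the MLLM sees when producing the embedding, and the ablation drops it.

```python
from typing import Optional

def build_embedding_prompt(question: Optional[str]) -> str:
    """Build the instruction the MLLM embeds alongside the image (illustrative template)."""
    base = "<image>\nRepresent the given image for retrieval."
    if question is None:
        return base                             # ablation: no question in the prompt
    return f"{base}\nQuestion: {question}"      # full setting: the question guides extraction

print(build_embedding_prompt("Is the mug to the left or right of the laptop?"))
print(build_embedding_prompt(None))
```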
@SitingLi627
Siting Li
2 months
Finding 6: Text generation is not the only solution to fine-grained visual reasoning. We convert the LLaVA-1.5 model to a CLIP-like contrastive VLM using VLM2Vec. Surprisingly, it still performs well on What'sUp!
1
0
2
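To make Finding 6 concrete, here is a sketch of CLIP-style contrastive scoring with embeddings taken from a generative MLLM. As I understand the VLM2Vec recipe, the embedding is a hidden state of the final token; the random vectors below simply stand in for those embeddings.

```python
import torch
import torch.nn.functional as F

def contrastive_scores(image_side: torch.Tensor, text_side: torch.Tensor) -> torch.Tensor:
    """CLIP-like scoring: cosine similarity between image-side and text-side embeddings."""
    return F.normalize(image_side, dim=-1) @ F.normalize(text_side, dim=-1).T

# Stand-ins for MLLM embeddings (e.g., a final-token hidden state of LLaVA-1.5), dim 4096.
image_side = torch.randn(4, 4096)   # embeddings of 4 images (plus instruction)
text_side = torch.randn(4, 4096)    # embeddings of the 4 candidate captions
scores = contrastive_scores(image_side, text_side)
print(scores.argmax(dim=-1))        # index of the best-scoring caption per image
```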
@SitingLi627
Siting Li
2 months
Finding 5: A stronger text encoder does not suffice to solve the task. We replace the CLIP text encoder with the text encoder from LLM2CLIP, but the accuracy is still low.
1
0
2
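A hedged sketch of the swap in Finding 5: keep the CLIP image embeddings and score them against text embeddings from a stronger encoder (e.g., LLM2CLIP's) projected into CLIP's space. The linear adapter and the dimensions are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

clip_dim, llm_dim = 768, 4096                              # illustrative sizes
project = torch.nn.Linear(llm_dim, clip_dim, bias=False)   # assumed learned adapter

def score(clip_image_emb: torch.Tensor, llm_text_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarities between frozen CLIP image embeddings and projected text embeddings."""
    img = F.normalize(clip_image_emb, dim=-1)
    txt = F.normalize(project(llm_text_emb), dim=-1)
    return img @ txt.T

image_emb = torch.randn(2, clip_dim)      # stand-in for CLIP image features
text_emb = torch.randn(3, llm_dim)        # stand-in for the swapped text encoder's outputs
print(score(image_emb, text_emb).shape)   # torch.Size([2, 3])
```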
@SitingLi627
Siting Li
2 months
Finding 4: Using multiple text tokens similarly does not help. We use all text tokens of the CLIP text encoder, but observe no improvement.
1
0
2
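A sketch of the Finding 4 ablation: score an image against all text-token embeddings rather than CLIP's single pooled text vector. The max-over-tokens aggregation here is my assumption; the tweet does not specify how the tokens are combined.

```python
import torch
import torch.nn.functional as F

def pooled_score(img: torch.Tensor, txt_pooled: torch.Tensor) -> torch.Tensor:
    """Standard CLIP scoring: one pooled embedding per caption."""
    return F.normalize(img, dim=-1) @ F.normalize(txt_pooled, dim=-1).T

def all_token_score(img: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
    """All-token variant: keep every text-token embedding and take the best match per caption."""
    img_n = F.normalize(img, dim=-1)                  # (B, D)
    tok_n = F.normalize(txt_tokens, dim=-1)           # (T, L, D)
    sims = torch.einsum("bd,tld->btl", img_n, tok_n)  # (B, T, L)
    return sims.amax(dim=-1)                          # (B, T)

img = torch.randn(2, 512)
txt_pooled = torch.randn(3, 512)
txt_tokens = torch.randn(3, 77, 512)
print(pooled_score(img, txt_pooled).shape)     # torch.Size([2, 3])
print(all_token_score(img, txt_tokens).shape)  # torch.Size([2, 3])
```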
@SitingLi627
Siting Li
2 months
Finding 3: Part of the information comes from the order of the patch tokens. If we use all patch tokens and encode their order with RoPE when finetuning CLIP, the performance improves!
1
0
2
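To illustrate Finding 3, here is a minimal 1-D RoPE applied to the queries and keys of a single attention step over patch tokens, which is how rotary embeddings inject token order. The exact placement of RoPE in the paper's finetuning setup is not given in the tweet, so treat this as a generic sketch.

```python
import torch

def rope_1d(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate feature pairs by position-dependent angles (1-D rotary position embedding).
    x: (batch, num_tokens, dim), dim must be even."""
    _, n, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)         # (half,)
    angles = torch.arange(n, dtype=x.dtype)[:, None] * freqs[None, :]   # (n, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def attention_with_rope(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Single-head attention whose scores depend on patch positions via RoPE."""
    q, k = rope_1d(q), rope_1d(k)
    scores = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    return scores @ v

# Toy patch-token sequence: 2 images, 576 patches, dim 64.
tokens = torch.randn(2, 576, 64)
print(attention_with_rope(tokens, tokens, tokens).shape)  # torch.Size([2, 576, 64])
```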
@SitingLi627
Siting Li
2 months
Finding 2: Detailed information resides in the patch tokens. If trained with only the [CLS] token of the CLIP vision encoder, LLaVA-1.5's performance drops substantially.
1
0
2
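Finding 2 contrasts the [CLS] token with the patch tokens of the CLIP vision tower. The snippet below shows where each lives in the Hugging Face CLIPVisionModel output; note that LLaVA-1.5 actually takes patch features from a late intermediate layer, and the last layer is used here only for brevity.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

name = "openai/clip-vit-large-patch14-336"   # the vision tower used by LLaVA-1.5
model = CLIPVisionModel.from_pretrained(name)
processor = CLIPImageProcessor.from_pretrained(name)

image = Image.new("RGB", (336, 336))         # dummy image; replace with a real photo
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, 1 + 576, 1024)

cls_only = hidden[:, :1]       # the single [CLS] token (the "train on [CLS] only" ablation)
patch_tokens = hidden[:, 1:]   # 576 patch tokens carrying the fine-grained detail
print(cls_only.shape, patch_tokens.shape)
```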
@SitingLi627
Siting Li
2 months
Finding 1: Training data alone does not lead to stronger extraction ability. We finetune CLIP on converted LLaVA-1.5 data, but find no significant improvement.
1
0
2
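A sketch of the Finding 1 setup: standard CLIP-style contrastive (InfoNCE) finetuning on image-text pairs derived from LLaVA-1.5's data. The conversion of the instruction data into pairs is omitted, and the random tensors stand in for encoder outputs.

```python
import torch
import torch.nn.functional as F

def clip_infonce(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over an in-batch similarity matrix, as in CLIP training."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature
    targets = torch.arange(img.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

img_emb = torch.randn(8, 512, requires_grad=True)   # stand-in for CLIP image features
txt_emb = torch.randn(8, 512, requires_grad=True)   # stand-in for CLIP text features
loss = clip_infonce(img_emb, txt_emb)
loss.backward()
print(float(loss))
```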
@SitingLi627
Siting Li
2 months
This exploration originates from our observation that Generative MLLMs using the same, fixed CLIP vision encoder and weights achieve significantly higher accuracy than CLIP on many visual reasoning benchmarks such as Winoground, NaturalBench, MMVP, and What'sUp. Then we conducted a…
1
0
2
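For context on how that accuracy gap is measured on two-caption benchmarks such as What'sUp: CLIP picks the caption most similar to the image, while the MLLM answers in text and the answer is mapped back to an option. The scoring below is a generic sketch, not the benchmarks' exact evaluation code.

```python
import torch
import torch.nn.functional as F

def clip_choice(img_emb: torch.Tensor, cap_embs: torch.Tensor) -> int:
    """CLIP picks the caption with the higher cosine similarity to the image."""
    sims = F.normalize(img_emb, dim=-1) @ F.normalize(cap_embs, dim=-1).T
    return int(sims.argmax())

def mllm_choice(answer_text: str) -> int:
    """A generative MLLM answers in free text; map the answer back to an option index."""
    return 0 if "left" in answer_text.lower() else 1

img_emb = torch.randn(1, 512)
cap_embs = torch.randn(2, 512)   # e.g., ["mug left of laptop", "mug right of laptop"]
print(clip_choice(img_emb, cap_embs))
print(mllm_choice("The mug is to the left of the laptop."))
```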
@SitingLi627
Siting Li
2 months
We found that the CLIP vision encoder is more powerful than you might think if you extract and utilize its embedded visual information more effectively, e.g., by using a Generative MLLM as the extractor! Wondering how and why? Check out the details below:
1
0
2