Siting Li Profile
Siting Li

@SitingLi627

Followers: 86
Following: 3
Media: 1
Statuses: 18

PhD student @uwcse

Joined October 2023
@SitingLi627
Siting Li
2 months
Excited to share that our paper "Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder" is accepted to #ACL2025! Preprint: Thank @SimonShaoleiDu and @PangWeiKoh so much for your support and guidance throughout the journey!
2
13
49
@SitingLi627
Siting Li
12 days
RT @thao_nguyen26: Web data, the "fossil fuel of AI", is being exhausted. What's next? 🤔 We propose Recycling the Web to break the data wall…
0
59
0
@SitingLi627
Siting Li
13 days
RT @avibose22: 🧠 Your LLM should model how you think, not reduce you to preassigned traits. 📢 Introducing LoRe: a low-rank reward modeling f…
0
26
0
@SitingLi627
Siting Li
23 days
RT @RulinShao: 🎉 Our Spurious Rewards is available on ArXiv! We added experiments on: - More prompts/steps/models/analysis - Spurious Prom…
0
40
0
@SitingLi627
Siting Li
27 days
RT @jcqln_h: LMs often output answers that sound right but aren't supported by input context. This is intrinsic hallucination: the generati…
0
18
0
@SitingLi627
Siting Li
1 month
RT @StellaLisy: 🤯 We cracked RLVR with… Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by: - Rando…
0
338
0
@SitingLi627
Siting Li
2 months
Inspiration for future research: (1) Effectively utilizing existing vision encoders offers benefits without pretraining new vision models. (2) Promptable image embeddings boost performance on fine-grained tasks. Feel free to contact me if you are interested in our work!
0
0
2
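A minimal sketch of what point (2) means by a promptable image embedding, contrasted with CLIP's prompt-agnostic one. The cross-attention step and all tensor shapes below are illustrative assumptions, not the paper's code.

```python
import torch

def clip_style_embedding(image_tokens: torch.Tensor) -> torch.Tensor:
    """Prompt-agnostic: one fixed vector per image (stand-in for CLIP's pooled [CLS])."""
    return image_tokens[:, 0]

def promptable_embedding(image_tokens: torch.Tensor, question_tokens: torch.Tensor) -> torch.Tensor:
    """Prompt-aware: the embedding changes with the question. A single cross-attention
    step is used here purely to illustrate question-conditioned pooling."""
    attn = torch.softmax(question_tokens @ image_tokens.transpose(1, 2), dim=-1)
    return (attn @ image_tokens).mean(dim=1)

# Toy shapes: 2 images with 1 [CLS] + 576 patch tokens, a 16-token question, dim 64.
img = torch.randn(2, 577, 64)
q = torch.randn(2, 16, 64)
print(clip_style_embedding(img).shape)     # torch.Size([2, 64])
print(promptable_embedding(img, q).shape)  # torch.Size([2, 64])
```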
@SitingLi627
Siting Li
2 months
Finding 7: The question in the prompt greatly helps the extraction and utilization of visual information. When we remove the question from the prompt while evaluating LLaVA-1.5-VLM2Vec, performance drops to CLIP's level.
1
0
2
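A toy illustration of the Finding 7 ablation. The prompt template below is my own placeholder, not the exact VLM2Vec instruction; the point is only that the question is part of the input the MLLM sees when producing the embedding, and the ablation drops it.

```python
from typing import Optional

def build_embedding_prompt(question: Optional[str]) -> str:
    """Build the instruction the MLLM embeds alongside the image (illustrative template)."""
    base = "<image>\nRepresent the given image for retrieval."
    if question is None:
        return base                             # ablation: no question in the prompt
    return f"{base}\nQuestion: {question}"      # full setting: the question guides extraction

print(build_embedding_prompt("Is the mug to the left or right of the laptop?"))
print(build_embedding_prompt(None))
```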
@SitingLi627
Siting Li
2 months
Finding 6: Text generation is not the only solution to fine-grained visual reasoning. We convert the LLaVA-1.5 model to a CLIP-like contrastive VLM using VLM2Vec. Surprisingly, it still performs well on What'sUp!
1
0
2
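To make Finding 6 concrete, here is a sketch of CLIP-style contrastive scoring with embeddings taken from a generative MLLM. As I understand the VLM2Vec recipe, the embedding is a hidden state of the final token; the random vectors below simply stand in for those embeddings.

```python
import torch
import torch.nn.functional as F

def contrastive_scores(image_side: torch.Tensor, text_side: torch.Tensor) -> torch.Tensor:
    """CLIP-like scoring: cosine similarity between image-side and text-side embeddings."""
    return F.normalize(image_side, dim=-1) @ F.normalize(text_side, dim=-1).T

# Stand-ins for MLLM embeddings (e.g., a final-token hidden state of LLaVA-1.5), dim 4096.
image_side = torch.randn(4, 4096)   # embeddings of 4 images (plus instruction)
text_side = torch.randn(4, 4096)    # embeddings of the 4 candidate captions
scores = contrastive_scores(image_side, text_side)
print(scores.argmax(dim=-1))        # index of the best-scoring caption per image
```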
@SitingLi627
Siting Li
2 months
Finding 5: A stronger text encoder does not suffice to solve the task. We replace the CLIP text encoder with the text encoder from LLM2CLIP, but the accuracy is still low.
1
0
2
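A hedged sketch of the swap in Finding 5: keep the CLIP image embeddings and score them against text embeddings from a stronger encoder (e.g., LLM2CLIP's) projected into CLIP's space. The linear adapter and the dimensions are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

clip_dim, llm_dim = 768, 4096                              # illustrative sizes
project = torch.nn.Linear(llm_dim, clip_dim, bias=False)   # assumed learned adapter

def score(clip_image_emb: torch.Tensor, llm_text_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarities between frozen CLIP image embeddings and projected text embeddings."""
    img = F.normalize(clip_image_emb, dim=-1)
    txt = F.normalize(project(llm_text_emb), dim=-1)
    return img @ txt.T

image_emb = torch.randn(2, clip_dim)      # stand-in for CLIP image features
text_emb = torch.randn(3, llm_dim)        # stand-in for the swapped text encoder's outputs
print(score(image_emb, text_emb).shape)   # torch.Size([2, 3])
```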
@SitingLi627
Siting Li
2 months
Finding 4: Using multiple text tokens similarly does not help. We use all text tokens of the CLIP text encoder, but observe no improvement.
1
0
2
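A sketch of the Finding 4 ablation: score an image against all text-token embeddings rather than CLIP's single pooled text vector. The max-over-tokens aggregation here is my assumption; the tweet does not specify how the tokens are combined.

```python
import torch
import torch.nn.functional as F

def pooled_score(img: torch.Tensor, txt_pooled: torch.Tensor) -> torch.Tensor:
    """Standard CLIP scoring: one pooled embedding per caption."""
    return F.normalize(img, dim=-1) @ F.normalize(txt_pooled, dim=-1).T

def all_token_score(img: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
    """All-token variant: keep every text-token embedding and take the best match per caption."""
    img_n = F.normalize(img, dim=-1)                  # (B, D)
    tok_n = F.normalize(txt_tokens, dim=-1)           # (T, L, D)
    sims = torch.einsum("bd,tld->btl", img_n, tok_n)  # (B, T, L)
    return sims.amax(dim=-1)                          # (B, T)

img = torch.randn(2, 512)
txt_pooled = torch.randn(3, 512)
txt_tokens = torch.randn(3, 77, 512)
print(pooled_score(img, txt_pooled).shape)     # torch.Size([2, 3])
print(all_token_score(img, txt_tokens).shape)  # torch.Size([2, 3])
```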
@SitingLi627
Siting Li
2 months
Finding 3: Part of the information comes from the order of the patch tokens. If we use all patch tokens and encode their order with RoPE when finetuning CLIP, the performance improves!
1
0
2
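To illustrate Finding 3, here is a minimal 1-D RoPE applied to the queries and keys of a single attention step over patch tokens, which is how rotary embeddings inject token order. The exact placement of RoPE in the paper's finetuning setup is not given in the tweet, so treat this as a generic sketch.

```python
import torch

def rope_1d(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate feature pairs by position-dependent angles (1-D rotary position embedding).
    x: (batch, num_tokens, dim), dim must be even."""
    _, n, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)         # (half,)
    angles = torch.arange(n, dtype=x.dtype)[:, None] * freqs[None, :]   # (n, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def attention_with_rope(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Single-head attention whose scores depend on patch positions via RoPE."""
    q, k = rope_1d(q), rope_1d(k)
    scores = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    return scores @ v

# Toy patch-token sequence: 2 images, 576 patches, dim 64.
tokens = torch.randn(2, 576, 64)
print(attention_with_rope(tokens, tokens, tokens).shape)  # torch.Size([2, 576, 64])
```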
@SitingLi627
Siting Li
2 months
Finding 2: Detailed information resides in the patch tokens. If trained with only the [CLS] token of the CLIP vision encoder, LLaVA-1.5's performance drops substantially.
1
0
2
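Finding 2 contrasts the [CLS] token with the patch tokens of the CLIP vision tower. The snippet below shows where each lives in the Hugging Face CLIPVisionModel output; note that LLaVA-1.5 actually takes patch features from a late intermediate layer, and the last layer is used here only for brevity.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

name = "openai/clip-vit-large-patch14-336"   # the vision tower used by LLaVA-1.5
model = CLIPVisionModel.from_pretrained(name)
processor = CLIPImageProcessor.from_pretrained(name)

image = Image.new("RGB", (336, 336))         # dummy image; replace with a real photo
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, 1 + 576, 1024)

cls_only = hidden[:, :1]       # the single [CLS] token (the "train on [CLS] only" ablation)
patch_tokens = hidden[:, 1:]   # 576 patch tokens carrying the fine-grained detail
print(cls_only.shape, patch_tokens.shape)
```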
@SitingLi627
Siting Li
2 months
Finding 1: Training data alone does not lead to stronger extraction ability. We finetune CLIP on converted LLaVA-1.5 data, but find no significant improvement.
1
0
2
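A sketch of the Finding 1 setup: standard CLIP-style contrastive (InfoNCE) finetuning on image-text pairs derived from LLaVA-1.5's data. The conversion of the instruction data into pairs is omitted, and the random tensors stand in for encoder outputs.

```python
import torch
import torch.nn.functional as F

def clip_infonce(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over an in-batch similarity matrix, as in CLIP training."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature
    targets = torch.arange(img.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

img_emb = torch.randn(8, 512, requires_grad=True)   # stand-in for CLIP image features
txt_emb = torch.randn(8, 512, requires_grad=True)   # stand-in for CLIP text features
loss = clip_infonce(img_emb, txt_emb)
loss.backward()
print(float(loss))
```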
@SitingLi627
Siting Li
2 months
This exploration originates from our observation that Generative MLLMs using the same, fixed CLIP vision encoder and weights achieve significantly higher accuracy than CLIP on many visual reasoning benchmarks such as Winoground, NaturalBench, MMVP, and What'sUp. Then we conducted a…
1
0
2
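For context on how that accuracy gap is measured on two-caption benchmarks such as What'sUp: CLIP picks the caption most similar to the image, while the MLLM answers in text and the answer is mapped back to an option. The scoring below is a generic sketch, not the benchmarks' exact evaluation code.

```python
import torch
import torch.nn.functional as F

def clip_choice(img_emb: torch.Tensor, cap_embs: torch.Tensor) -> int:
    """CLIP picks the caption with the higher cosine similarity to the image."""
    sims = F.normalize(img_emb, dim=-1) @ F.normalize(cap_embs, dim=-1).T
    return int(sims.argmax())

def mllm_choice(answer_text: str) -> int:
    """A generative MLLM answers in free text; map the answer back to an option index."""
    return 0 if "left" in answer_text.lower() else 1

img_emb = torch.randn(1, 512)
cap_embs = torch.randn(2, 512)   # e.g., ["mug left of laptop", "mug right of laptop"]
print(clip_choice(img_emb, cap_embs))
print(mllm_choice("The mug is to the left of the laptop."))
```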
@SitingLi627
Siting Li
2 months
We found that the CLIP vision encoder is more powerful than you might think if you extract and utilize its embedded visual information more effectively, e.g., by using a Generative MLLM as the extractor! Wondering how and why? Check out the details below:
1
0
2