Michal Golovanevsky
@MichalGolov
Followers: 59 · Following: 21 · Media: 7 · Statuses: 29
CS PhD student @BrownCSDept | Multimodal Learning | Mechanistic Interpretability | Clinical Deep Learning.
Providence, RI
Joined September 2022
Our panel "Evaluating Interpretability Methods: Challenges and Future Directions", moderated by @dana_arad4, has just started! Come hear the takes of @michaelwhanna, @MichalGolov, Nicolò Brunello, and @mingyang2666!
How do VLMs balance visual information presented in-context with linguistic priors encoded in-weights? In this project, @MichalGolov and @WilliamRudmanjr find out! My favorite result: you can find a vector that shifts attention to image tokens and changes the VLM's response!
When vision-language models answer questions, are they truly analyzing the image or relying on memorized facts? We introduce Pixels vs. Priors (PvP), a method to control whether VLMs respond based on input pixels or world knowledge priors. [1/5]
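That attention-shifting vector invites a quick sketch. A minimal, hypothetical version of the intervention (not the released PvP code): register a forward hook on one decoder layer and add a scaled steering vector to its hidden states before generation; the module path, layer index, scale, and steer_vec are all placeholders.

```python
import torch

def add_steering_hook(layer, steer_vec, alpha=4.0):
    """Add alpha * steer_vec to every hidden state this layer emits."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steer_vec.to(hidden.device, hidden.dtype)
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Hypothetical usage with a HuggingFace-style VLM (module path is a placeholder):
# layer = model.language_model.model.layers[14]
# handle = add_steering_hook(layer, steer_vec)   # steer_vec: (hidden_dim,) tensor
# output = model.generate(**inputs, max_new_tokens=10)
# handle.remove()
```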
Models rely on memorized priors early in their processing but shift toward visual evidence in mid-to-late layers. This shows a competition between visual input and stored knowledge, with pixels often overriding priors at the final prediction. [3/5]
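A logit-lens style probe makes that trajectory visible. Sketch under assumptions (a Hugging Face-style causal LM, single-token answers, final layer norm skipped): decode each layer's last-position hidden state with the output embedding and ask which candidate answer currently wins.

```python
import torch

@torch.no_grad()
def per_layer_preference(model, tokenizer, hidden_states, prior=" red", counterfactual=" blue"):
    """Logit-lens style probe: project each layer's last-position hidden state
    onto the vocabulary and see which candidate answer has the larger logit.
    (Final layer norm is skipped for brevity, so treat the result qualitatively.)"""
    prior_id = tokenizer(prior, add_special_tokens=False).input_ids[0]
    cf_id = tokenizer(counterfactual, add_special_tokens=False).input_ids[0]
    unembed = model.get_output_embeddings().weight            # (vocab, d_model)
    prefs = []
    for h in hidden_states:                                   # one (1, seq, d_model) per layer
        logits = h[0, -1] @ unembed.T
        prefs.append("pixels" if logits[cf_id] > logits[prior_id] else "prior")
    return prefs

# hidden_states = model(**inputs, output_hidden_states=True).hidden_states
```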
We create Visual CounterFact: a dataset of realistic images that contrast pixel evidence against memorized knowledge. We edit visual attributes to create counterfactual images (a blue strawberry) that directly contradict typical associations (strawberries are red). [2/5]
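For illustration only (this is not the paper's editing pipeline): given an object mask, one simple way to build such a counterfactual is to tint the masked pixels toward a target color while keeping the shading, e.g. turning a red strawberry blue. File paths and the target color below are placeholders.

```python
import numpy as np
from PIL import Image

def recolor(image_path, mask_path, target_rgb=(40, 60, 200)):
    """Toy counterfactual edit: push the masked object's pixels toward a target
    color while preserving luminance, so shape and shading stay realistic."""
    img = np.asarray(Image.open(image_path).convert("RGB")).astype(np.float32)
    mask = np.asarray(Image.open(mask_path).convert("L")).astype(np.float32) / 255.0
    lum = img.mean(axis=-1, keepdims=True) / 255.0            # keep shading
    target = np.array(target_rgb, dtype=np.float32) * lum     # tint by luminance
    out = img * (1 - mask[..., None]) + target * mask[..., None]
    return Image.fromarray(out.clip(0, 255).astype(np.uint8))
```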
With PvP, we can shift 92.5% of color predictions and 74.6% of size predictions from memorized priors to counterfactual answers. Code: https://t.co/OPkYfEl5Qz HuggingFace Dataset: mgolov/Visual-Counterfact [5/5]
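A plausible way to compute that kind of number, sketched as code (the thread does not spell out the exact metric, so this definition is an assumption): among the examples the un-steered model answers with the memorized prior, count how many flip to the counterfactual answer once PvP steering is applied.

```python
def shift_rate(baseline_preds, steered_preds, prior_answers, counterfactual_answers):
    """Of the examples the model originally answers with the memorized prior,
    what fraction flip to the counterfactual (pixel-grounded) answer after steering?"""
    flipped = total = 0
    for base, steered, prior, cf in zip(baseline_preds, steered_preds,
                                        prior_answers, counterfactual_answers):
        if base.strip().lower() == prior.strip().lower():
            total += 1
            flipped += steered.strip().lower() == cf.strip().lower()
    return flipped / max(total, 1)
```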
When vision-language models answer questions, are they truly analyzing the image or relying on memorized facts? We introduce Pixels vs. Priors (PvP), a method to control whether VLMs respond based on input pixels or world knowledge priors. [1/5]
Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts.
Code: https://t.co/Ss1wJTi1au, Paper: https://t.co/aHJqlW2N1X, Hugging Face dataset: https://t.co/uqOT1q6vpb [6/6]
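Loading the released data should be a one-liner with the datasets library; the dataset id comes from the thread above, while split and column names are whatever the Hub provides and are not assumed here.

```python
from datasets import load_dataset

ds = load_dataset("mgolov/Visual-Counterfact")   # id from the thread above
print(ds)                                        # inspect available splits and columns
split = next(iter(ds.values()))
print({k: type(v).__name__ for k, v in split[0].items()})
```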
Our findings serve as a call to action for the MLLM research community. Despite training on specialized datasets, the concept of “sides” has not emerged in MLLMs. [5/6]
We take a step forward with Visually-Cued Chain-of-Thought prompting. While annotations alone do not enhance visual reasoning, combining them with CoT prompting boosts GPT-4o's side counting accuracy on novel shapes from 7% to 93% and improves MathVerse performance by ~7%. [4/6]
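An illustrative guess at what a visually-cued CoT query could look like (the annotation scheme, wording, and API call are assumptions, not the paper's exact prompt): the image has its vertices visibly numbered, and the text walks the model through those cues before asking for the side count.

```python
# Illustrative prompt assembly for visually-cued chain-of-thought prompting.
VC_COT_PROMPT = (
    "The vertices of the shape in the image are labeled with numbers.\n"
    "Let's think step by step:\n"
    "1. List every numbered vertex you can see.\n"
    "2. Count the listed vertices.\n"
    "3. A polygon has as many sides as vertices, so report that count.\n"
    "Question: How many sides does this shape have?"
)

# Hypothetical call against a multimodal chat API:
# messages = [{"role": "user", "content": [
#     {"type": "image_url", "image_url": {"url": annotated_image_url}},
#     {"type": "text", "text": VC_COT_PROMPT},
# ]}]
# response = client.chat.completions.create(model="gpt-4o", messages=messages)
```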
In contrast, we find that vision encoders are shape-blind, mapping distinct shapes to the same region in embedding space. As a result, MLLMs struggle to identify and count the sides of pentagons, heptagons, and octagons. [3/6]
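The flavor of that check is easy to reproduce, assuming a stock CLIP vision encoder and locally saved polygon renders (file names are placeholders): embed the shapes and look at their pairwise cosine similarities.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder file names for rendered regular polygons.
images = [Image.open(p) for p in ["pentagon.png", "heptagon.png", "octagon.png"]]
inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    emb = model.get_image_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)
print(emb @ emb.T)   # near-1 off-diagonal values = distinct shapes collapse together
```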
Where does the failure of MLLMs occur? We show that the underlying LLMs answer geometric property questions with 100% accuracy, e.g., Q: “How many sides does a heptagon have?” A: “7”. [2/6]
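The text-only control is equally short to sketch; generate_fn is a hypothetical wrapper around whichever underlying LLM is being probed, and the prompt wording is illustrative.

```python
# Text-only probe of the underlying LLM (no image involved): side counts of
# regular polygons are stored in the weights and recalled perfectly.
POLYGONS = {"triangle": 3, "pentagon": 5, "heptagon": 7, "octagon": 8}

def geometric_property_accuracy(generate_fn):
    correct = 0
    for name, sides in POLYGONS.items():
        answer = generate_fn(f"How many sides does a {name} have? Answer with a number.")
        correct += str(sides) in answer
    return correct / len(POLYGONS)
```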
If SOTA models fail to recognize simple shapes, should we be evaluating them on complex geometric tasks? Most MLLMs struggle to count the sides of regular polygons, and all MLLMs score 0% on novel shapes. @WilliamRudmanjr @_amirbar @vedantpalit1008 [1/6]
NOTICE uses Symmetric Token Replacement for text corruption and Semantic Image Pairs (SIP) for image corruption. SIP replaces clean images with ones differing in a single semantic property, such as object or emotion, enabling meaningful causal mediation analysis of VLMs. [3/5]
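A rough sketch of the causal-mediation step this enables (not the released NOTICE code; the module path is a placeholder for a LLaVA-style model, and clean/corrupted inputs must share a sequence length): cache one attention head's output on the clean pair, patch it into the run on the STR- or SIP-corrupted pair, and read off the answer logit.

```python
import torch

@torch.no_grad()
def head_patching_effect(model, clean_inputs, corrupt_inputs, layer, head, head_dim, answer_id):
    """Patch a single attention head's clean output into the corrupted run and
    return the clean answer's logit at the final position."""
    o_proj = model.language_model.model.layers[layer].self_attn.o_proj  # placeholder path
    sl = slice(head * head_dim, (head + 1) * head_dim)
    cache = {}

    def save(module, args):                # pre-hook: args[0] is the concatenation
        cache["clean"] = args[0].detach()  # of all per-head outputs before o_proj

    def patch(module, args):
        x = args[0].clone()
        x[..., sl] = cache["clean"][..., sl]
        return (x,)

    h = o_proj.register_forward_pre_hook(save); model(**clean_inputs); h.remove()
    h = o_proj.register_forward_pre_hook(patch)
    logits = model(**corrupt_inputs).logits[0, -1]
    h.remove()
    return logits[answer_id].item()
```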
We extend the generalizability of NOTICE by using Stable Diffusion to generate semantic image pairs and find the results are nearly identical to those from curated semantic image pairs. [4/5]
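Generating such a pair with the diffusers library might look like this; the model id, seed handling, and prompts are illustrative, with the two prompts differing in exactly one semantic property (the object).

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

gen = torch.Generator("cuda").manual_seed(0)       # same seed -> roughly aligned layouts
clean = pipe("a photo of a dog in a park", generator=gen).images[0]
gen = torch.Generator("cuda").manual_seed(0)
corrupt = pipe("a photo of a cat in a park", generator=gen).images[0]
clean.save("sip_clean.png"); corrupt.save("sip_corrupt.png")
```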
The finding that important attention heads implement one of a small set of interpretable functions boosts transparency and trust in VLMs. @MichalGolov @vedantpalit1008 #nlp #mechinterp Paper: https://t.co/baxqkEXrxZ GitHub: https://t.co/GLMpam48wH [5/5]
How do VLMs like BLIP and LLaVA differ in how they process visual information? Using our mech-interp pipeline for VLMs, NOTICE, we first show important cross-attention heads in BLIP can perform image grounding, whereas important self-attention heads in LLaVA do not. [1/5]
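A toy version of that grounding measurement (not NOTICE's exact criterion): for the text token naming the object, compute how much of each cross-attention head's mass lands on the image patches inside the object's bounding box.

```python
import torch

def grounding_score(cross_attn, text_pos, object_patch_ids):
    """Per-head image-grounding score. cross_attn has shape
    (heads, text_len, num_image_patches); object_patch_ids are the patch indices
    covered by the object's bounding box."""
    attn = cross_attn[:, text_pos, :]                       # (heads, patches)
    on_object = attn[:, object_patch_ids].sum(dim=-1)
    return on_object / attn.sum(dim=-1).clamp_min(1e-9)     # (heads,)

# Heads with a high score attend to the object's patches when the question
# mentions that object, which is the grounding behavior described above.
```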
Instead, LLaVA relies on self-attention heads to manage “outlier” attention patterns in the image, focusing on regulating these outliers. Interestingly, some of BLIP's attention heads are also dedicated to reducing attention to outlier features. [2/5]
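One hypothetical way to quantify that behavior: call the highest-norm image-token states "outliers" and measure each self-attention head's attention share on them; shapes and the top-k cutoff below are assumptions.

```python
import torch

def outlier_attention_share(attn_weights, image_hidden, image_token_ids, top_k=5):
    """attn_weights: (heads, seq, seq) attention from one layer;
    image_hidden: (num_image_tokens, d) hidden states of the image tokens;
    image_token_ids: their positions in the full sequence. Attention rows are
    already normalized, so the sum over outlier positions is a share of the total."""
    norms = image_hidden.norm(dim=-1)
    outlier_ids = torch.tensor(image_token_ids)[norms.topk(top_k).indices]
    last_pos = attn_weights[:, -1, :]                       # (heads, seq), final text position
    return last_pos[:, outlier_ids].sum(dim=-1)             # per-head share on outliers
```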
The finding that important cross-attention heads implement one of a small set of interpretable functions helps boost transparency and trust in VLMs. Paper: https://t.co/baxqkEXZnx GitHub: https://t.co/GLMpam4Gmf [5/5]