There are so many vision-language models: OpenAI’s CLIP, Meta’s FLAVA, Salesforce’s ALBEF, etc.
Our #CVPR2023 ⭐️ highlight ⭐️ paper finds that none of them show sufficient compositional reasoning capacity.
Since both perception and language are compositional, we have work to do.
Have vision-language models achieved human-level compositional reasoning? Our research suggests: not quite yet.
We’re excited to present CREPE – a large-scale Compositional REPresentation Evaluation benchmark for vision-language models – as a 🌟highlight🌟 at #CVPR2023.
🧵1/7
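To make the failure mode concrete, here is a minimal sketch (not the CREPE harness itself) of the kind of probe a compositionality benchmark formalizes: does CLIP score a true caption above a hard negative built from the same words with roles swapped? The model checkpoint and image filename are illustrative assumptions.

```python
# Hedged sketch of a compositional hard-negative probe for CLIP.
# Checkpoint and image path are hypothetical placeholders, not CREPE's setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog_chasing_cat.jpg")  # hypothetical image of a dog chasing a cat
texts = [
    "a dog chasing a cat",  # true caption
    "a cat chasing a dog",  # hard negative: same words, swapped roles
]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores

# A compositionally competent model should rank the true caption first;
# in practice, VLMs often score both near-identically.
print(logits.softmax(dim=-1))
```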