There are so many vision-language models: OpenAI’s CLIP, Meta’s FLAVA, Salesforce’s ALBEF, etc.
Our #CVPR2023 ⭐️ highlight ⭐️ paper finds that none of them show sufficient compositional reasoning capacity.
Since both perception and language are compositional, we have work to do.
Have vision-language models achieved human-level compositional reasoning? Our research suggests: not quite yet.
We’re excited to present CREPE – a large-scale Compositional REPresentation Evaluation benchmark for vision-language models – as a 🌟highlight🌟 at #CVPR2023.
🧵1/7
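To make the failure mode concrete, here is a minimal sketch (not the CREPE harness itself) of the kind of probe a compositionality benchmark formalizes: does CLIP score a true caption above a hard negative built from the same words with roles swapped? The model checkpoint and image filename are illustrative assumptions.

```python
# Hedged sketch of a compositional hard-negative probe for CLIP.
# Checkpoint and image path are hypothetical placeholders, not CREPE's setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog_chasing_cat.jpg")  # hypothetical image of a dog chasing a cat
texts = [
    "a dog chasing a cat",  # true caption
    "a cat chasing a dog",  # hard negative: same words, swapped roles
]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores

# A compositionally competent model should rank the true caption first;
# in practice, VLMs often score both near-identically.
print(logits.softmax(dim=-1))
```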