rid (@ridouaneg_)
Followers: 34 · Following: 181 · Media: 7 · Statuses: 13
Joined March 2024
#ICCV2025 Workshop Alert! [COMIQ] Comic Intelligence Quotient: Advances and Challenges in AI-driven Comic Analysis. We're exploring how machines interpret abstract visual storytelling media; dare I say, a true test for AGI. Please consider submitting abstracts. Link below.
Nice benchmark using movies to check if LLMs understand characters' mental states (like beliefs and intents)! This ability is crucial for developing AIs that can live among humans and learn from them.
Excited to finally share MOMENTS!! A new human-annotated benchmark to evaluate Theory of Mind in multimodal LLMs using long-form videos with real human actors. 2.3K+ MCQA items from 168 short films, testing 7 different ToM abilities.
Participate in our VideoQA competition! Winners get to present their work at the SLoMO workshop #ICCV2025
https://t.co/MJs7ZsxocN
Movies are more than just video clips; they are stories! We're hosting the 1st SLoMO Workshop at #ICCV2025 to discuss Story-Level Movie Understanding & Audio Descriptions! Website: https://t.co/k1hDRCFjjd Competition: https://t.co/JseLilr6oc
Guessing where an image was taken is a hard and often ambiguous problem. Introducing diffusion-based geolocation: we predict global locations by refining random guesses into trajectories across the Earth's surface! Paper, code, and demo: https://t.co/pNRFZk9NYP
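Not the authors' implementation, but a minimal conceptual sketch of the idea described above: start from a random point on the globe and let a learned denoiser iteratively refine the guess, tracing a trajectory across the Earth's surface. The `denoiser` callable and its signature are assumptions for illustration.

```python
import numpy as np

def refine_guess(denoiser, image_features, steps=50, rng=None):
    """Refine a random coordinate guess into a location prediction.

    `denoiser` is a hypothetical callable that, given image features, the
    current (lat, lon) guess, and the noise level, returns an updated
    (lat, lon) as a NumPy array. It stands in for whatever network the
    paper actually trains.
    """
    rng = rng or np.random.default_rng()
    # Start from a random point on the globe (the "pure noise" guess).
    coord = np.array([rng.uniform(-90.0, 90.0), rng.uniform(-180.0, 180.0)])
    trajectory = [coord.copy()]
    for t in reversed(range(steps)):
        noise_level = (t + 1) / steps
        # The model nudges the current guess toward its location estimate.
        coord = denoiser(image_features, coord, noise_level)
        trajectory.append(coord.copy())
    return coord, trajectory  # final prediction plus the path across the globe
```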
(1/8) Introducing the Short Film Dataset (SFD), a long video QA benchmark with 1k short films and 5k questions. Why another videoQA dataset? Story-level QAs, publicly available videos, minimal data leakage, and long temporal context questions. https://t.co/FJQzIRgDxV
(8/8) Work done in collaboration with @xiwang92, @VickyKalogeiton, and Ivan Laptev. We thank everyone who helped us create this benchmark! Links: https://t.co/FJQzIRg5In · https://t.co/4IVacjuxLo · https://t.co/UXUg9X4V2h
(7/8) Finally, we show that increasing the input context window (going from shot-level to movie-level input) improves task performance. This intuitive result supports our setup and, we believe, makes the benchmark valuable to the community.
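A rough sketch of how such a shot-level vs. movie-level comparison might be run; the `model` callable and the per-example fields (`shots`, `question`, `answer`) are placeholders, not the benchmark's actual interface.

```python
def build_context(shots, level):
    """Assemble the text input at different granularities.

    `shots` is assumed to be a list of per-shot descriptions or subtitles
    in temporal order (a simplification of whatever the benchmark provides).
    """
    if level == "shot":
        return shots[0]          # a single shot's worth of information
    if level == "movie":
        return "\n".join(shots)  # the whole film, every shot in order
    raise ValueError(f"unknown level: {level!r}")


def accuracy(model, dataset, level):
    """Fraction of questions answered correctly at a given context level."""
    correct = 0
    for example in dataset:  # each example: {"shots": [...], "question": ..., "answer": ...}
        context = build_context(example["shots"], level)
        prediction = model(context, example["question"])  # placeholder QA model call
        correct += int(prediction == example["answer"])
    return correct / len(dataset)
```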
(6/8) We also argue that modern VLMs are mature enough for open-ended videoQA, so we include this task in our benchmark with LLM-based scoring as the metric. All models struggle with it.
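For context, LLM-scoring of open-ended answers is typically done by prompting a judge model to compare a prediction against the reference answer. Below is a minimal sketch using the OpenAI chat API; the prompt wording, 0-5 scale, and model name are assumptions, not the benchmark's exact protocol.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_score(question, reference, prediction, model="gpt-4o-mini"):
    """Ask an LLM judge how well a free-form answer matches the reference."""
    prompt = (
        "You are grading an answer to a question about a short film.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Predicted answer: {prediction}\n"
        "Rate how well the prediction matches the reference on a scale from 0 to 5. "
        "Reply with a single integer."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```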
(5/8) Most models struggle (< 40% accuracy) compared to human performance (~90%). Only LLoVi, based on GPT-3.5, performs well (55.6%), but it relies mostly on subtitles and underperforms with vision alone. This highlights the need for truly multimodal methods that better integrate vision.
(4/8) We create 5k questions by leveraging LLMs, followed by careful manual curation. We focus on three types of questions: setting-, character-, and story-related.
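A minimal sketch of what LLM-assisted question drafting could look like, assuming the OpenAI chat API; the prompt, model name, and use of a synopsis as input are illustrative, and in the real pipeline every generated item still goes through manual curation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTION_TYPES = ("setting", "character", "story")

def draft_questions(synopsis, question_type, n=3, model="gpt-4o-mini"):
    """Draft candidate questions of one type from a film synopsis.

    Returns raw LLM text; every item would still be manually curated.
    """
    if question_type not in QUESTION_TYPES:
        raise ValueError(f"unknown question type: {question_type!r}")
    prompt = (
        f"Here is the synopsis of a short film:\n{synopsis}\n\n"
        f"Write {n} {question_type}-related multiple-choice questions about it, "
        "each with answer options and the correct option marked."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```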
(3/8) We address this by proposing a dataset of short films, i.e. amateur movies (5 to 20 minutes) created by filmmakers to explore new ideas or to promote their work. Many high-quality short films are publicly available on YouTube, and they are unknown to LLMs.
(2/8) Movies are great for benchmarking VLMs, but they suffer from data leakage: modern LLMs have memorized popular movies and can answer questions given only the movie title. For example, GPT-4V achieves 71.3% accuracy on MovieQA without even watching the movies.
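To illustrate how this kind of leakage can be measured, here is a sketch of a text-only probe that asks a model to answer a multiple-choice question from the title alone; the OpenAI model name and prompt format are assumptions. Accuracy well above chance on such probes indicates memorization.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def leakage_probe(title, question, options, model="gpt-4o-mini"):
    """Ask a model to answer a multiple-choice question from the title alone.

    No frames or subtitles are provided, so accuracy well above chance
    suggests the movie was memorized during training.
    """
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    prompt = (
        f"Movie: {title}\n"
        f"Question: {question}\n"
        f"Options:\n{lettered}\n"
        "Answer with the letter of the best option only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```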