Enxin Song Profile
Enxin Song

@EnxinSong

Followers: 126
Following: 20
Media: 9
Statuses: 16

Vision & Language

Joined May 2023
@wenhaocha1
Wenhao Chai
16 days
Our paper Video-MMLU has been awarded Outstanding Paper at the ICCV Workshop! I happened to receive this wonderful news while soaking in the water; I couldn’t be happier! Huge thanks to the Knowledge-Intensive Multimodal Reasoning Workshop Committee for the honor.
@EnxinSong
Enxin Song
1 month
For more experiments & findings, check out: 📄 Paper: https://t.co/ZmJm6ZUbxc 🌐 Project: https://t.co/9P2DBkhsuu 💻 Code: https://t.co/aTru16jVFn 🤗 Model:
0
1
5
@EnxinSong
Enxin Song
1 month
4️⃣ While attention sinks have been widely studied in recent LLMs such as GPT-OSS, we observe that learnable sparse mechanisms in VideoNSA can induce dynamic attention sinks, and the effect is branch-specific.
1
0
6
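The dynamic attention-sink observation above can be checked with a simple diagnostic: measure how much attention mass each head places on the first key position(s). A minimal sketch, assuming you already have per-head attention probability maps from a forward pass; the function name and the notion of a "sink score" are illustrative, not the analysis code behind this tweet.

```python
# Minimal attention-sink diagnostic (illustrative, not VideoNSA's analysis code).
import torch


def sink_score(attn: torch.Tensor, sink_tokens: int = 1) -> torch.Tensor:
    """attn: (heads, queries, keys) attention probabilities.
    Returns, per head, the mean attention mass placed on the first `sink_tokens` keys."""
    return attn[..., :sink_tokens].sum(dim=-1).mean(dim=-1)


if __name__ == "__main__":
    attn = torch.softmax(torch.randn(8, 128, 128), dim=-1)  # stand-in attention maps
    print(sink_score(attn))                                  # one score per head
```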
@EnxinSong
Enxin Song
1 month
3️⃣ We analyze in depth the roles of the compression, selection, and sliding-window branches in VideoNSA, focusing on their cross-layer distributions and inter-head similarities.
1
0
3
@EnxinSong
Enxin Song
1 month
2️⃣ We varied the ratio of global vs. local attention and further scaled the attention budget. VideoNSA attains leading performance with only 3.6% of the full attention budget.
1
0
2
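As a rough illustration of what a small attention budget means, here is a back-of-the-envelope sketch that estimates the fraction of key positions a query attends to under a sliding window, mean-pooled compression blocks, and top-k selection. The parameter values are invented for illustration; the 3.6% figure in the tweet above is VideoNSA's reported result, not an output of this toy calculation.

```python
# Toy attention-budget estimate under assumed branch sizes (not VideoNSA's settings).
def attention_budget(seq_len: int, window: int, block: int, topk: int) -> float:
    """Approximate keys attended per query (window + compressed blocks + selected keys)
    divided by the keys a full-attention query would see."""
    per_query = window + seq_len / block + topk
    return per_query / seq_len


if __name__ == "__main__":
    # Illustrative settings for a 128K-token context.
    print(f"{attention_budget(128_000, window=256, block=32, topk=1024):.3%}")  # ~4.1%
```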
@EnxinSong
Enxin Song
1 month
We conduct extensive experiments and reveal several key findings 👇 1️⃣ VideoNSA scales reliably to 128K vision–text contexts, with task-dependent budgeting over tokens per frame and the number of frames.
1
0
3
@EnxinSong
Enxin Song
1 month
Token compression causes irreversible information loss in video understanding. 🤔 What can we do with sparse attention? We introduce VideoNSA, a hardware-aware and learnable hybrid sparse attention mechanism that scales to 128K context length.
2
20
144
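The announcement above describes VideoNSA as a learnable hybrid sparse attention built from compression, selection, and sliding-window branches (the thread's findings above refer to the same three branches). Below is a minimal, self-contained sketch of that kind of hybrid mechanism, assuming mean-pooled compression blocks, per-query top-k selection, a causal sliding window, and scalar per-branch gates; the function names, branch sizes, and gating are illustrative assumptions, not the released VideoNSA implementation, and nothing here is hardware-aware.

```python
# Sketch of a hybrid sparse attention with compression, selection, and sliding-window
# branches mixed by learned gates. All names and sizes are illustrative assumptions.
import torch
import torch.nn.functional as F


def sliding_window_attention(q, k, v, window: int):
    """Local branch: each query attends causally to the most recent `window` keys."""
    T, scale = q.shape[-2], q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale
    idx = torch.arange(T, device=q.device)
    too_old = idx[:, None] - idx[None, :] >= window   # keys older than the window
    future = idx[None, :] > idx[:, None]              # keys after the query
    scores = scores.masked_fill(too_old | future, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


def compression_attention(q, k, v, block: int):
    """Compression branch: global but coarse attention over mean-pooled key/value blocks."""
    *lead, T, d = k.shape
    pad = (-T) % block                                # zero-pad so T divides into blocks
    k_c = F.pad(k, (0, 0, 0, pad)).reshape(*lead, -1, block, d).mean(dim=-2)
    v_c = F.pad(v, (0, 0, 0, pad)).reshape(*lead, -1, block, d).mean(dim=-2)
    scores = (q @ k_c.transpose(-2, -1)) * d ** -0.5
    return F.softmax(scores, dim=-1) @ v_c


def selection_attention(q, k, v, topk: int):
    """Selection branch: fine-grained attention restricted to the top-k keys per query."""
    scores = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5
    kth = scores.topk(min(topk, scores.shape[-1]), dim=-1).values[..., -1:]
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


def hybrid_sparse_attention(q, k, v, gate_logits, window=64, block=16, topk=32):
    """Mix the three branch outputs with per-branch gates (scalar stand-ins for learned,
    input-dependent gates)."""
    outs = torch.stack([
        compression_attention(q, k, v, block),
        selection_attention(q, k, v, topk),
        sliding_window_attention(q, k, v, window),
    ])
    weights = gate_logits.softmax(dim=0).view(3, *([1] * (outs.dim() - 1)))
    return (weights * outs).sum(dim=0)


if __name__ == "__main__":
    q = k = v = torch.randn(1, 4, 256, 64)   # (batch, heads, tokens, head_dim)
    out = hybrid_sparse_attention(q, k, v, gate_logits=torch.randn(3))
    print(out.shape)                          # torch.Size([1, 4, 256, 64])
```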
@wenhaocha1
Wenhao Chai
6 months
🎉 We’re excited to host two challenges at LOVE: Multimodal Video Agent Workshop at CVPR 2025, advancing the frontier of video-language understanding! @CVPR #CVPR2025 📌 Track 1A: [VDC] Video Detailed Captioning Challenge Generate rich and structured captions that cover multiple
Link card (sites.google.com): Track 1A: Video Detailed Captioning Challenge. This track invites participants to advance video understanding by generating rich and structured captions that cover multiple aspects of each video.
2
13
43
@EnxinSong
Enxin Song
6 months
For more details, please check below: 🌐 Website: https://t.co/qQArFsTg2P 📄 Paper: https://t.co/FZ9Liqfcs9 💻 Code: https://t.co/FiseKbC91L 📚 Data: https://t.co/AXKFd6UaVr We thank the support of Lambda, Inc. for providing compute resources for this project. @LambdaAPI
Link card (enxinsong.com): Video-MMLU pushes LMMs to the limits—can the model really understand real-world lectures?
0
0
2
@EnxinSong
Enxin Song
6 months
We also explore how the number of visual tokens and the base LLMs influence performance, offering insights into the interplay between multimodal perception and reasoning in lecture comprehension.
1
0
3
@EnxinSong
Enxin Song
6 months
We evaluated 90+ models, including vision-blind baselines, open-source models, and proprietary ones. 📉 We find that existing models generally perform poorly, with accuracy ranging from only 10% to 50%.
1
0
1
@EnxinSong
Enxin Song
6 months
Each video comes with two tasks: 📝 Take Notes — detailed captioning of multi-discipline lectures 🧠 Do Quiz — open-ended QA to test reasoning over visuals & proofs Can your model learn like a student and reason like a scientist?
1
0
1
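For concreteness, here is a hedged sketch of how an evaluation loop over the two tasks might be organized, based only on the description above (detailed captioning plus open-ended QA per lecture). The record fields, the `model` callable, and the `judge` function are illustrative assumptions, not the benchmark's released evaluation code.

```python
# Illustrative Video-MMLU-style evaluation loop (assumed structure, not the official code).
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class LectureItem:
    video_path: str                    # classroom-style lecture video
    discipline: str                    # "math", "physics", or "chemistry"
    reference_caption: str             # ground-truth detailed notes for "Take Notes"
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer) for "Do Quiz"


def evaluate(items: List[LectureItem],
             model: Callable[[str, str], str],
             judge: Callable[[str, str], bool]) -> Dict[str, float]:
    """Run both tasks for every lecture; return quiz accuracy and the caption count."""
    quiz_correct, quiz_total, captions = 0, 0, []
    for item in items:
        # Task 1 (Take Notes): ask the model for a detailed caption of the lecture.
        captions.append(model(item.video_path, "Describe this lecture in detail."))
        # Task 2 (Do Quiz): open-ended QA graded by an external judge (e.g. an LLM judge).
        for question, answer in item.qa_pairs:
            prediction = model(item.video_path, question)
            quiz_correct += int(judge(prediction, answer))
            quiz_total += 1
    return {"quiz_accuracy": quiz_correct / max(quiz_total, 1),
            "num_captions": float(len(captions))}
```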
@EnxinSong
Enxin Song
6 months
🎉 Introducing Video-MMLU, a new benchmark for evaluating large multimodal models on classroom-style lectures in math, physics, and chemistry! 🧑‍🏫📚 Video-MMLU requires stronger reasoning capabilities and world knowledge than previous benchmarks for video LMMs.
1
4
16
@yujintang99
Yujin Tang
11 months
Token Merging Related Papers. Note: ICLR25 papers are submissions.
1
8
64
@wenhaocha1
Wenhao Chai
1 year
🔥 MovieChat recently received its 100th citation. Thank you all for your support! A year after its release, we’ve updated MovieChat for CVPR 2024, the first large multimodal model designed for long video understanding. Thanks to its training-free design, we’ve upgraded the
0
11
59
@wenhaocha1
Wenhao Chai
1 year
(1/n) 📸 Want to train a better video generation model? You need a better video captioner first! Check out AuroraCap! Trained on 20M+ high-quality samples, it’s more efficient at inference. #lmms #videogen 🔗 Homepage: https://t.co/FTjQxrkLEY
4
12
56