Enxin Song Profile
Enxin Song

@EnxinSong

Followers: 126
Following: 20
Media: 9
Statuses: 16

Vision & Language

Joined May 2023
@wenhaocha1
Wenhao Chai
16 days
Our paper Video-MMLU has been awarded Outstanding Paper at the ICCV Workshop! I happened to receive this wonderful news while soaking in the water; I couldn’t be happier! Huge thanks to the Knowledge-Intensive Multimodal Reasoning Workshop Committee for the honor.
@EnxinSong
Enxin Song
1 month
For more experiments & findings, check out: 📄 Paper: https://t.co/ZmJm6ZUbxc 🌐 Project: https://t.co/9P2DBkhsuu 💻 Code: https://t.co/aTru16jVFn 🤗 Model:
0
1
5
@EnxinSong
Enxin Song
1 month
4️⃣ While attention sinks have been widely studied in recent LLMs such as GPT-OSS, we observe that learnable sparse mechanisms in VideoNSA can induce dynamic attention sinks, and the effect is branch-specific.
1
0
6
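The dynamic attention-sink observation above can be checked with a simple diagnostic: measure how much attention mass each head places on the first key position(s). A minimal sketch, assuming you already have per-head attention probability maps from a forward pass; the function name and the notion of a "sink score" are illustrative, not the analysis code behind this tweet.

```python
# Minimal attention-sink diagnostic (illustrative, not VideoNSA's analysis code).
import torch


def sink_score(attn: torch.Tensor, sink_tokens: int = 1) -> torch.Tensor:
    """attn: (heads, queries, keys) attention probabilities.
    Returns, per head, the mean attention mass placed on the first `sink_tokens` keys."""
    return attn[..., :sink_tokens].sum(dim=-1).mean(dim=-1)


if __name__ == "__main__":
    attn = torch.softmax(torch.randn(8, 128, 128), dim=-1)  # stand-in attention maps
    print(sink_score(attn))                                  # one score per head
```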
@EnxinSong
Enxin Song
1 month
3️⃣ We analyze in depth the roles of the compression, selection, and sliding-window branches in VideoNSA, focusing on their cross-layer distributions and inter-head similarities.
1
0
3
@EnxinSong
Enxin Song
1 month
2️⃣ We varied the ratio of global vs. local attention and further scaled the attention budget. VideoNSA attains leading performance with only 3.6% of the full attention budget.
1
0
2
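As a rough illustration of what a small attention budget means, here is a back-of-the-envelope sketch that estimates the fraction of key positions a query attends to under a sliding window, mean-pooled compression blocks, and top-k selection. The parameter values are invented for illustration; the 3.6% figure in the tweet above is VideoNSA's reported result, not an output of this toy calculation.

```python
# Toy attention-budget estimate under assumed branch sizes (not VideoNSA's settings).
def attention_budget(seq_len: int, window: int, block: int, topk: int) -> float:
    """Approximate keys attended per query (window + compressed blocks + selected keys)
    divided by the keys a full-attention query would see."""
    per_query = window + seq_len / block + topk
    return per_query / seq_len


if __name__ == "__main__":
    # Illustrative settings for a 128K-token context.
    print(f"{attention_budget(128_000, window=256, block=32, topk=1024):.3%}")  # ~4.1%
```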
@EnxinSong
Enxin Song
1 month
We conduct extensive experiments and reveal several key findings 👇 1️⃣ VideoNSA scales reliably to 128K vision–text contexts, with task-dependent budgeting over tokens per frame and the number of frames.
1
0
3
@EnxinSong
Enxin Song
1 month
Token compression causes irreversible information loss in video understanding. 🤔 What can we do with sparse attention? We introduce VideoNSA, a hardware-aware and learnable hybrid sparse attention mechanism that scales to 128K context length.
2
20
144
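The announcement above describes VideoNSA as a learnable hybrid sparse attention built from compression, selection, and sliding-window branches (the thread's findings above refer to the same three branches). Below is a minimal, self-contained sketch of that kind of hybrid mechanism, assuming mean-pooled compression blocks, per-query top-k selection, a causal sliding window, and scalar per-branch gates; the function names, branch sizes, and gating are illustrative assumptions, not the released VideoNSA implementation, and nothing here is hardware-aware.

```python
# Sketch of a hybrid sparse attention with compression, selection, and sliding-window
# branches mixed by learned gates. All names and sizes are illustrative assumptions.
import torch
import torch.nn.functional as F


def sliding_window_attention(q, k, v, window: int):
    """Local branch: each query attends causally to the most recent `window` keys."""
    T, scale = q.shape[-2], q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale
    idx = torch.arange(T, device=q.device)
    too_old = idx[:, None] - idx[None, :] >= window   # keys older than the window
    future = idx[None, :] > idx[:, None]              # keys after the query
    scores = scores.masked_fill(too_old | future, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


def compression_attention(q, k, v, block: int):
    """Compression branch: global but coarse attention over mean-pooled key/value blocks."""
    *lead, T, d = k.shape
    pad = (-T) % block                                # zero-pad so T divides into blocks
    k_c = F.pad(k, (0, 0, 0, pad)).reshape(*lead, -1, block, d).mean(dim=-2)
    v_c = F.pad(v, (0, 0, 0, pad)).reshape(*lead, -1, block, d).mean(dim=-2)
    scores = (q @ k_c.transpose(-2, -1)) * d ** -0.5
    return F.softmax(scores, dim=-1) @ v_c


def selection_attention(q, k, v, topk: int):
    """Selection branch: fine-grained attention restricted to the top-k keys per query."""
    scores = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5
    kth = scores.topk(min(topk, scores.shape[-1]), dim=-1).values[..., -1:]
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


def hybrid_sparse_attention(q, k, v, gate_logits, window=64, block=16, topk=32):
    """Mix the three branch outputs with per-branch gates (scalar stand-ins for learned,
    input-dependent gates)."""
    outs = torch.stack([
        compression_attention(q, k, v, block),
        selection_attention(q, k, v, topk),
        sliding_window_attention(q, k, v, window),
    ])
    weights = gate_logits.softmax(dim=0).view(3, *([1] * (outs.dim() - 1)))
    return (weights * outs).sum(dim=0)


if __name__ == "__main__":
    q = k = v = torch.randn(1, 4, 256, 64)   # (batch, heads, tokens, head_dim)
    out = hybrid_sparse_attention(q, k, v, gate_logits=torch.randn(3))
    print(out.shape)                          # torch.Size([1, 4, 256, 64])
```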
@wenhaocha1
Wenhao Chai
6 months
🎉 We’re excited to host two challenges at LOVE: Multimodal Video Agent Workshop at CVPR 2025, advancing the frontier of video-language understanding! @CVPR #CVPR2025 📌 Track 1A: [VDC] Video Detailed Captioning Challenge Generate rich and structured captions that cover multiple
Link card (sites.google.com): Track 1A: Video Detailed Captioning Challenge. This track invites participants to advance video understanding by generating rich and structured captions that cover multiple aspects of each video.
2
13
43
@EnxinSong
Enxin Song
6 months
For more details, please check below: 🌐 Website: https://t.co/qQArFsTg2P 📄 Paper: https://t.co/FZ9Liqfcs9 💻 Code: https://t.co/FiseKbC91L 📚 Data: https://t.co/AXKFd6UaVr We thank the support of Lambda, Inc. for providing compute resources for this project. @LambdaAPI
Link card (enxinsong.com): Video-MMLU pushes LMMs to the limits—can the model really understand real-world lectures?
0
0
2
@EnxinSong
Enxin Song
6 months
We also explore how the number of visual tokens and the base LLMs influence performance, offering insights into the interplay between multimodal perception and reasoning in lecture comprehension.
1
0
3
@EnxinSong
Enxin Song
6 months
We evaluated 90+ models, including vision-blind baselines, open-source models, and proprietary ones. 📉 We find that existing models generally perform poorly, with accuracy ranging from only 10% to 50%.
1
0
1
@EnxinSong
Enxin Song
6 months
Each video comes with two tasks: 📝 Take Notes — detailed captioning of multi-discipline lectures 🧠 Do Quiz — open-ended QA to test reasoning over visuals & proofs Can your model learn like a student and reason like a scientist?
1
0
1
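For concreteness, here is a hedged sketch of how an evaluation loop over the two tasks might be organized, based only on the description above (detailed captioning plus open-ended QA per lecture). The record fields, the `model` callable, and the `judge` function are illustrative assumptions, not the benchmark's released evaluation code.

```python
# Illustrative Video-MMLU-style evaluation loop (assumed structure, not the official code).
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class LectureItem:
    video_path: str                    # classroom-style lecture video
    discipline: str                    # "math", "physics", or "chemistry"
    reference_caption: str             # ground-truth detailed notes for "Take Notes"
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer) for "Do Quiz"


def evaluate(items: List[LectureItem],
             model: Callable[[str, str], str],
             judge: Callable[[str, str], bool]) -> Dict[str, float]:
    """Run both tasks for every lecture; return quiz accuracy and the caption count."""
    quiz_correct, quiz_total, captions = 0, 0, []
    for item in items:
        # Task 1 (Take Notes): ask the model for a detailed caption of the lecture.
        captions.append(model(item.video_path, "Describe this lecture in detail."))
        # Task 2 (Do Quiz): open-ended QA graded by an external judge (e.g. an LLM judge).
        for question, answer in item.qa_pairs:
            prediction = model(item.video_path, question)
            quiz_correct += int(judge(prediction, answer))
            quiz_total += 1
    return {"quiz_accuracy": quiz_correct / max(quiz_total, 1),
            "num_captions": float(len(captions))}
```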
@EnxinSong
Enxin Song
6 months
🎉 Introducing Video-MMLU, a new benchmark for evaluating large multimodal models on classroom-style lectures in math, physics, and chemistry! 🧑‍🏫📚 Video-MMLU requires stronger reasoning capabilities and world knowledge than previous benchmarks for video LMMs.
1
4
16
@yujintang99
Yujin Tang
11 months
Token Merging Related Papers. Note: ICLR25 papers are submissions.
1
8
64
@wenhaocha1
Wenhao Chai
1 year
🔥 MovieChat recently received its 100th citation. Thank you all for your support! A year after its release, we’ve updated MovieChat for CVPR 2024, the first large multimodal model designed for long video understanding. Thanks to its training-free design, we’ve upgraded the
0
11
59
@wenhaocha1
Wenhao Chai
1 year
(1/n) 📸 Want to train a better video generation model? You need a better video captioner first! Check out AuroraCap! Trained on 20M+ high-quality samples, it’s more efficient at inference. #lmms #videogen 🔗 Homepage: https://t.co/FTjQxrkLEY
4
12
56