Tanveer Hannan (on job market)
@hannan_tanveer
Followers: 43 | Following: 48 | Media: 20 | Statuses: 140
On the Industry Job Market | Research Intern @Microsoft | PhD Student @LMU. Computer Vision, Video Understanding, Multimodal, AI Agent
Munich, Germany
Joined December 2013
Our latest paper, DocSLM, developed during my internship at Microsoft, is now on arXiv: https://t.co/P4m7o05SwZ. It is an efficient & compact Vision-Language Model that processes long & complex documents while running on resource-constrained edge devices like mobiles & laptops.
arxiv.org
Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for...
Thanks to my co-authors @dimi_Mall, Parth Pathak, Faegheh (Fay) Sardari, @thomasseidl, @gberta227, @MohsenFyz, and @sunandosengupta for their contributions throughout this project.
• A scalable stream processor that reliably handles long document sequences (up to 120 pages).
• DocSLM uses 75% fewer parameters, has 71% lower latency, and keeps memory constant at 14 GB across varying document lengths, all while delivering competitive or SOTA performance.
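A minimal sketch of the streaming idea described above, under assumptions of my own (module names, tensor sizes, and the cross-attention update are illustrative placeholders, not DocSLM's actual design): each page is encoded, folded into a fixed-size summary state, and then discarded, so peak memory stays roughly constant regardless of document length.

```python
# Hypothetical sketch: page-by-page streaming with a fixed-size carry-over state,
# so peak memory stays roughly constant no matter how many pages the document has.
import torch
import torch.nn as nn


class StreamingDocReader(nn.Module):
    """Processes pages sequentially; only `state_len` summary tokens persist between pages."""

    def __init__(self, dim: int = 768, state_len: int = 64, num_heads: int = 8):
        super().__init__()
        self.state = nn.Parameter(torch.randn(1, state_len, dim) * 0.02)  # learned initial state
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, pages: list) -> torch.Tensor:
        # pages: list of (1, tokens_per_page, dim) tensors, one per page
        state = self.state
        for page_tokens in pages:
            # The fixed-size state attends to the current page, then the page is dropped.
            update, _ = self.cross_attn(query=state, key=page_tokens, value=page_tokens)
            state = self.norm(state + update)
        return state  # (1, state_len, dim): constant-size summary of the whole document


# Toy usage: 120 "pages" of 256 tokens each; per step we hold only one page plus the state.
reader = StreamingDocReader()
doc = [torch.randn(1, 256, 768) for _ in range(120)]
print(reader(doc).shape)  # torch.Size([1, 64, 768])
```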
Key Contributions
• A hierarchical compression module that integrates OCR, visual, and layout features into a fixed-length representation, achieving an 82% reduction in visual tokens while preserving essential semantic structure.
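Not the paper's code, just a rough sketch of how a compression module like the one described might fuse OCR-text, visual, and layout features and squeeze them into a fixed-length token set via learned queries; all dimensions and module names here are assumptions for illustration.

```python
# Illustrative sketch: fuse OCR-text, visual, and layout features, then compress
# them to a fixed number of tokens with learned queries (cross-attention pooling).
import torch
import torch.nn as nn


class HierarchicalCompressor(nn.Module):
    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.layout_proj = nn.Linear(4, dim)  # (x1, y1, x2, y2) box -> layout feature
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, ocr_feats, vis_feats, boxes):
        # ocr_feats: (B, N, dim) OCR-token features, vis_feats: (B, M, dim) patch features,
        # boxes: (B, N, 4) normalized layout boxes for the OCR tokens.
        ocr_with_layout = ocr_feats + self.layout_proj(boxes)
        tokens = torch.cat([ocr_with_layout, vis_feats], dim=1)         # (B, N + M, dim)
        queries = self.queries.expand(tokens.shape[0], -1, -1)
        compressed, _ = self.attn(queries, tokens, tokens)              # (B, num_queries, dim)
        return compressed  # fixed length regardless of how many tokens went in


comp = HierarchicalCompressor()
out = comp(torch.randn(1, 500, 768), torch.randn(1, 196, 768), torch.rand(1, 500, 4))
print(out.shape)  # torch.Size([1, 32, 768]): far fewer than the 696 input tokens
```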
Check out the Spatiotemporal Action Grounding Challenge, now featured on the MCML blog! https://t.co/UXbN00vFTQ
mcml.ai
ICCV 2025 workshop: Advancing AI to detect who does what, when, and where, across space, time, and complex real-world videos.
Check out our new challenge/workshop at @ICCVConference
Exciting news! We're happy to announce our challenge/workshop at this year's @ICCVConference, focusing on Spatiotemporal Action Grounding in Videos. Here are the details:
• Watch the video below for a demo.
• The eval server is open until 09/19!
• Links incl. code below. #ICCV
We invite the research community to participate, submit their methods, and contribute to shaping the future of spatiotemporal understanding in computer vision. Outstanding submissions will be featured at the ICCV 2025 Workshop.
The benchmark introduces new tasks, datasets, and evaluation protocols to encourage the development of more robust, scalable, and generalizable user-instruction-based models for complex, real-world scenarios.
This year's challenge is centered on advancing research in:
• Multi-Object Tracking
• Instruction-Based Spatiotemporal Detection
• Long-Term Temporal Reasoning
Challenge Launch Announcement: We are pleased to announce the launch of the MOT25 Challenge, to be held in conjunction with ICCV 2025.
Workshop website: https://t.co/YGg9wphKnT
The MOT25 Challenge is now live on Codabench:
On the job market! Final-year PhD @ UNC Chapel Hill working on computer vision, video understanding, multimodal LLMs & AI agents. 2x Research Scientist Intern @Meta. Seeking Research Scientist/Engineer roles! https://t.co/z9ioZPFCi9 | mmiemon [at] cs [dot] unc [dot] edu
md-mohaiminul.github.io
A highly-customizable Hugo academic resume theme powered by Wowchemy website builder.
Check out our latest work, ReVisionLLM, now featured on the MCML blog! A Vision-Language Model for accurate temporal grounding in hour-long videos. https://t.co/cTNNcRLsFE
#VisionLanguage #MultimodalAI #MCML #CVPR2025
mcml.ai
Tanveer Hannan and colleagues introduce ReVisionLLM, an AI model that mimics human skimming to accurately find key moments in long videos.
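A toy sketch of the recursive "skim, then zoom in" strategy the post describes: coarsely score sub-segments of a long video, recurse only into the most promising ones, and stop once a span is short enough to localize. The scoring function below is a hypothetical stand-in for the actual vision-language model, and all parameter values are illustrative.

```python
# Rough sketch of recursive coarse-to-fine temporal grounding; `score_fn` is a
# placeholder for the real model's relevance scoring of a video span against a query.
from typing import Callable, List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def ground_recursively(
    span: Span,
    score_fn: Callable[[Span, str], float],
    query: str,
    min_len: float = 30.0,  # stop recursing below this span length (seconds)
    branches: int = 8,      # sub-segments skimmed per level
    keep: int = 2,          # promising sub-segments to zoom into
) -> List[Span]:
    start, end = span
    if end - start <= min_len:
        return [span]
    step = (end - start) / branches
    segments = [(start + i * step, start + (i + 1) * step) for i in range(branches)]
    ranked = sorted(segments, key=lambda s: score_fn(s, query), reverse=True)
    hits: List[Span] = []
    for seg in ranked[:keep]:
        hits.extend(ground_recursively(seg, score_fn, query, min_len, branches, keep))
    return hits


# Toy usage over a one-hour video with a dummy scorer that prefers spans near minute 42.
dummy = lambda s, q: -abs((s[0] + s[1]) / 2 - 42 * 60)
print(ground_recursively((0.0, 3600.0), dummy, "when does the goal happen?"))
```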
Great to see so much interest in ReVisionLLM from the video understanding community! If you missed it, check out https://t.co/KAF47QI7yp
Presenting ReVisionLLM at #CVPR2025 today! Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos. If you are at CVPR, please stop by:
Poster #307, Session 4
June 14, 5-7 PM | ExHall D
https://t.co/qrBvf2UUAo
Excited to have our paper ReVisionLLM presented today at #CVPR2025! Website:
lnkd.in
Had a great time presenting at the GenAI session @CiscoMeraki, thanks @nahidalam for the invite! Catch us at #CVPR2025:
BIMBA: https://t.co/4XCHPFWchy (June 15, 4-6 PM, Poster #282)
ReVisionLLM: https://t.co/KAF47QI7yp (June 14, 5-7 PM, Poster #307)
@gberta227 @hannan_tanveer
arxiv.org
Large language models (LLMs) excel at retrieving information from lengthy text, but their vision-language counterparts (VLMs) face difficulties with hour-long videos, especially for temporal...
The time for new architectures is over? Not quite! SeNaTra, a native segmentation backbone, is waiting; let's see how it works. https://t.co/2I9nuLBsSz
arxiv.org
Uniform downsampling remains the de facto standard for reducing spatial resolution in vision backbones. In this work, we propose an alternative design built around a content-aware spatial grouping...
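To make the abstract's contrast concrete, here is a toy comparison (not SeNaTra's implementation; the grouping step and its seeding are my own simplification) between uniform downsampling and a content-aware grouping step that pools features by similarity instead of by fixed spatial windows.

```python
# Toy contrast: location-based uniform downsampling vs. similarity-based grouping.
import torch
import torch.nn.functional as F


def uniform_downsample(x: torch.Tensor) -> torch.Tensor:
    # x: (B, C, H, W) -> (B, C, H/2, W/2); every 2x2 window is pooled regardless of content.
    return F.avg_pool2d(x, kernel_size=2)


def content_aware_grouping(x: torch.Tensor, num_groups: int = 64) -> torch.Tensor:
    # x: (B, C, H, W) -> (B, num_groups, C); tokens are softly assigned to groups by
    # feature similarity, so groups can follow content (e.g., object boundaries).
    B, C, H, W = x.shape
    tokens = x.flatten(2).transpose(1, 2)                                    # (B, H*W, C)
    centers = tokens[:, torch.linspace(0, H * W - 1, num_groups).long()]     # seed group centers
    assign = F.softmax(tokens @ centers.transpose(1, 2) / C ** 0.5, dim=-1)  # (B, H*W, G)
    grouped = assign.transpose(1, 2) @ tokens                                # (B, G, C)
    return grouped / assign.sum(dim=1).unsqueeze(-1).clamp(min=1e-6)         # normalize weights


feat = torch.randn(2, 96, 56, 56)
print(uniform_downsample(feat).shape)      # torch.Size([2, 96, 28, 28])
print(content_aware_grouping(feat).shape)  # torch.Size([2, 64, 96])
```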
Effective long-context comprehension remains a significant hurdle for LLMs. Meta's forthcoming Llama 4 aims to address this with its iRoPE architecture. I am looking forward to testing it on more real-life setups like streaming videos.
Today is the start of a new era of natively multimodal AI innovation. Today, we're introducing the first Llama 4 models: Llama 4 Scout and Llama 4 Maverick, our most advanced models yet and the best in their class for multimodality. Llama 4 Scout: 17B-active-parameter model
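The tweet names iRoPE but does not explain it; as background only, here is a minimal sketch of standard rotary position embeddings (RoPE), the mechanism iRoPE presumably builds on. The interleaving and scaling specifics of iRoPE are not reproduced here.

```python
# Background sketch of standard RoPE: rotate each (even, odd) feature pair of the
# queries/keys by a position-dependent angle so relative position enters dot products.
import torch


def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (seq_len, dim) with even dim.
    seq_len, dim = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32)[:, None]              # (seq, 1)
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # (dim/2,)
    angles = pos * freqs                                                   # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out


q = torch.randn(16, 64)
print(rope(q).shape)  # torch.Size([16, 64])
```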
Check out the #CVPR2025 paper on long video understanding. It achieves SOTA with a much simpler and more efficient end-to-end approach.
New #CVPR2025 Paper! Introducing BIMBA, an efficient multimodal LLM for long-range video QA. It sets SOTA on 7 VQA benchmarks by intelligently selecting key spatiotemporal tokens using the selective scan mechanism of Mamba models. Thread below: https://t.co/yP9ZLkUX2N
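A cartoon of the "selective scan" token-selection idea mentioned above, not BIMBA's or Mamba's actual implementation: an input-dependent gate scans the long spatiotemporal token sequence, and only the most strongly gated tokens, plus a running summary, are handed to the LLM. Module names, sizes, and the gating rule are assumptions for illustration.

```python
# Toy selective scan: input-dependent gates decide how much each token updates a
# running summary, and the most strongly gated tokens are kept as the reduced set.
import torch
import torch.nn as nn


class SelectiveTokenCompressor(nn.Module):
    def __init__(self, dim: int = 512, keep: int = 256):
        super().__init__()
        self.gate = nn.Linear(dim, 1)  # per-token, input-dependent selection gate
        self.keep = keep

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, dim) flattened spatiotemporal tokens from a long video
        g = torch.sigmoid(self.gate(tokens))                    # (B, T, 1) gates
        # Sequential gated scan: each token updates the summary in proportion to its gate.
        state = torch.zeros(tokens.shape[0], tokens.shape[-1], device=tokens.device)
        for t in range(tokens.shape[1]):
            state = (1 - g[:, t]) * state + g[:, t] * tokens[:, t]
        # Keep the top-k most strongly gated tokens plus the scan summary.
        idx = g.squeeze(-1).topk(self.keep, dim=1).indices      # (B, keep)
        kept = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
        return torch.cat([state.unsqueeze(1), kept], dim=1)     # (B, 1 + keep, dim)


comp = SelectiveTokenCompressor()
video_tokens = torch.randn(1, 2048, 512)   # e.g. 128 sampled frames x 16 patches each
print(comp(video_tokens).shape)            # torch.Size([1, 257, 512])
```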