Jilan Xu
@JazzzCharles
Followers: 16 · Following: 81 · Media: 6 · Statuses: 9
PhD Student working on vision-language models, video understanding and medical image analysis
Joined June 2022
iv) We show that causal temporal attention is highly computationally efficient when equipped with a KV-cache, saving both latency and GPU memory. Our multitask training also shows data efficiency compared to conventional video-text contrastive learning.
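As a rough illustration of the KV-cache idea mentioned above, the sketch below shows causal temporal attention that processes one frame at a time and reuses cached keys/values from past frames. This is not the StreamFormer code; the class name, token layout, and dimensions are assumptions for illustration only.

```python
# Minimal sketch (not the paper's implementation): causal temporal attention over
# the current frame's tokens with a KV-cache, so each new frame attends to all
# cached past frames without recomputing their keys/values.
import torch
import torch.nn.functional as F

class CausalTemporalAttention(torch.nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, x, kv_cache=None):
        # x: (batch, tokens_per_frame, dim) -- tokens of the *current* frame only.
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)

        if kv_cache is not None:
            # Prepend cached keys/values from all previous frames. Causality holds by
            # construction: the cache only ever contains past frames, never future ones.
            past_k, past_v = kv_cache
            k = torch.cat([past_k, k], dim=2)
            v = torch.cat([past_v, v], dim=2)

        out = F.scaled_dot_product_attention(q, k, v)  # no explicit mask needed
        out = out.transpose(1, 2).reshape(B, N, D)
        return self.proj(out), (k, v)  # updated cache to pass in with the next frame
```

At streaming time, frames would be fed one by one, with the returned (k, v) passed back in as kv_cache for the next frame, which is where the latency and memory savings come from.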
iii) Experimental results show that StreamFormer achieves competitive results across a range of downstream tasks, including Online Action Detection, Online Video Instance Segmentation, and VideoQA.
ii) We unify spatiotemporal tasks under a multitask learning framework with three objectives: (1) capturing global semantics via video-level supervision, (2) modeling temporal dynamics with frame-level supervision, and (3) achieving spatial precision via pixel-wise supervision.
i) StreamFormer is a Transformer-based video backbone that accepts continuous video streams. It uses divided space-time attention, consisting of: (1) pretrained spatial attention adapted with LoRA, and (2) temporal attention with causal masking for streaming video modeling.
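For the LoRA part, a generic sketch of low-rank adaptation of a frozen pretrained projection (as used in spatial attention layers) is shown below. The class name, rank, and scaling are illustrative assumptions, not the paper's settings.

```python
# Rough sketch of LoRA-style adaptation of a frozen (pretrained) linear projection:
# the base weight stays frozen, only the low-rank A/B matrices are trained.
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # keep the pretrained weights frozen
        self.lora_a = torch.nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = torch.nn.Linear(rank, base.out_features, bias=False)
        torch.nn.init.zeros_(self.lora_b.weight)     # start as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        # frozen pretrained path + trainable low-rank correction
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```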
Excited to announce that StreamFormer has been accepted to #ICCV2025 as an Oral paper! Congrats to my excellent coauthors @Anxiou51 @WeidiXie! "Learning Streaming Video Representation via Multitask Training" Paper: https://t.co/h8OUpPaIVq Web: https://t.co/z2geDGRNXr
OLMoE: Open Mixture-of-Experts Language Models abs: https://t.co/68jxJS6GLc model: https://t.co/h4tLvJRZqv "OLMOE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMOE-1B-7B-INSTRUCT. Our…"
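To illustrate how a model can hold 7B parameters yet activate only about 1B per token, here is a toy sketch of sparse top-k mixture-of-experts routing. The sizes, function names, and routing details are assumptions for illustration, not OLMoE's actual architecture.

```python
# Toy sketch of sparse mixture-of-experts routing: each token is dispatched to only
# its top_k experts, so far fewer parameters are active per token than exist in total.
import torch
import torch.nn.functional as F

def moe_forward(x, experts, router, top_k=2):
    # x: (tokens, dim); experts: list of small MLP modules; router: Linear(dim, num_experts)
    scores = router(x)                               # (tokens, num_experts)
    weights, idx = scores.topk(top_k, dim=-1)        # pick the top_k experts per token
    weights = F.softmax(weights, dim=-1)
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e                 # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
    return out
```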