David Fan Profile
David Fan

@DavidJFan

Followers: 576
Following: 319
Media: 25
Statuses: 86

Facebook AI Research (FAIR) | Video Representations, Self-Supervised Learning | @Princeton Computer Science '19

New York City
Joined June 2013
@DavidJFan
David Fan
4 months
Can visual SSL match CLIP on VQA? Yes! We show with controlled experiments that visual SSL can be competitive even on OCR/Chart VQA, as demonstrated by our new Web-SSL model family (1B-7B params), which is trained purely on web images – without any language supervision.
12
95
458
@DavidJFan
David Fan
9 days
RT @hou_char: Come by the poster if you want recommendations on cool restaurants to try in Vancouver 😃!
0
2
0
@DavidJFan
David Fan
1 month
Congrats @jianyuan_wang and co!!!
@Oxford_VGG
Visual Geometry Group (VGG)
1 month
Many Congratulations to @jianyuan_wang, @MinghaoChen23, @n_karaev, Andrea Vedaldi, Christian Rupprecht and @davnov134 for winning the Best Paper Award @CVPR for "VGGT: Visual Geometry Grounded Transformer" 🥇🎉 🙌🙌 #CVPR2025!!!!!!
0
0
1
@DavidJFan
David Fan
3 months
Welcome Rob! So blessed to have you steer the ship! See you around the office :).
@rob_fergus
Rob Fergus
3 months
1/ Excited to share that I’m taking on the role of leading Fundamental AI Research (FAIR) at Meta. Huge thanks to Joelle for everything. Look forward to working closely again with Yann & team.
0
0
11
@DavidJFan
David Fan
3 months
RT @cindy_x_wu: Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimo…
0
45
0
@DavidJFan
David Fan
3 months
Web-SSL model weights are now available on GitHub and HuggingFace! You can load the models through your favorite Transformers library API calls or with native PyTorch – whichever you prefer. For more usage details, please see the HuggingFace collection:
github.com
Code for "Scaling Language-Free Visual Representation Learning" paper (Web-SSL). - facebookresearch/webssl
0
1
11
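A minimal loading sketch in Python, assuming the standard HuggingFace Transformers AutoModel/AutoImageProcessor API; the checkpoint id below is illustrative only – the actual model names are listed in the facebookresearch/webssl repo and the HuggingFace collection:

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_id = "facebook/webssl-dino1b-full2b-224"  # illustrative checkpoint id, not necessarily the exact name
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
features = outputs.last_hidden_state  # patch-level features; pool (e.g. mean) for an image embedding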
@DavidJFan
David Fan
3 months
Excited to release the training code for MetaMorph! MetaMorph offers a simple yet effective way to convert an LLM into a multimodal LLM that not only takes multimodal inputs, but also generates multimodal outputs via AR prediction. This confers the ability to “think visually”, and…
@TongPetersb
Peter Tong
3 months
We're open-sourcing the training code for MetaMorph! MetaMorph offers a lightweight framework for turning LLMs into unified multimodal models: (multimodal) tokens -> transformers -> diffusion -> pixel! This is our best take on unified modeling as of November 2024, and…
0
1
23
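A conceptual sketch of the tokens -> transformer -> diffusion -> pixels idea, assuming hypothetical module names and a HuggingFace-style causal backbone; this is not the MetaMorph implementation – see the released training code for the real thing:

import torch
import torch.nn as nn

class UnifiedMultimodalLM(nn.Module):
    """Toy unified model: interleave visual and text embeddings, predict both autoregressively."""
    def __init__(self, llm, vision_encoder, hidden_dim, vocab_size, vis_dim):
        super().__init__()
        self.llm = llm                                    # causal transformer backbone
        self.vision_encoder = vision_encoder              # image -> continuous visual tokens
        self.in_proj = nn.Linear(vis_dim, hidden_dim)     # visual tokens into the LLM embedding space
        self.text_head = nn.Linear(hidden_dim, vocab_size)  # next text-token logits
        self.vis_head = nn.Linear(hidden_dim, vis_dim)       # next visual-token regression
        # a separate diffusion decoder (not shown) would turn predicted visual tokens into pixels

    def forward(self, text_emb, image):
        vis_tokens = self.in_proj(self.vision_encoder(image))      # (B, N_img, hidden)
        seq = torch.cat([vis_tokens, text_emb], dim=1)              # one interleaved sequence
        hidden = self.llm(inputs_embeds=seq).last_hidden_state      # AR prediction over the joint sequence
        return self.text_head(hidden), self.vis_head(hidden)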
@DavidJFan
David Fan
3 months
RT @_amirbar: Excited to share that our paper on Navigation World Models was selected for an Oral presentation at CVPR! Code & models: htt…
huggingface.co
0
7
0
@DavidJFan
David Fan
4 months
[8/8] For more details, please check out the paper and website. Paper: Website: It was a great privilege and so much fun to work with @TongPetersb, @JiachenAI, @koustuvsinha, @liuzhuang1234, @endernewton, Michael Rabbat, Nicolas…
arxiv.org
Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is...
2
3
27
@DavidJFan
David Fan
4 months
[7/8] This side project started in October when @TongPetersb, @_amirbar, and I were thinking about the rise of CLIP as a popular vision encoder for MLLMs. The community often assumes that language supervision is the primary reason for CLIP's strong performance. However, we…
2
3
63
@DavidJFan
David Fan
4 months
[6/8] Developing higher-resolution SSL models is a promising direction. High-resolution versions of Web-DINO 7B can nearly match the performance of SOTA off-the-shelf CLIP-family models such as SigLIP 2, despite seeing only images.
1
0
14
@DavidJFan
David Fan
4 months
[5/8] The observed scaling behavior is not unique to DINOv2; other visual SSL methods such as MAE also show similar potential!
1
0
14
@DavidJFan
David Fan
4 months
[4/8] Web-SSL also performs well on classic vision tasks and is competitive with the original DINOv2. This means we are getting closer to developing vision encoders that excel at both pure vision and multimodal capabilities!
1
1
18
@DavidJFan
David Fan
4 months
[3/8] Why can SSL learn text-sensitive features purely from images? Hypothesis: Web images contain text. Observation: Training Web-DINO on text-heavy image subsets drastically improves OCR/Chart VQA performance, matching CLIP trained on the full data.
1
0
19
@DavidJFan
David Fan
4 months
[2/8] Web-DINO (SSL) scales better with model size from 1B to 7B params than CLIP, especially on OCR/Chart VQA – proving that vision-only models can also perform well on tasks that were traditionally dominated by CLIP.
1
0
20
@DavidJFan
David Fan
4 months
[1/8] We trained SSL (Web-DINO) and CLIP on the same web-scale dataset (MC-2B images/image-text pairs) and evaluated fairly on VQA by freezing the vision encoders. Our apples-to-apples comparisons enable novel insights into the scaling behavior of SSL in this new data regime.
1
0
20
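A minimal sketch of the frozen-encoder evaluation setup described above, in plain PyTorch; the helper name is illustrative, not from the paper's codebase:

import torch.nn as nn

def freeze(encoder: nn.Module) -> nn.Module:
    """Freeze a pretrained vision encoder so only the downstream VQA head/LLM adapter is trained."""
    for p in encoder.parameters():
        p.requires_grad = False
    return encoder.eval()

# vision_encoder = freeze(web_dino_or_clip_encoder)  # encoders are then compared apples-to-apples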
@DavidJFan
David Fan
4 months
RT @liuzhuang1234: New paper - Transformers, but without normalization layers (1/n)
0
600
0
@DavidJFan
David Fan
7 months
@TongPetersb @JiachenAI @YoungXiong1 @endernewton @koustuvsinha @ylecun @sainingxie @liuzhuang1234 Update: @askalphaxiv generously decided to feature MetaMorph! We will be answering questions for a week, so feel free to comment on:
0
2
5