XuDong Wang
@XDWang101
Followers: 1K · Following: 2K · Media: 42 · Statuses: 131
Research Scientist @AIatMeta, Incoming Assistant Professor @DukeU | PhD from @Berkeley_AI @UCBerkeley | Prev: @GoogleDeepMind, FAIR @MetaAI
Berkeley, CA
Joined March 2020
🧵Tired of scrolling through your horribly long model traces in VSCode to figure out why your model failed? We made StringSight to fix this: an automated pipeline for analyzing your model outputs at scale. ➡️Demo: https://t.co/FJ4GAxPIkx ➡️Blog: https://t.co/3AyXBFBEmV
3 replies · 26 reposts · 63 likes
Excited to share our latest research from Meta FAIR: TV2TV 🚀 TV2TV is a unified model that interleaves – text reasoning (next-token) – video generation (next-frame) At inference, the model dynamically alternates between thinking in text and generating video! Our paper
arxiv.org
Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should...
0 replies · 2 reposts · 19 likes
Can we simplify video generation by decomposing it into interleaved text-video co-generation? Would explicit, repeated thinking in language improve generation in pixels? We introduce TV2TV: a unified model that jointly learns - language modeling (next-token prediction) - video
4 replies · 37 reposts · 86 likes
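To give a sense of what "dynamically alternates between thinking in text and generating video" could look like at inference time, here is a minimal sketch. The `model` interface below (choose_mode, generate_text, generate_frames) is an assumption for illustration, not the released TV2TV API.

```python
# Hypothetical sketch of TV2TV-style interleaved decoding; the `model` interface
# (choose_mode / generate_text / generate_frames) is assumed for illustration,
# not the actual TV2TV implementation.
def interleaved_generate(model, prompt, max_segments=8):
    """Alternate between next-token text reasoning and next-frame video generation."""
    trace = []                                   # list of ("text", tokens) / ("video", frames)
    state = model.encode(prompt)
    for _ in range(max_segments):
        mode = model.choose_mode(state)          # "text", "video", or "stop"
        if mode == "stop":
            break
        if mode == "text":
            tokens, state = model.generate_text(state)    # next-token prediction
            trace.append(("text", tokens))
        else:
            frames, state = model.generate_frames(state)  # next-frame prediction
            trace.append(("video", frames))
    return trace
```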
Human perception is active: we move around to see, and we see with intention. In our latest work "Seeing without Pixels", we find "how you see" (how the camera moves) roughly reveals "what you do" or "what you observe" -- and this connection can be easily learned from data.
2 replies · 16 reposts · 136 likes
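As a rough illustration of learning "what you do" from "how you see" (camera motion alone, no pixels), a small classifier over a sequence of relative camera poses is sketched below. The pose dimensionality, GRU encoder, and action vocabulary size are illustrative assumptions, not the paper's model.

```python
# Hedged sketch: predict an action label from camera-motion trajectories alone.
# Architecture choices here are assumptions for illustration only.
import torch
import torch.nn as nn

class MotionToAction(nn.Module):
    def __init__(self, pose_dim=6, hidden=128, num_actions=50):
        super().__init__()
        self.encoder = nn.GRU(pose_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, poses):                # poses: (B, T, 6) relative camera motion
        _, h = self.encoder(poses)           # h: (1, B, hidden)
        return self.head(h[-1])              # logits over action classes

logits = MotionToAction()(torch.randn(2, 64, 6))
print(logits.shape)  # torch.Size([2, 50])
```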
Qianqian is an amazing researcher and will be a fantastic advisor. If you’re looking for a lab that values creativity, curiosity, and big ideas, keep an eye on hers! ✨ @QianqianWang5
I'm recruiting multiple PhD students this cycle to join me at Harvard University and the Kempner Institute! My interests span vision and intelligence, including 3D/4D, active perception, memory, representation learning, and anything you're excited to explore! Deadline: Dec 15th.
0 replies · 0 reposts · 3 likes
This work would not have been possible without the hard work of Junwei Yu and the support from @trevordarrell! Huge thanks to the team! Let us know what you build with UnSAMv2; we can't wait to see the new workflows it enables. 🚀 n/n
0 replies · 0 reposts · 6 likes
UnSAMv2 works across: • Interactive segmentation • Whole-image segmentation • Video segmentation (even though trained only on images!) • Any granularity from fine parts → full objects 7/n
1 reply · 0 reposts · 3 likes
And with only 6,000 unlabeled images, UnSAMv2 delivers strong gains over SAM-2: 📈 +16.5% NoC-90 📈 +26.0% 1-IoU 📈 +37.7% AR on whole-image segmentation Evaluated on 11+ benchmarks across objects, parts, and videos. 6/n
1 reply · 0 reposts · 2 likes
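For readers unfamiliar with the interactive-segmentation metrics above: NoC-90 is the number of clicks needed to reach 90% IoU (lower is better), 1-IoU is the IoU after a single click, and AR is average recall for automatic whole-image masks. A generic sketch of the first two, not the authors' evaluation code:

```python
# Generic sketch of two standard interactive-segmentation metrics (not the
# UnSAMv2 evaluation code): NoC@threshold and 1-click IoU.
def noc_at_threshold(iou_per_click, thresh=0.90, max_clicks=20):
    """Number of clicks until IoU first reaches `thresh` (max_clicks if never reached)."""
    for k, iou in enumerate(iou_per_click[:max_clicks], start=1):
        if iou >= thresh:
            return k
    return max_clicks

def one_click_iou(iou_per_click):
    """IoU achieved after a single click."""
    return iou_per_click[0]

# Example: a model whose mask reaches 0.9 IoU on the 3rd click.
curve = [0.62, 0.81, 0.93, 0.95]
print(noc_at_threshold(curve))  # 3
print(one_click_iou(curve))     # 0.62
```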
We use SAM-2 as the base model and introduce a Fourier-based granularity encoding module and a mask token, all with just 0.02% extra parameters on top of SAM-2, so UnSAMv2 is almost free to add. 🚀 5/n
1 reply · 0 reposts · 2 likes
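A hedged sketch of what a Fourier-based encoding of a continuous granularity value might look like; the frequencies, output dimension, and sin/cos pairing below are illustrative assumptions, not the exact UnSAMv2 module.

```python
# Illustrative Fourier-style embedding of a scalar granularity in [0, 1]; the
# specific frequencies and dimensions are assumptions, not the paper's design.
import torch

def fourier_granularity_embedding(granularity: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Map a scalar granularity in [0, 1] to a 2*num_freqs-dim embedding."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=granularity.dtype)   # 1, 2, 4, ..., 2^(num_freqs-1)
    angles = 2 * torch.pi * granularity[..., None] * freqs            # (..., num_freqs)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (..., 2*num_freqs)

# Example: embed two granularity prompts (mid-level and near whole-object).
emb = fourier_granularity_embedding(torch.tensor([0.3, 0.8]))
print(emb.shape)  # torch.Size([2, 16])
```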
However, teaching a pre-trained segmentation model what granularity means is not trivial. We find that human-labeled datasets like SA-1B are ill-suited for granularity learning because of strong annotator biases. 💡 Our key idea: use a self-supervised, granularity-aware
1 reply · 0 reposts · 2 likes
Why granularity? Because “what is an object?” is not universal: Sometimes you want a part. Sometimes the whole. Sometimes something in between. UnSAMv2 gives users, not datasets, full control over objectness! 🎉 SAM & SAM-2 return up to 3 discrete masks. But prompts often
1 reply · 0 reposts · 2 likes
🚀 We open-sourced everything for "UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity"! 🔥 Check out our project page and arXiv paper for more visualizations and technical insights! 👇 🔗 Code: https://t.co/3gmm8XyhvE 📷 arXiv:
github.com
Code release for "UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity" - yujunwei04/UnSAMv2
1 reply · 0 reposts · 2 likes
Objectness should be user-defined — not human-label-defined! Unsupervised SAM 2 (UnSAMv2) makes it real✨ 1 point + a continuous granularity slider = the mask you want! UnSAMv2 beats SAM2: +16% NoC-90, +26% 1-IoU, +37% AR on 11+ datasets (w/ just 6k unlabeled images)!💪 1/n
1 reply · 10 reposts · 17 likes
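To make "1 point + a continuous granularity slider = the mask you want" concrete, here is a hypothetical usage sketch. The predictor object and its predict() signature, including the granularity argument, are assumptions for illustration; check the released repo for the actual API.

```python
# Hypothetical usage of a point prompt plus a continuous granularity value;
# the predictor interface below is assumed, not the released UnSAMv2 API.
import numpy as np

def segment_at_granularity(predictor, image, point_xy, granularity: float):
    """Return one mask for a single positive click at the requested granularity in [0, 1]."""
    masks = predictor.predict(
        image=image,
        point_coords=np.array([point_xy]),   # one positive click
        point_labels=np.array([1]),
        granularity=granularity,             # 0.0 ~ fine part, 1.0 ~ whole object
    )
    return masks[0]

# Sweeping the slider from parts to whole object with the same click:
# for g in (0.1, 0.5, 0.9):
#     mask = segment_at_granularity(predictor, image, (320, 240), g)
```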
I am recruiting Ph.D. students at @umdcs starting Fall 2026! I am looking for students in three broad areas: (1) Physics-integrated computer vision (2) VLMs with constraints (3) Dual-use AI policy We're ranked #3 in AI on @CSrankings! Specific details in 🧵
12 replies · 106 reposts · 407 likes
🚀 Introducing ECHO — an in-the-wild benchmark for image generation models! Old benchmarks miss what’s trending — new capabilities, creative use cases, and real-world prompts. ECHO turns social media discussions into structured evaluations!!! 💡 For example: is GPT-4o or
echo-bench.github.io
1 reply · 0 reposts · 13 likes
✨Introducing ECHO, the newest in-the-wild image generation benchmark! You’ve seen new image models and new use cases discussed on social media, but old benchmarks don’t test them! We distilled this qualitative discussion into a structured benchmark. 🔗 https://t.co/wJmmEY8TFQ
3 replies · 32 reposts · 117 likes
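One way to picture "distilling qualitative discussion into a structured benchmark" is as a list of typed items like the sketch below; the field names are illustrative assumptions, not ECHO's actual schema.

```python
# Illustrative shape of an in-the-wild benchmark item distilled from online
# discussion; field names are assumptions, not the ECHO schema.
from dataclasses import dataclass
from typing import List

@dataclass
class EchoStyleItem:
    prompt: str           # real-world prompt surfaced from social-media discussion
    capability: str       # e.g. "text rendering", "style transfer", "instruction edits"
    source_url: str       # where the use case was discussed
    criteria: List[str]   # structured checks a human or model judge can score

item = EchoStyleItem(
    prompt="Turn this photo into a 1990s anime still, keep the text on the sign readable",
    capability="style transfer + text preservation",
    source_url="https://example.com/discussion-thread",  # placeholder
    criteria=["anime style applied", "sign text unchanged and legible"],
)
```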
Generalization is the biggest problem for robotics right now. This includes generalization to unseen objects, environments, tasks… Our recent work shows that generalization to novel objects might not be *that* hard. Specifically, we show that robots, trained on **randomly
2 replies · 8 reposts · 28 likes
(1/n) 🚀 Your VLM can be a great multimodal encoder for image editing and generation if you use the middle layers wisely (yes, plural 😉). We are thrilled to present UniFusion, the first architecture that uses only a VLM as the input-condition encoder, without auxiliary signals from a VAE
2 replies · 13 reposts · 23 likes
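A hedged sketch of the "use the middle layers (plural)" idea: pull hidden states from several intermediate VLM layers and fuse them into a single conditioning sequence for the generator. The layer indices and the simple averaging are illustrative assumptions, not the UniFusion design; the hidden-state access follows the common transformers-style output_hidden_states interface.

```python
# Illustrative middle-layer conditioning; layer choice and fusion are assumptions.
import torch

def middle_layer_conditioning(vlm, tokens, layer_ids=(12, 16, 20)):
    """Collect hidden states from several middle layers and fuse them into one
    conditioning sequence for an image generator/editor."""
    out = vlm(tokens, output_hidden_states=True)
    picked = [out.hidden_states[i] for i in layer_ids]   # each: (B, T, D)
    return torch.stack(picked, dim=0).mean(dim=0)        # simple fusion: average the layers
```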
Congratulations to the BAIR researchers from the @trevordarrell @pabbeel @JitendraMalikCV @akanazawa labs, who won the Best Student Paper award at the just-concluded #CoRL2025 in Seoul, Korea, for their paper "Visual Imitation Enables Contextual Humanoid Control." @arthurallshire
Congratulations to the videomimic team for winning the best student paper award at CoRL 2025 🥹🎉 Grateful to the CoRL community for the recognition!
5 replies · 6 reposts · 95 likes
🙏 Huge thanks to my amazing co-authors Ji Xie, @trevordarrell and @LukeZettlemoyer for their brilliant ideas and support; this work wouldn't have been possible without you! n/n
0 replies · 0 reposts · 3 likes