XuDong Wang

@XDWang101

1K Followers · 2K Following · 42 Media · 131 Statuses

Research Scientist @AIatMeta, Incoming Assistant Professor @DukeU | PhD from @Berkeley_AI @UCBerkeley | Prev: @GoogleDeepMind, FAIR @MetaAI

Berkeley, CA
Joined March 2020
@lisabdunlap
Lisa Dunlap
11 hours
🧵Tired of scrolling through your horribly long model traces in VSCode to figure out why your model failed? We made StringSight to fix this: an automated pipeline for analyzing your model outputs at scale. ➡️Demo: https://t.co/FJ4GAxPIkx ➡️Blog: https://t.co/3AyXBFBEmV
3
26
63
@XDWang101
XuDong Wang
7 days
Excited to share our latest research from Meta FAIR: TV2TV 🚀
TV2TV is a unified model that interleaves
– text reasoning (next-token)
– video generation (next-frame)
At inference, the model dynamically alternates between thinking in text and generating video! Our paper
[arxiv.org] Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should...
@XiaochuangHan
Xiaochuang Han
7 days
Can we simplify video generation by decomposing it into interleaved text-video co-generation? Would explicit, repeated thinking in language improve generation in pixels? We introduce TV2TV: a unified model that jointly learns
- language modeling (next-token prediction)
- video
0
2
19
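The alternating inference described in the TV2TV tweets above can be pictured with a short Python sketch. Everything here (the model object, its `predict_mode` / `next_text_token` / `next_video_frame` methods, and the mode signals) is an assumption for illustration, not the released TV2TV interface.

```python
# Hypothetical sketch of TV2TV-style interleaved decoding: the model alternates
# between emitting text tokens (reasoning) and video frames (generation).
# The model object, its methods, and the mode signal are assumptions,
# not the actual released interface.
from dataclasses import dataclass, field


@dataclass
class InterleavedOutput:
    text: list = field(default_factory=list)    # generated reasoning tokens
    frames: list = field(default_factory=list)  # generated video frames


def interleaved_decode(model, prompt, max_steps: int = 256) -> InterleavedOutput:
    """Alternate between next-token text reasoning and next-frame video generation."""
    state = model.encode(prompt)
    out = InterleavedOutput()
    for _ in range(max_steps):
        mode = model.predict_mode(state)  # model chooses: think in text, render a frame, or stop
        if mode == "text":
            token, state = model.next_text_token(state)
            out.text.append(token)
        elif mode == "video":
            frame, state = model.next_video_frame(state)
            out.frames.append(frame)
        else:  # end-of-generation
            break
    return out
```

The point the thread makes is that the switch between thinking in text and rendering frames is decided by the model at each step rather than fixed in advance.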
@TengdaHan
Tengda Han
14 days
Human perception is active: we move around to see, and we see with intention. In our latest work "Seeing without Pixels", we find "how you see" (how the camera moves) roughly reveals "what you do" or "what you observe" -- and this connection can be easily learned from data.
2
16
136
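As a toy illustration of the "this connection can be easily learned from data" claim above: a linear classifier mapping camera-motion trajectories to activity labels. The data below is random noise and the label set is made up, so the printed score is chance level; it only shows the shape of such a pipeline, not the paper's actual setup.

```python
# Toy pipeline shape only (not the paper's method or data): learn a mapping from
# camera-motion trajectories to activity labels. The trajectories below are random
# noise and the 4 labels are made up, so the printed score is chance level.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

n_clips, n_frames = 200, 32
# Each clip: per-frame 6-DoF pose deltas (tx, ty, tz, roll, pitch, yaw), flattened.
trajectories = rng.normal(size=(n_clips, n_frames * 6))
labels = rng.integers(0, 4, size=n_clips)  # e.g. {walk, turn, sit down, reach}

clf = LogisticRegression(max_iter=1000).fit(trajectories[:150], labels[:150])
print("held-out accuracy:", clf.score(trajectories[150:], labels[150:]))
```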
@XDWang101
XuDong Wang
12 days
Qianqian is an amazing researcher and will be a fantastic advisor. If you’re looking for a lab that values creativity, curiosity, and big ideas, keep an eye on hers! ✨ @QianqianWang5
@QianqianWang5
Qianqian Wang
13 days
I'm recruiting multiple PhD students this cycle to join me at Harvard University and the Kempner Institute! My interests span vision and intelligence, including 3D/4D, active perception, memory, representation learning, and anything you're excited to explore! Deadline: Dec 15th.
0
0
3
@XDWang101
XuDong Wang
25 days
This work would not have been possible without the hard work of Junwei Yu and the support from @trevordarrell! Huge thanks to the team! Let us know what you build with UnSAMv2 — we can’t wait to see the new workflows this enables. 🚀 n/n
0
0
6
@XDWang101
XuDong Wang
25 days
UnSAMv2 works across:
• Interactive segmentation
• Whole-image segmentation
• Video segmentation (even though trained only on images!)
• Any granularity from fine parts → full objects
7/n
1
0
3
@XDWang101
XuDong Wang
25 days
And with only 6,000 unlabeled images, UnSAMv2 delivers strong gains over SAM-2:
📈 +16.5% NoC-90
📈 +26.0% 1-IoU
📈 +37.7% AR on whole-image segmentation
Evaluated on 11+ benchmarks across objects, parts, and videos. 6/n
1
0
2
@XDWang101
XuDong Wang
25 days
We leverage SAM-2 as the base model and introduce a Fourier-based granularity encoding module & mask token. All this with just 0.02% extra parameters on top of SAM-2 — UnSAMv2 is almost free to add. 🚀 5/n
1
0
2
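A rough sketch of what a Fourier-based encoding of a scalar granularity value could look like, in the spirit of standard Fourier-feature encodings. The module name, frequency schedule, and dimensions below are assumptions, not the actual UnSAMv2 implementation; the real module, and how its output is injected as a mask token, may differ (see the code link further down the thread).

```python
# Rough sketch of a Fourier-feature encoding for a scalar granularity in [0, 1].
# The frequency schedule, dimensions, and projection below are illustrative
# assumptions, not the actual UnSAMv2 module.
import math
import torch
import torch.nn as nn


class GranularityEncoder(nn.Module):
    def __init__(self, num_freqs: int = 16, embed_dim: int = 256):
        super().__init__()
        # Log-spaced frequencies, as in common Fourier/positional encodings.
        self.register_buffer("freqs", (2.0 ** torch.arange(num_freqs)) * math.pi)
        self.proj = nn.Linear(2 * num_freqs, embed_dim)  # very few extra parameters

    def forward(self, granularity: torch.Tensor) -> torch.Tensor:
        # granularity: (batch,) scalars in [0, 1] selecting fine parts -> whole objects
        angles = granularity[:, None] * self.freqs[None, :]      # (batch, num_freqs)
        feats = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (batch, 2 * num_freqs)
        return self.proj(feats)                                  # (batch, embed_dim) "granularity token"


enc = GranularityEncoder()
tokens = enc(torch.tensor([0.25, 0.9]))  # one embedding per requested granularity
print(tokens.shape)                      # torch.Size([2, 256])
```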
@XDWang101
XuDong Wang
25 days
However, teaching a pre-trained segmentation model what granularity means is not trivial. We find that human-labeled datasets like SA-1B struggle with granularity learning because of strong annotator biases. 💡 Our key idea: Use a self-supervised, granularity-aware
1
0
2
@XDWang101
XuDong Wang
25 days
Why granularity? Because “what is an object?” is not universal: Sometimes you want a part. Sometimes the whole. Sometimes something in between. UnSAMv2 gives users, not datasets, full control over objectness! 🎉 SAM & SAM-2 return up to 3 discrete masks. But prompts often
1
0
2
@XDWang101
XuDong Wang
25 days
🚀 We open-sourced everything on "UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity"! 🔥 Check out our project page and arXiv paper for more visualizations and technical insights! 👇 🔗 Code: https://t.co/3gmm8XyhvE 📷 arXiv:
[github.com] Code release for "UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity" - yujunwei04/UnSAMv2
1
0
2
@XDWang101
XuDong Wang
25 days
Objectness should be user-defined — not human-label-defined! Unsupervised SAM 2 (UnSAMv2) makes it real✨ 1 point + a continuous granularity slider = the mask you want! UnSAMv2 beats SAM2: +16% NoC-90, +26% 1-IoU, +37% AR on 11+ datasets (w/ just 6k unlabeled images)!💪 1/n
1
10
17
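A hypothetical usage sketch of the "1 point + a continuous granularity slider" interaction described in this thread. The `segment_at_granularity` helper and the `predictor.predict(...)` signature are placeholders, not the released UnSAMv2 API; see the linked GitHub repo for the real entry points.

```python
# Hypothetical interface sketch (placeholder names, not the released UnSAMv2 API):
# one foreground click plus a continuous granularity value selects the mask scale.
import numpy as np


def segment_at_granularity(predictor, image, point_xy, granularity: float):
    """Return a mask for the clicked point, scaled from fine parts (0.0) to whole objects (1.0)."""
    assert 0.0 <= granularity <= 1.0
    return predictor.predict(
        image=image,
        point_coords=np.array([point_xy]),
        point_labels=np.array([1]),  # 1 = foreground click
        granularity=granularity,     # the continuous "slider"
    )


# e.g. sweep the slider to go from a wheel, to the car body, to the whole car:
# for g in (0.1, 0.5, 0.9):
#     mask = segment_at_granularity(predictor, image, (420, 310), g)
```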
@Ritwik_G
Ritwik Gupta 🇺🇦
1 month
I am recruiting Ph.D. students at @umdcs starting Fall 2026! I am looking for students in three broad areas:
(1) Physics-integrated computer vision
(2) VLMs with constraints
(3) Dual-use AI policy
We're ranked #3 in AI on @CSrankings! Specific details in 🧵
12
106
407
@XDWang101
XuDong Wang
2 months
🚀 Introducing ECHO — an in-the-wild benchmark for image generation models! Old benchmarks miss what’s trending — new capabilities, creative use cases, and real-world prompts. ECHO turns social media discussions into structured evaluations!!! 💡 For example: is GPT-4o or
echo-bench.github.io
@aomaru_21490
Jiaxin Ge
2 months
✨Introducing ECHO, the newest in-the-wild image generation benchmark! You’ve seen new image models and new use cases discussed on social media, but old benchmarks don’t test them! We distilled this qualitative discussion into a structured benchmark. 🔗 https://t.co/wJmmEY8TFQ
1
0
13
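A rough sketch of what one "structured evaluation" distilled from an in-the-wild discussion might look like as a record. The schema and field names below are assumptions for illustration, not ECHO's actual format; see echo-bench.github.io for the real benchmark.

```python
# Illustrative schema only (field names are assumptions, not ECHO's actual format):
# each entry turns an in-the-wild discussion into a reproducible evaluation item.
from dataclasses import dataclass


@dataclass
class EchoEntry:
    prompt: str          # image-generation prompt distilled from the discussion
    capability: str      # e.g. "text rendering", "style transfer", "multi-subject"
    source_url: str      # the social-media post the use case came from
    criteria: list[str]  # what a judge (human or model) should check in the output


example = EchoEntry(
    prompt="A hand-drawn subway map of a fictional city with readable station names",
    capability="text rendering",
    source_url="https://example.com/post/123",  # placeholder, not a real source
    criteria=["station names are legible", "map layout is coherent"],
)
```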
@baifeng_shi
Baifeng
2 months
Generalization is the biggest problem for robotics right now. This includes generalization to unseen objects, environments, tasks… Our recent work shows that generalization to novel objects might not be *that* hard. Specifically, we show that robots, trained on **randomly
2
8
28
@curiouskid423
Kevin Li
2 months
(1/n) 🚀 Your VLM can be a great multimodal encoder for image editing and generation if you use the middle layers wisely (yes, plural 😉). We are thrilled to present UniFusion - the first architecture that uses only a VLM as the input-condition encoder, without auxiliary signals from a VAE
2
13
23
@berkeley_ai
Berkeley AI Research
3 months
Congratulations to BAIR Researchers from @trevordarrell @pabbeel @JitendraMalikCV @akanazawa labs who won the Best Student Paper award at the just concluded #CoRL2025 in Seoul, Korea for their paper "Visual Imitation Enables Contextual Humanoid Control." @arthurallshire
@akanazawa
Angjoo Kanazawa
3 months
Congratulations to the videomimic team for winning the best student paper award at CoRL 2025 🥹🎉 Grateful to the CoRL community for the recognition!
5
6
95
@XDWang101
XuDong Wang
3 months
🙏 Huge thanks to my amazing co-authors Ji Xie, @trevordarrell @LukeZettlemoyer for their brilliant ideas and support — this work wouldn’t be possible without you! n/n
0
0
3