XuDong Wang
@XDWang101
Followers: 1K · Following: 2K · Media: 42 · Statuses: 131
Research Scientist @AIatMeta, Incoming Assistant Professor @DukeU | PhD from @Berkeley_AI @UCBerkeley | Prev: @GoogleDeepMind, FAIR @MetaAI
Berkeley, CA
Joined March 2020
🧵Tired of scrolling through your horribly long model traces in VSCode to figure out why your model failed? We made StringSight to fix this: an automated pipeline for analyzing your model outputs at scale. ➡️Demo: https://t.co/FJ4GAxPIkx ➡️Blog: https://t.co/3AyXBFBEmV
3 replies · 26 reposts · 63 likes
Excited to share our latest research from Meta FAIR: TV2TV 🚀 TV2TV is a unified model that interleaves – text reasoning (next-token) – video generation (next-frame) At inference, the model dynamically alternates between thinking in text and generating video! Our paper
arxiv.org
Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should...
0 replies · 2 reposts · 19 likes
Can we simplify video generation by decomposing it into interleaved text-video co-generation? Would explicit, repeated thinking in language improve generation in pixels? We introduce TV2TV: a unified model that jointly learns - language modeling (next-token prediction) - video
4 replies · 37 reposts · 86 likes
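To give a sense of what "dynamically alternates between thinking in text and generating video" could look like at inference time, here is a minimal sketch. The `model` interface below (choose_mode, generate_text, generate_frames) is an assumption for illustration, not the released TV2TV API.

```python
# Hypothetical sketch of TV2TV-style interleaved decoding; the `model` interface
# (choose_mode / generate_text / generate_frames) is assumed for illustration,
# not the actual TV2TV implementation.
def interleaved_generate(model, prompt, max_segments=8):
    """Alternate between next-token text reasoning and next-frame video generation."""
    trace = []                                   # list of ("text", tokens) / ("video", frames)
    state = model.encode(prompt)
    for _ in range(max_segments):
        mode = model.choose_mode(state)          # "text", "video", or "stop"
        if mode == "stop":
            break
        if mode == "text":
            tokens, state = model.generate_text(state)    # next-token prediction
            trace.append(("text", tokens))
        else:
            frames, state = model.generate_frames(state)  # next-frame prediction
            trace.append(("video", frames))
    return trace
```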
Human perception is active: we move around to see, and we see with intention. In our latest work "Seeing without Pixels", we find "how you see" (how the camera moves) roughly reveals "what you do" or "what you observe" -- and this connection can be easily learned from data.
2 replies · 16 reposts · 136 likes
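As a rough illustration of learning "what you do" from "how you see" (camera motion alone, no pixels), a small classifier over a sequence of relative camera poses is sketched below. The pose dimensionality, GRU encoder, and action vocabulary size are illustrative assumptions, not the paper's model.

```python
# Hedged sketch: predict an action label from camera-motion trajectories alone.
# Architecture choices here are assumptions for illustration only.
import torch
import torch.nn as nn

class MotionToAction(nn.Module):
    def __init__(self, pose_dim=6, hidden=128, num_actions=50):
        super().__init__()
        self.encoder = nn.GRU(pose_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, poses):                # poses: (B, T, 6) relative camera motion
        _, h = self.encoder(poses)           # h: (1, B, hidden)
        return self.head(h[-1])              # logits over action classes

logits = MotionToAction()(torch.randn(2, 64, 6))
print(logits.shape)  # torch.Size([2, 50])
```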
Qianqian is an amazing researcher and will be a fantastic advisor. If you’re looking for a lab that values creativity, curiosity, and big ideas, keep an eye on hers! ✨ @QianqianWang5
I'm recruiting multiple PhD students this cycle to join me at Harvard University and the Kempner Institute! My interests span vision and intelligence, including 3D/4D, active perception, memory, representation learning, and anything you're excited to explore! Deadline: Dec 15th.
0 replies · 0 reposts · 3 likes
This work would not have been possible without the hard work of Junwei Yu and the support from @trevordarrell! Huge thanks to the team! Let us know what you build with UnSAMv2; we can't wait to see the new workflows it enables. 🚀 n/n
0 replies · 0 reposts · 6 likes
UnSAMv2 works across: • Interactive segmentation • Whole-image segmentation • Video segmentation (even though trained only on images!) • Any granularity from fine parts → full objects 7/n
1 reply · 0 reposts · 3 likes
And with only 6,000 unlabeled images, UnSAMv2 delivers strong gains over SAM-2: 📈 +16.5% NoC-90 📈 +26.0% 1-IoU 📈 +37.7% AR on whole-image segmentation Evaluated on 11+ benchmarks across objects, parts, and videos. 6/n
1 reply · 0 reposts · 2 likes
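For readers unfamiliar with the interactive-segmentation metrics above: NoC-90 is the number of clicks needed to reach 90% IoU (lower is better), 1-IoU is the IoU after a single click, and AR is average recall for automatic whole-image masks. A generic sketch of the first two, not the authors' evaluation code:

```python
# Generic sketch of two standard interactive-segmentation metrics (not the
# UnSAMv2 evaluation code): NoC@threshold and 1-click IoU.
def noc_at_threshold(iou_per_click, thresh=0.90, max_clicks=20):
    """Number of clicks until IoU first reaches `thresh` (max_clicks if never reached)."""
    for k, iou in enumerate(iou_per_click[:max_clicks], start=1):
        if iou >= thresh:
            return k
    return max_clicks

def one_click_iou(iou_per_click):
    """IoU achieved after a single click."""
    return iou_per_click[0]

# Example: a model whose mask reaches 0.9 IoU on the 3rd click.
curve = [0.62, 0.81, 0.93, 0.95]
print(noc_at_threshold(curve))  # 3
print(one_click_iou(curve))     # 0.62
```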
We use SAM-2 as the base model and introduce a Fourier-based granularity encoding module and a mask token, all with just 0.02% extra parameters on top of SAM-2, so UnSAMv2 is almost free to add. 🚀 5/n
1 reply · 0 reposts · 2 likes
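A hedged sketch of what a Fourier-based encoding of a continuous granularity value might look like; the frequencies, output dimension, and sin/cos pairing below are illustrative assumptions, not the exact UnSAMv2 module.

```python
# Illustrative Fourier-style embedding of a scalar granularity in [0, 1]; the
# specific frequencies and dimensions are assumptions, not the paper's design.
import torch

def fourier_granularity_embedding(granularity: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Map a scalar granularity in [0, 1] to a 2*num_freqs-dim embedding."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=granularity.dtype)   # 1, 2, 4, ..., 2^(num_freqs-1)
    angles = 2 * torch.pi * granularity[..., None] * freqs            # (..., num_freqs)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (..., 2*num_freqs)

# Example: embed two granularity prompts (mid-level and near whole-object).
emb = fourier_granularity_embedding(torch.tensor([0.3, 0.8]))
print(emb.shape)  # torch.Size([2, 16])
```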
However, teaching a pre-trained segmentation model what granularity means is not trivial. We find that human-labeled datasets like SA-1B are ill-suited for granularity learning because of strong annotator biases. 💡 Our key idea: use a self-supervised, granularity-aware
1 reply · 0 reposts · 2 likes
Why granularity? Because “what is an object?” is not universal: Sometimes you want a part. Sometimes the whole. Sometimes something in between. UnSAMv2 gives users, not datasets, full control over objectness! 🎉 SAM & SAM-2 return up to 3 discrete masks. But prompts often
1 reply · 0 reposts · 2 likes
🚀 We open-sourced everything for "UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity"! 🔥 Check out our project page and arXiv paper for more visualizations and technical insights! 👇 🔗 Code: https://t.co/3gmm8XyhvE 📷 arXiv:
github.com
Code release for "UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity" - yujunwei04/UnSAMv2
1 reply · 0 reposts · 2 likes
Objectness should be user-defined — not human-label-defined! Unsupervised SAM 2 (UnSAMv2) makes it real✨ 1 point + a continuous granularity slider = the mask you want! UnSAMv2 beats SAM2: +16% NoC-90, +26% 1-IoU, +37% AR on 11+ datasets (w/ just 6k unlabeled images)!💪 1/n
1 reply · 10 reposts · 17 likes
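To make "1 point + a continuous granularity slider = the mask you want" concrete, here is a hypothetical usage sketch. The predictor object and its predict() signature, including the granularity argument, are assumptions for illustration; check the released repo for the actual API.

```python
# Hypothetical usage of a point prompt plus a continuous granularity value;
# the predictor interface below is assumed, not the released UnSAMv2 API.
import numpy as np

def segment_at_granularity(predictor, image, point_xy, granularity: float):
    """Return one mask for a single positive click at the requested granularity in [0, 1]."""
    masks = predictor.predict(
        image=image,
        point_coords=np.array([point_xy]),   # one positive click
        point_labels=np.array([1]),
        granularity=granularity,             # 0.0 ~ fine part, 1.0 ~ whole object
    )
    return masks[0]

# Sweeping the slider from parts to whole object with the same click:
# for g in (0.1, 0.5, 0.9):
#     mask = segment_at_granularity(predictor, image, (320, 240), g)
```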
I am recruiting Ph.D. students at @umdcs starting Fall 2026! I am looking for students in three broad areas: (1) Physics-integrated computer vision (2) VLMs with constraints (3) Dual-use AI policy We're ranked #3 in AI on @CSrankings! Specific details in 🧵
12 replies · 106 reposts · 407 likes
🚀 Introducing ECHO — an in-the-wild benchmark for image generation models! Old benchmarks miss what’s trending — new capabilities, creative use cases, and real-world prompts. ECHO turns social media discussions into structured evaluations!!! 💡 For example: is GPT-4o or
echo-bench.github.io
1 reply · 0 reposts · 13 likes
✨Introducing ECHO, the newest in-the-wild image generation benchmark! You’ve seen new image models and new use cases discussed on social media, but old benchmarks don’t test them! We distilled this qualitative discussion into a structured benchmark. 🔗 https://t.co/wJmmEY8TFQ
3 replies · 32 reposts · 117 likes
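One way to picture "distilling qualitative discussion into a structured benchmark" is as a list of typed items like the sketch below; the field names are illustrative assumptions, not ECHO's actual schema.

```python
# Illustrative shape of an in-the-wild benchmark item distilled from online
# discussion; field names are assumptions, not the ECHO schema.
from dataclasses import dataclass
from typing import List

@dataclass
class EchoStyleItem:
    prompt: str           # real-world prompt surfaced from social-media discussion
    capability: str       # e.g. "text rendering", "style transfer", "instruction edits"
    source_url: str       # where the use case was discussed
    criteria: List[str]   # structured checks a human or model judge can score

item = EchoStyleItem(
    prompt="Turn this photo into a 1990s anime still, keep the text on the sign readable",
    capability="style transfer + text preservation",
    source_url="https://example.com/discussion-thread",  # placeholder
    criteria=["anime style applied", "sign text unchanged and legible"],
)
```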
Generalization is the biggest problem for robotics right now. This includes generalization to unseen objects, environments, tasks… Our recent work shows that generalization to novel objects might not be *that* hard. Specifically, we show that robots, trained on **randomly
2 replies · 8 reposts · 28 likes
(1/n) 🚀 Your VLM can be a great multimodal encoder for image editing and generation if you use the middle layers wisely (yes, plural 😉). We are thrilled to present UniFusion, the first architecture that uses only a VLM as the input-condition encoder, without auxiliary signals from a VAE
2 replies · 13 reposts · 23 likes
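A hedged sketch of the "use the middle layers (plural)" idea: pull hidden states from several intermediate VLM layers and fuse them into a single conditioning sequence for the generator. The layer indices and the simple averaging are illustrative assumptions, not the UniFusion design; the hidden-state access follows the common transformers-style output_hidden_states interface.

```python
# Illustrative middle-layer conditioning; layer choice and fusion are assumptions.
import torch

def middle_layer_conditioning(vlm, tokens, layer_ids=(12, 16, 20)):
    """Collect hidden states from several middle layers and fuse them into one
    conditioning sequence for an image generator/editor."""
    out = vlm(tokens, output_hidden_states=True)
    picked = [out.hidden_states[i] for i in layer_ids]   # each: (B, T, D)
    return torch.stack(picked, dim=0).mean(dim=0)        # simple fusion: average the layers
```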
Congratulations to the BAIR researchers from the @trevordarrell @pabbeel @JitendraMalikCV @akanazawa labs, who won the Best Student Paper award at the just-concluded #CoRL2025 in Seoul, Korea, for their paper "Visual Imitation Enables Contextual Humanoid Control." @arthurallshire
Congratulations to the videomimic team for winning the best student paper award at CoRL 2025 🥹🎉 Grateful to the CoRL community for the recognition!
5 replies · 6 reposts · 95 likes
🙏 Huge thanks to my amazing co-authors Ji Xie, @trevordarrell and @LukeZettlemoyer for their brilliant ideas and support; this work wouldn't have been possible without you! n/n
0 replies · 0 reposts · 3 likes