Stanford Vision and Learning Lab
@StanfordSVL
Followers: 16K · Following: 329 · Media: 13 · Statuses: 364
SVL is led by @drfeifei @silviocinguetta @jcniebles @jiajunwu_cs and works on machine learning, computer vision, robotics and language
Stanford, CA
Joined September 2014
Introducing Ctrl-VI, a video sampling method allowing for a flexible set of user controls—ranging from coarse but easy-to-specify text prompts to precise camera/object trajectories. (1/n) https://t.co/ZajlgHQOG4
4 replies · 32 reposts · 222 likes
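For readers skimming the thread, here is a minimal sketch of what a mixed control specification in the spirit of Ctrl-VI could look like, combining a coarse text prompt with precise camera and object trajectories in one request. Every class and field name below is hypothetical, not the released interface.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CameraPose:
    position: List[float]   # camera location (x, y, z) in world coordinates
    look_at: List[float]    # point the camera faces

@dataclass
class ObjectTrajectory:
    object_name: str              # which object the waypoints constrain
    waypoints: List[List[float]]  # per-keyframe (x, y, z) positions

@dataclass
class VideoControls:
    text_prompt: str                                # coarse, easy to specify
    camera_path: Optional[List[CameraPose]] = None  # precise camera control
    object_paths: List[ObjectTrajectory] = field(default_factory=list)

controls = VideoControls(
    text_prompt="a red kite drifting over a beach at sunset",
    camera_path=[CameraPose([0.0, 1.5, 0.0], [0.0, 1.5, -5.0]),
                 CameraPose([0.5, 1.5, -1.0], [0.0, 1.5, -6.0])],
    object_paths=[ObjectTrajectory("kite", [[0.0, 3.0, -4.0], [0.2, 3.1, -4.2]])],
)
print(controls.text_prompt, len(controls.object_paths))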
Super excited about this line of work! 🚀 A simple, scalable recipe for training diffusion language models using autoregressive models. We're releasing our tech report, model weights, and inference code!
Introducing RND1, the most powerful base diffusion language model (DLM) to date. RND1 (Radical Numerics Diffusion) is an experimental DLM with 30B params (3B active) and a sparse MoE architecture. We are making it open source, releasing weights, training details, and code to […]
0 replies · 9 reposts · 67 likes
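As background, here is a minimal sketch of the masked-denoising objective commonly used when training diffusion language models on top of an existing transformer backbone: corrupt a random fraction of tokens and train the model to recover them. This is a generic illustration under assumed names, not the RND1 recipe from the tech report.

import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens, mask_id):
    # Sample a corruption level t ~ U(0, 1) per sequence and mask roughly t of its tokens.
    t = torch.rand(tokens.size(0), 1, device=tokens.device)
    corrupt = torch.rand(tokens.shape, device=tokens.device) < t
    corrupt[:, 0] = True  # toy guard: keep at least one masked position per sequence
    noisy = torch.where(corrupt, torch.full_like(tokens, mask_id), tokens)
    logits = model(noisy)                      # (batch, seq, vocab)
    # Score only the masked positions against the original tokens.
    return F.cross_entropy(logits[corrupt], tokens[corrupt])

# Toy stand-in for a pretrained backbone: embedding plus a linear head.
vocab_size, mask_id = 100, 99
model = torch.nn.Sequential(torch.nn.Embedding(vocab_size, 32),
                            torch.nn.Linear(32, vocab_size))
tokens = torch.randint(0, mask_id, (2, 16))
print(masked_diffusion_loss(model, tokens, mask_id))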
Grafting Diffusion Transformers accepted to #NeurIPS2025 as an Oral! We have lots of interesting analysis, a test bed for model grafting, and insights🚀 📄Paper: https://t.co/OjsrOZi7in 🌎Website:
[Link preview from arxiv.org: "Designing model architectures requires decisions such as selecting operators (e.g., attention, convolution) and configurations (e.g., depth, width). However, evaluating the impact of these..."]
1/ Model architectures have been mostly treated as fixed post-training. 🌱 Introducing Grafting: A new way to edit pretrained diffusion transformers, allowing us to customize architectural designs on a small compute budget. 🌎 https://t.co/fjOTVqfVZr Co-led with @MichaelPoli6
7 replies · 38 reposts · 209 likes
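A hedged sketch of the grafting idea described above, using a toy block rather than the paper's code: freeze a pretrained diffusion-transformer block, swap its attention operator for a cheaper one (here a depthwise convolution), and leave only the new operator trainable so it can be adapted on a small compute budget.

import torch
import torch.nn as nn

class ToyDiTBlock(nn.Module):
    """Stand-in for one pretrained diffusion-transformer block."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mixer = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                     # x: (batch, tokens, dim)
        h = self.norm1(x)
        if isinstance(self.mixer, nn.MultiheadAttention):
            h, _ = self.mixer(h, h, h)
        else:                                 # grafted operator mixes over tokens
            h = self.mixer(h.transpose(1, 2)).transpose(1, 2)
        x = x + h
        return x + self.mlp(self.norm2(x))

block = ToyDiTBlock()
for p in block.parameters():                  # freeze the "pretrained" weights
    p.requires_grad = False

# Graft: replace self-attention with a depthwise 1D convolution over tokens.
block.mixer = nn.Conv1d(64, 64, kernel_size=5, padding=2, groups=64)
trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
print(block(torch.randn(1, 16, 64)).shape, trainable, "trainable params")

In the actual work the grafted operator is then trained or distilled; this sketch only shows the swap-and-freeze pattern.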
#ICCV2025 🤩3D world generation is cool, but it is cooler to play with the worlds using 3D actions 👆💨, and see what happens! — Introducing *WonderPlay*: Now you can create dynamic 3D scenes that respond to your 3D actions from a single image! Web: https://t.co/uFOzA8t0P8 🧵1/7
6 replies · 42 reposts · 184 likes
(1/n) Time to unify your favorite visual generative models, VLMs, and simulators for controllable visual generation—Introducing a Product of Experts (PoE) framework for inference-time knowledge composition from heterogeneous models.
5 replies · 70 reposts · 305 likes
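A minimal sketch of product-of-experts composition as referenced above: each expert scores the same candidates, and the composed distribution is the normalized weighted product p(x) ∝ ∏_i p_i(x)^{w_i}, i.e. a weighted sum of log-probabilities. The expert roles and numbers below are made up for illustration, not taken from the paper.

import numpy as np

def product_of_experts(log_probs, weights):
    """log_probs: (num_experts, num_candidates); weights: (num_experts,)."""
    combined = (np.asarray(weights)[:, None] * np.asarray(log_probs)).sum(axis=0)
    combined -= combined.max()                 # subtract max for numerical stability
    p = np.exp(combined)
    return p / p.sum()

# e.g. a video generator, a VLM verifier, and a simulator each scoring
# four candidate continuations (hypothetical probabilities).
experts = np.log(np.array([[0.40, 0.30, 0.20, 0.10],
                           [0.10, 0.50, 0.30, 0.10],
                           [0.25, 0.25, 0.25, 0.25]]))
print(product_of_experts(experts, weights=[1.0, 1.0, 0.5]))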
1/ Model architectures have been mostly treated as fixed post-training. 🌱 Introducing Grafting: A new way to edit pretrained diffusion transformers, allowing us to customize architectural designs on a small compute budget. 🌎 https://t.co/fjOTVqfVZr Co-led with @MichaelPoli6
13 replies · 58 reposts · 243 likes
We'll be presenting Deep Schema Grounding at @iclr_conf 🇸🇬 on Thursday (session 1 #98). Come chat about abstract visual concepts, structured decomposition, & what makes a maze a maze! & test your models on our challenging Visual Abstractions Benchmark:
What makes a maze look like a maze? Humans can reason about infinitely many instantiations of mazes—made of candy canes, sticks, icing, yarn, etc. But VLMs often struggle to make sense of such visual abstractions. We improve VLMs' ability to interpret these abstract concepts.
1 reply · 3 reposts · 39 likes
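As a rough illustration of the structured decomposition mentioned above, an abstract concept such as "maze" can be expanded into grounded sub-questions that a VLM answers one at a time, with the answers then combined. The schema, the query_vlm placeholder, and the min-style aggregation below are all assumptions for the sketch, not the paper's implementation.

maze_schema = {
    "concept": "maze",
    "components": ["walls or barriers", "paths between barriers",
                   "an entrance", "an exit or goal"],
    "relation": "paths wind between barriers and connect the entrance to the goal",
}

def query_vlm(image, question):
    """Placeholder: a real system would call a vision-language model here."""
    return 0.5  # probability-like score in [0, 1]

def grounded_score(image, schema):
    component_scores = [query_vlm(image, f"Does the image contain {c}?")
                        for c in schema["components"]]
    relation_score = query_vlm(image, f"Is it true that {schema['relation']}?")
    # Illustrative aggregation: the concept is only as grounded as its weakest part.
    return min(component_scores + [relation_score])

print(grounded_score(image=None, schema=maze_schema))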
State classification of objects and their relations (e.g. the cup is next to the plate) is core to many tasks like robot planning and manipulation. But dynamic real-world environments often require models to generalize to novel predicates from few examples. We present PHIER, a […]
4 replies · 5 reposts · 56 likes
🔥Spatial intelligence requires world generation, and now we have the first comprehensive evaluation benchmark📏 for it! Introducing WorldScore: Unifying evaluation for 3D, 4D, and video models on world generation! 🧵1/7 Web: https://t.co/WnKPf8uarw arxiv: https://t.co/EPLM1xTLwP
6 replies · 89 reposts · 242 likes
🔥Want to capture 3D dancing fluids♨️🌫️🌪️💦? No specialized equipment, just one video! Introducing FluidNexus: Now you only need one camera to reconstruct 3D fluid dynamics and predict future evolution! 🧵1/4 Web: https://t.co/DsxWBo8pgX Arxiv: https://t.co/U1O8qpXycH
5 replies · 73 reposts · 109 likes
Spatial reasoning is a major challenge for today's foundation models, even in simple tasks like arranging objects in 3D space. #CVPR2025 Introducing LayoutVLM, a differentiable optimization framework that uses a VLM to reason spatially about diverse scene layouts from unlabeled […]
4 replies · 61 reposts · 246 likes
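A small sketch of the general pattern of differentiable layout optimization under assumed inputs: a VLM proposes pairwise spatial relations, each relation becomes a differentiable cost over object positions, and gradient descent refines the layout. The constraints and target distances below are hypothetical, not the paper's code.

import torch

# Hypothetical VLM-proposed relations: (object_a, object_b, target distance in meters).
constraints = [("chair", "table", 0.6), ("lamp", "chair", 1.5)]
names = ["table", "chair", "lamp"]
idx = {n: i for i, n in enumerate(names)}
pos = torch.nn.Parameter(torch.randn(len(names), 2) * 0.1)  # (x, y) per object

opt = torch.optim.Adam([pos], lr=0.05)
for _ in range(300):
    # Penalize deviation of each pairwise distance from its target value.
    loss = sum((torch.norm(pos[idx[a]] - pos[idx[b]]) - d) ** 2
               for a, b, d in constraints)
    opt.zero_grad()
    loss.backward()
    opt.step()

print({n: [round(v, 2) for v in pos[idx[n]].detach().tolist()] for n in names})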
🤖 Ever wondered what robots need to truly help humans around the house? 🏡 Introducing 𝗕𝗘𝗛𝗔𝗩𝗜𝗢𝗥 𝗥𝗼𝗯𝗼𝘁 𝗦𝘂𝗶𝘁𝗲 (𝗕𝗥𝗦)—a comprehensive framework for mastering mobile whole-body manipulation across diverse household tasks! 🧹🫧 From taking out the trash to […]
18 replies · 138 reposts · 418 likes
🚀Two weeks ago, we hosted a welcome party for the newest member of our Stanford Vision and Learning Lab—a new robot! 🤖✨Watch as @drfeifei interacts with it in this fun video. Exciting release coming soon. Stay tuned! 👀🎉
9 replies · 27 reposts · 212 likes
Excited to bring back the 2nd Workshop on Visual Concepts at @CVPR 2025, this time with a call for papers! We welcome submissions on the following topics. See our website for more info: https://t.co/gk0NgYAcEx Join us & a fantastic lineup of speakers in Tennessee!
1 reply · 25 reposts · 138 likes
🤩Forget MoCap -- Let’s generate human interaction motions with *Real-world 3D scenes*!🏃🏞️ Introducing ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation. No training, No MoCap data! 🧵1/5 Web: https://t.co/Vn5TMJKvLf
11 replies · 63 reposts · 270 likes
1/ [NeurIPS D&B] Introducing HourVideo: A benchmark for hour-long video-language understanding!🚀 500 egocentric videos, 18 total tasks & ~13k questions! Performance: GPT-4➡️25.7% Gemini 1.5 Pro➡️37.3% Humans➡️85.0% We highlight a significant gap in multimodal capabilities🧵👇
3 replies · 53 reposts · 185 likes
💫🪑Introducing IKEA Manuals at Work: The first multimodal dataset with extensive 4D groundings of assembly in internet videos! We track furniture parts’ 6-DoF poses and segmentation masks through the assembly process, revealing how parts connect in both 2D and 3D space. With […]
5 replies · 44 reposts · 172 likes
[NeurIPS D&B Oral] Embodied Agent Interface: Benchmarking LLMs for Embodied Agents. A single line of code to evaluate your model! 🌟Standardize Goal Specifications: LTL 🌟Standardize Modules and Interfaces: 4 modules, 438 tasks, 1475 goals 🌟Standardize Fine-grained Metrics: 18 […]
5 replies · 68 reposts · 279 likes
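To make the "single line of code" claim concrete, here is a purely hypothetical sketch of what such an evaluation entry point could look like; the function name, module names in code, and scoring are placeholders rather than the actual Embodied Agent Interface API.

def evaluate(model_fn, modules=("goal_interpretation", "subgoal_decomposition",
                                "action_sequencing", "transition_modeling")):
    """Run a model callable over each standardized module and report a score."""
    # A real benchmark would load its 438 tasks / 1475 LTL goals here; this
    # toy version scores the model on one hand-written probe per module.
    probes = {m: ("dummy input", "dummy reference") for m in modules}
    results = {}
    for module, (inp, ref) in probes.items():
        results[module] = float(model_fn(module, inp) == ref)
    return results

# One line to evaluate a (trivial) model:
print(evaluate(lambda module, inp: "dummy reference"))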
Training RL/robot policies requires extensive experience in the target environment, which is often difficult to obtain. How can we “distill” embodied policies from foundational models? Introducing FactorSim! #NeurIPS2024 We show that by generating prompt-aligned simulations and […]
2 replies · 45 reposts · 212 likes
Why hand-engineer digital twins when digital cousins are free? Check out ACDC: Automated Creation of Digital Cousins 👭 for Robust Policy Learning, accepted at @corl2024! 🎉 📸 Single image -> 🏡 Interactive scene ⏩ Fully automatic (no annotations needed!) 🦾 Robot policies […]
11 replies · 40 reposts · 162 likes