Stanford Vision and Learning Lab
@StanfordSVL
Followers: 16K · Following: 329 · Media: 13 · Statuses: 364
SVL is led by @drfeifei @silviocinguetta @jcniebles @jiajunwu_cs and works on machine learning, computer vision, robotics and language
Stanford, CA
Joined September 2014
Introducing Ctrl-VI, a video sampling method allowing for a flexible set of user controls—ranging from coarse but easy-to-specify text prompts to precise camera/object trajectories. (1/n) https://t.co/ZajlgHQOG4
4 replies · 32 reposts · 222 likes
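For readers skimming the thread, here is a minimal sketch of what a mixed control specification in the spirit of Ctrl-VI could look like, combining a coarse text prompt with precise camera and object trajectories in one request. Every class and field name below is hypothetical, not the released interface.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CameraPose:
    position: List[float]   # camera location (x, y, z) in world coordinates
    look_at: List[float]    # point the camera faces

@dataclass
class ObjectTrajectory:
    object_name: str              # which object the waypoints constrain
    waypoints: List[List[float]]  # per-keyframe (x, y, z) positions

@dataclass
class VideoControls:
    text_prompt: str                                # coarse, easy to specify
    camera_path: Optional[List[CameraPose]] = None  # precise camera control
    object_paths: List[ObjectTrajectory] = field(default_factory=list)

controls = VideoControls(
    text_prompt="a red kite drifting over a beach at sunset",
    camera_path=[CameraPose([0.0, 1.5, 0.0], [0.0, 1.5, -5.0]),
                 CameraPose([0.5, 1.5, -1.0], [0.0, 1.5, -6.0])],
    object_paths=[ObjectTrajectory("kite", [[0.0, 3.0, -4.0], [0.2, 3.1, -4.2]])],
)
print(controls.text_prompt, len(controls.object_paths))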
Super excited about this line of work! 🚀 A simple, scalable recipe for training diffusion language models using autoregressive models. We're releasing our tech report, model weights, and inference code!
Introducing RND1, the most powerful base diffusion language model (DLM) to date. RND1 (Radical Numerics Diffusion) is an experimental DLM with 30B params (3B active) and a sparse MoE architecture. We are making it open source, releasing weights, training details, and code to […]
0 replies · 9 reposts · 67 likes
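As background, here is a minimal sketch of the masked-denoising objective commonly used when training diffusion language models on top of an existing transformer backbone: corrupt a random fraction of tokens and train the model to recover them. This is a generic illustration under assumed names, not the RND1 recipe from the tech report.

import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens, mask_id):
    # Sample a corruption level t ~ U(0, 1) per sequence and mask roughly t of its tokens.
    t = torch.rand(tokens.size(0), 1, device=tokens.device)
    corrupt = torch.rand(tokens.shape, device=tokens.device) < t
    corrupt[:, 0] = True  # toy guard: keep at least one masked position per sequence
    noisy = torch.where(corrupt, torch.full_like(tokens, mask_id), tokens)
    logits = model(noisy)                      # (batch, seq, vocab)
    # Score only the masked positions against the original tokens.
    return F.cross_entropy(logits[corrupt], tokens[corrupt])

# Toy stand-in for a pretrained backbone: embedding plus a linear head.
vocab_size, mask_id = 100, 99
model = torch.nn.Sequential(torch.nn.Embedding(vocab_size, 32),
                            torch.nn.Linear(32, vocab_size))
tokens = torch.randint(0, mask_id, (2, 16))
print(masked_diffusion_loss(model, tokens, mask_id))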
Grafting Diffusion Transformers accepted to #NeurIPS2025 as an Oral! We have lots of interesting analysis, a test bed for model grafting, and insights🚀 📄Paper: https://t.co/OjsrOZi7in 🌎Website:
[Link preview from arxiv.org: "Designing model architectures requires decisions such as selecting operators (e.g., attention, convolution) and configurations (e.g., depth, width). However, evaluating the impact of these..."]
1/ Model architectures have been mostly treated as fixed post-training. 🌱 Introducing Grafting: A new way to edit pretrained diffusion transformers, allowing us to customize architectural designs on a small compute budget. 🌎 https://t.co/fjOTVqfVZr Co-led with @MichaelPoli6
7 replies · 38 reposts · 209 likes
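A hedged sketch of the grafting idea described above, using a toy block rather than the paper's code: freeze a pretrained diffusion-transformer block, swap its attention operator for a cheaper one (here a depthwise convolution), and leave only the new operator trainable so it can be adapted on a small compute budget.

import torch
import torch.nn as nn

class ToyDiTBlock(nn.Module):
    """Stand-in for one pretrained diffusion-transformer block."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mixer = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                     # x: (batch, tokens, dim)
        h = self.norm1(x)
        if isinstance(self.mixer, nn.MultiheadAttention):
            h, _ = self.mixer(h, h, h)
        else:                                 # grafted operator mixes over tokens
            h = self.mixer(h.transpose(1, 2)).transpose(1, 2)
        x = x + h
        return x + self.mlp(self.norm2(x))

block = ToyDiTBlock()
for p in block.parameters():                  # freeze the "pretrained" weights
    p.requires_grad = False

# Graft: replace self-attention with a depthwise 1D convolution over tokens.
block.mixer = nn.Conv1d(64, 64, kernel_size=5, padding=2, groups=64)
trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
print(block(torch.randn(1, 16, 64)).shape, trainable, "trainable params")

In the actual work the grafted operator is then trained or distilled; this sketch only shows the swap-and-freeze pattern.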
#ICCV2025 🤩3D world generation is cool, but it is cooler to play with the worlds using 3D actions 👆💨, and see what happens! — Introducing *WonderPlay*: Now you can create dynamic 3D scenes that respond to your 3D actions from a single image! Web: https://t.co/uFOzA8t0P8 🧵1/7
6 replies · 42 reposts · 184 likes
(1/n) Time to unify your favorite visual generative models, VLMs, and simulators for controllable visual generation—Introducing a Product of Experts (PoE) framework for inference-time knowledge composition from heterogeneous models.
5 replies · 70 reposts · 305 likes
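A minimal sketch of product-of-experts composition as referenced above: each expert scores the same candidates, and the composed distribution is the normalized weighted product p(x) ∝ ∏_i p_i(x)^{w_i}, i.e. a weighted sum of log-probabilities. The expert roles and numbers below are made up for illustration, not taken from the paper.

import numpy as np

def product_of_experts(log_probs, weights):
    """log_probs: (num_experts, num_candidates); weights: (num_experts,)."""
    combined = (np.asarray(weights)[:, None] * np.asarray(log_probs)).sum(axis=0)
    combined -= combined.max()                 # subtract max for numerical stability
    p = np.exp(combined)
    return p / p.sum()

# e.g. a video generator, a VLM verifier, and a simulator each scoring
# four candidate continuations (hypothetical probabilities).
experts = np.log(np.array([[0.40, 0.30, 0.20, 0.10],
                           [0.10, 0.50, 0.30, 0.10],
                           [0.25, 0.25, 0.25, 0.25]]))
print(product_of_experts(experts, weights=[1.0, 1.0, 0.5]))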
1/ Model architectures have been mostly treated as fixed post-training. 🌱 Introducing Grafting: A new way to edit pretrained diffusion transformers, allowing us to customize architectural designs on a small compute budget. 🌎 https://t.co/fjOTVqfVZr Co-led with @MichaelPoli6
13 replies · 58 reposts · 243 likes
We'll be presenting Deep Schema Grounding at @iclr_conf 🇸🇬 on Thursday (session 1 #98). Come chat about abstract visual concepts, structured decomposition, & what makes a maze a maze! & test your models on our challenging Visual Abstractions Benchmark:
What makes a maze look like a maze? Humans can reason about infinitely many instantiations of mazes—made of candy canes, sticks, icing, yarn, etc. But VLMs often struggle to make sense of such visual abstractions. We improve VLMs' ability to interpret these abstract concepts.
1 reply · 3 reposts · 39 likes
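As a rough illustration of the structured decomposition mentioned above, an abstract concept such as "maze" can be expanded into grounded sub-questions that a VLM answers one at a time, with the answers then combined. The schema, the query_vlm placeholder, and the min-style aggregation below are all assumptions for the sketch, not the paper's implementation.

maze_schema = {
    "concept": "maze",
    "components": ["walls or barriers", "paths between barriers",
                   "an entrance", "an exit or goal"],
    "relation": "paths wind between barriers and connect the entrance to the goal",
}

def query_vlm(image, question):
    """Placeholder: a real system would call a vision-language model here."""
    return 0.5  # probability-like score in [0, 1]

def grounded_score(image, schema):
    component_scores = [query_vlm(image, f"Does the image contain {c}?")
                        for c in schema["components"]]
    relation_score = query_vlm(image, f"Is it true that {schema['relation']}?")
    # Illustrative aggregation: the concept is only as grounded as its weakest part.
    return min(component_scores + [relation_score])

print(grounded_score(image=None, schema=maze_schema))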
State classification of objects and their relations (e.g. the cup is next to the plate) is core to many tasks like robot planning and manipulation. But dynamic real-world environments often require models to generalize to novel predicates from few examples. We present PHIER, a […]
4 replies · 5 reposts · 56 likes
🔥Spatial intelligence requires world generation, and now we have the first comprehensive evaluation benchmark📏 for it! Introducing WorldScore: Unifying evaluation for 3D, 4D, and video models on world generation! 🧵1/7 Web: https://t.co/WnKPf8uarw arxiv: https://t.co/EPLM1xTLwP
6 replies · 89 reposts · 242 likes
🔥Want to capture 3D dancing fluids♨️🌫️🌪️💦? No specialized equipment, just one video! Introducing FluidNexus: Now you only need one camera to reconstruct 3D fluid dynamics and predict future evolution! 🧵1/4 Web: https://t.co/DsxWBo8pgX Arxiv: https://t.co/U1O8qpXycH
5 replies · 73 reposts · 109 likes
Spatial reasoning is a major challenge for today's foundation models, even in simple tasks like arranging objects in 3D space. #CVPR2025 Introducing LayoutVLM, a differentiable optimization framework that uses a VLM to reason spatially about diverse scene layouts from unlabeled […]
4 replies · 61 reposts · 246 likes
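A small sketch of the general pattern of differentiable layout optimization under assumed inputs: a VLM proposes pairwise spatial relations, each relation becomes a differentiable cost over object positions, and gradient descent refines the layout. The constraints and target distances below are hypothetical, not the paper's code.

import torch

# Hypothetical VLM-proposed relations: (object_a, object_b, target distance in meters).
constraints = [("chair", "table", 0.6), ("lamp", "chair", 1.5)]
names = ["table", "chair", "lamp"]
idx = {n: i for i, n in enumerate(names)}
pos = torch.nn.Parameter(torch.randn(len(names), 2) * 0.1)  # (x, y) per object

opt = torch.optim.Adam([pos], lr=0.05)
for _ in range(300):
    # Penalize deviation of each pairwise distance from its target value.
    loss = sum((torch.norm(pos[idx[a]] - pos[idx[b]]) - d) ** 2
               for a, b, d in constraints)
    opt.zero_grad()
    loss.backward()
    opt.step()

print({n: [round(v, 2) for v in pos[idx[n]].detach().tolist()] for n in names})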
🤖 Ever wondered what robots need to truly help humans around the house? 🏡 Introducing 𝗕𝗘𝗛𝗔𝗩𝗜𝗢𝗥 𝗥𝗼𝗯𝗼𝘁 𝗦𝘂𝗶𝘁𝗲 (𝗕𝗥𝗦)—a comprehensive framework for mastering mobile whole-body manipulation across diverse household tasks! 🧹🫧 From taking out the trash to […]
18 replies · 138 reposts · 418 likes
🚀Two weeks ago, we hosted a welcome party for the newest member of our Stanford Vision and Learning Lab—a new robot! 🤖✨Watch as @drfeifei interacts with it in this fun video. Exciting release coming soon. Stay tuned! 👀🎉
9 replies · 27 reposts · 212 likes
Excited to bring back the 2nd Workshop on Visual Concepts at @CVPR 2025, this time with a call for papers! We welcome submissions on the following topics. See our website for more info: https://t.co/gk0NgYAcEx Join us & a fantastic lineup of speakers in Tennessee!
1 reply · 25 reposts · 138 likes
🤩Forget MoCap -- Let’s generate human interaction motions with *Real-world 3D scenes*!🏃🏞️ Introducing ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation. No training, No MoCap data! 🧵1/5 Web: https://t.co/Vn5TMJKvLf
11 replies · 63 reposts · 270 likes
1/ [NeurIPS D&B] Introducing HourVideo: A benchmark for hour-long video-language understanding!🚀 500 egocentric videos, 18 total tasks & ~13k questions! Performance: GPT-4➡️25.7% Gemini 1.5 Pro➡️37.3% Humans➡️85.0% We highlight a significant gap in multimodal capabilities🧵👇
3 replies · 53 reposts · 185 likes
💫🪑Introducing IKEA Manuals at Work: The first multimodal dataset with extensive 4D groundings of assembly in internet videos! We track furniture parts’ 6-DoF poses and segmentation masks through the assembly process, revealing how parts connect in both 2D and 3D space. With […]
5 replies · 44 reposts · 172 likes
[NeurIPS D&B Oral] Embodied Agent Interface: Benchmarking LLMs for Embodied Agents. A single line of code to evaluate your model! 🌟Standardize Goal Specifications: LTL 🌟Standardize Modules and Interfaces: 4 modules, 438 tasks, 1475 goals 🌟Standardize Fine-grained Metrics: 18 […]
5 replies · 68 reposts · 279 likes
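To make the "single line of code" claim concrete, here is a purely hypothetical sketch of what such an evaluation entry point could look like; the function name, module names in code, and scoring are placeholders rather than the actual Embodied Agent Interface API.

def evaluate(model_fn, modules=("goal_interpretation", "subgoal_decomposition",
                                "action_sequencing", "transition_modeling")):
    """Run a model callable over each standardized module and report a score."""
    # A real benchmark would load its 438 tasks / 1475 LTL goals here; this
    # toy version scores the model on one hand-written probe per module.
    probes = {m: ("dummy input", "dummy reference") for m in modules}
    results = {}
    for module, (inp, ref) in probes.items():
        results[module] = float(model_fn(module, inp) == ref)
    return results

# One line to evaluate a (trivial) model:
print(evaluate(lambda module, inp: "dummy reference"))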
Training RL/robot policies requires extensive experience in the target environment, which is often difficult to obtain. How can we “distill” embodied policies from foundational models? Introducing FactorSim! #NeurIPS2024 We show that by generating prompt-aligned simulations and […]
2 replies · 45 reposts · 212 likes
Why hand-engineer digital twins when digital cousins are free? Check out ACDC: Automated Creation of Digital Cousins 👭 for Robust Policy Learning, accepted at @corl2024! 🎉 📸 Single image -> 🏡 Interactive scene ⏩ Fully automatic (no annotations needed!) 🦾 Robot policies […]
11 replies · 40 reposts · 162 likes