Ankit Goyal
@imankitgoyal
Followers
3K
Following
321
Media
38
Statuses
199
Foundation Models for Robotics, Nvidia Research, Princeton PhD
Seattle, WA
Joined March 2020
What's the right architecture for a VLA? VLM + custom action heads (π₀)? VLM with special discrete action tokens (OpenVLA)? Custom design on top of the VLM (OpenVLA-OFT)? Or... VLM with ZERO modifications? Just predict action as text. The results will surprise you. VLA-0:
18
70
533
Happy to share that the code for VLA-0 is out now: https://t.co/Vg8wsCSIPQ Given its simplicity, it’s a great starting point to try out VLAs!
github.com
VLA-0: Building State-of-the-Art VLAs with Zero Modification - NVlabs/vla0
0
1
5
To my friends and family in India: please raise your voice and DEMAND clean air! It is your fundamental right. Think about the youngest member of your family. What have they done to lose years of their life just because they were born in India? Enough of ignorance.
0
0
9
The launch of the first humanoid for consumers, Neo-X, is truly exciting! Many are claiming this means robot learning is solved and that 1X has leapfrogged everyone else, but the real picture is much more nuanced. From a hardware and platform perspective, it looks incredibly
1
0
14
Had a great time guest lecturing in @YuXiang_IRVL's course on Vision-Language-Action (VLA) models. Check out the full recording 👇
Are you interested in Vision-Language-Action (VLA) models? We had an excellent guest lecture today by Ankit Goyal @imankitgoyal from NVIDIA on VLAs and their role in robot manipulation 🎥 Watch the recording here 👇 https://t.co/fhl15iKGr8 Slides: https://t.co/9lxXTXtX9r
1
7
53
🚀 Join us at #ICCV2025 for a full-day workshop: “Learning to See: Advancing Spatial Understanding for Embodied Intelligence” 🗓️ October 19 • 📍 Room 312 Meet our incredible lineup of speakers: @MattNiessner @jiadeng @pulkitology @KaterinaFragiad @YunzhuLiYZ @imankitgoyal
1
8
38
Huge thanks to my incredible collaborators: @HugoHadfield1, Xuning Yang, Valts Blukis, Fabio Ramos, and the amazing teams at NVIDIA @NVIDIARobotics @NVIDIAAI @NVIDIAEmbedded. If you're excited about simple, effective approaches to VLAs: 💻 Code:
github.com
VLA-0: Building State-of-the-Art VLAs with Zero Modification - NVlabs/vla0
2
1
32
How does such a simple architecture achieve this? It's all in the recipe 🔬 Three key techniques: 1️⃣ Action Decoding: represent actions as integers → arbitrary resolution without changing the vocabulary. 2️⃣ Ensemble Prediction: average predictions across timesteps → temporal…
1
0
26
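A minimal sketch of the two techniques named in the tweet above, for anyone who wants to try the idea: continuous actions are mapped to plain integers rendered as text (so any resolution fits the existing vocabulary), and overlapping predictions for the same timestep are averaged. The bin count, normalization range, and function names here are illustrative assumptions, not the exact VLA-0 implementation.

```python
import numpy as np

N_BINS = 1000  # assumed integer resolution; the VLM vocabulary is untouched either way

def encode_action(action, low=-1.0, high=1.0):
    """Map a continuous action vector in [low, high] to space-separated integers."""
    ids = np.round((np.asarray(action) - low) / (high - low) * (N_BINS - 1)).astype(int)
    return " ".join(str(i) for i in ids)          # e.g. "512 487 733 ..."

def decode_action(text, low=-1.0, high=1.0):
    """Parse the VLM's integer text back into a continuous action vector."""
    ids = np.array([int(tok) for tok in text.split()])
    return low + ids / (N_BINS - 1) * (high - low)

def temporal_ensemble(chunks, t):
    """Average every stored chunk's prediction for absolute timestep t.
    chunks: dict mapping chunk start step -> array of shape (horizon, action_dim)."""
    votes = [c[t - s] for s, c in chunks.items() if 0 <= t - s < len(c)]
    return np.mean(votes, axis=0)
```

In use, the controller would execute the decoded action each step while `temporal_ensemble` smooths over the action chunks predicted so far.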
"Does it work on real robots?" YES. ✅ Tested on SO-101 arm: • Block reorientation • Object pushing • Pick & place Outperforms SmolVLA (+12.5 points). Notably: SmolVLA was pretrained on large-scale SO-100 data. VLA-0 trained from scratch on 100 demonstrations per task.
1
0
19
Introducing VLA-0 🚀 The entire architecture: Prompt a VLM to output actions as text. That's it. No new components. No change to VLM vocabulary. On LIBERO benchmark: → #1 among non-pretrained methods → Outperforms π₀.5-KI, OpenVLA-OFT, SmolVLA Even beats models pretrained
2
2
32
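For readers curious what "prompt a VLM to output actions as text" can look like in practice, here is a hedged sketch. `vlm_generate` is a placeholder for any off-the-shelf VLM's generation call, and the prompt wording and 0–999 integer range are assumptions for illustration, not the paper's exact prompt or settings.

```python
import numpy as np

def vlm_generate(image, prompt: str) -> str:
    """Placeholder for an unmodified, off-the-shelf VLM's text generation call."""
    raise NotImplementedError

def predict_action(image, instruction: str, action_dim: int = 7) -> np.ndarray:
    # Query the VLM exactly like a chat model: no new heads, no new tokens.
    prompt = (
        f"Task: {instruction}\n"
        f"Output the next robot action as {action_dim} integers in [0, 999], "
        "separated by spaces."
    )
    text = vlm_generate(image, prompt)
    ids = np.array([int(tok) for tok in text.split()[:action_dim]])
    return ids / 999.0 * 2.0 - 1.0   # map back to a normalized [-1, 1] action
```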
Looking for the latest and greatest in robotic policy learning? Check out 👇 ManiFlow — our new flow/diffusion-based method that combines algorithmic advances like consistency flow matching with architectural innovations such as DiT-X. It achieves very strong results in both sim &
Introduce ManiFlow 🤖, a visual imitation learning policy for general robot manipulation that is efficient, robust, and generalizable: - 98.3% improvement on 8 real-world tasks, generalizing to novel objects & backgrounds - Applied to diverse embodiments: single-arm, bimanual &
1
2
35
Senior Research Scientist: https://t.co/sAnqQAtr5O Research Scientist, New College Grad 2025: https://t.co/CtAUlumCKt Learn more about our team's work:
research.nvidia.com
NVIDIA Robotics
0
0
13
We, at NVIDIA's Seattle Robotics Research team, are hiring. 🤖 We are seeking Senior Research Scientists and New College Graduates (2025) to join us. Some areas of interest include: Vision-Language-Action (VLA) models & bimanual and dexterous manipulation. This is a unique
8
17
331
4. Flowing from Words to Pixels An insight that seems so simple in hindsight: for conditional generation, instead of starting from noise, why not flow directly from the source to the target distribution? I'll be watching closely to see if this becomes the norm. Great work by @Qihao Liu et al.
0
0
2
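To make the "flow from the source instead of noise" insight concrete, here is a toy flow-matching training step in its generic form: couple each source sample directly to a target sample, interpolate along a straight path, and regress the constant velocity. This illustrates the general idea only, not the paper's exact objective or architecture.

```python
import torch

def flow_matching_loss(model, x_src, x_tgt, cond=None):
    """Toy conditional flow-matching step where the path starts at a source sample
    rather than Gaussian noise. `model` predicts a velocity field v(x_t, t, cond)."""
    # One random time per example, broadcastable over the remaining data dims.
    t = torch.rand(x_src.shape[0], *([1] * (x_src.dim() - 1)), device=x_src.device)
    x_t = (1 - t) * x_src + t * x_tgt      # straight-line path: source -> target
    v_target = x_tgt - x_src               # constant velocity along that path
    v_pred = model(x_t, t.flatten(), cond)
    return torch.mean((v_pred - v_target) ** 2)
```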
3. Prompting Depth Anything for 4K Metric Depth It’s a very practical way to get dense and accurate metric depth. It upgrades a monocular depth model for metric accuracy by using data from metric sensors, getting the best of both worlds. Great work by Haotong Lin et al.
1
0
2
2. Reconstructing Animals and the Wild This work generates complete scenes from natural images, trained with just synthetic Infinigen data. While working on Infinigen, I never thought it could be used so creatively. Fantastic work by Peter @Michael_J_Black @silvia_zuffi
2
0
2
That’s a wrap for #CVPR2025! Here's a 🧵 of some really cool works 👇 1. Let Humanoids Hike! Great work by @ky_lin0305 and Stella Xu. They drove home the point that we can't treat locomotion and navigation as separate. The ultimate test: Can your robot complete a hike on its own?
1
0
14