Jacky Kwok
@jackyk02
Stanford CS PhD | Berkeley EECS
Palo Alto, CA · Joined June 2025
Followers: 123 · Following: 46 · Media: 7 · Statuses: 16
✨ Test-Time Scaling for Robotics ✨ Excited to release 🤖 RoboMonkey, which characterizes test-time scaling laws for Vision-Language-Action (VLA) models and introduces a framework that significantly improves the generalization and robustness of VLAs! 🧵(1 / N) 🌐 Website:
Data centers dominate AI, but they're hitting physical limits. What if the future of AI isn't just bigger data centers, but local intelligence in our hands? The viability of local AI depends on intelligence efficiency. To measure this, we propose intelligence per watt (IPW):
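The post is truncated before IPW is defined, so the exact formulation is not shown here. A minimal sketch of how such a metric could be computed, assuming IPW is a task-level capability score divided by average power draw (the function name and inputs are illustrative, not from the linked work):

```python
# Hypothetical sketch of an "intelligence per watt" (IPW) style metric.
# Assumption: IPW = mean capability score / average power in watts.

def intelligence_per_watt(task_scores: list[float], energy_joules: float, duration_s: float) -> float:
    """Capability per watt, assuming IPW = mean task score / average power (W)."""
    avg_power_watts = energy_joules / duration_s        # average power over the evaluation window
    capability = sum(task_scores) / len(task_scores)    # e.g., mean benchmark accuracy
    return capability / avg_power_watts

# Example: a local model averaging 0.62 while drawing 1,800 J over 60 s (30 W).
print(intelligence_per_watt([0.70, 0.54, 0.62], energy_joules=1800.0, duration_s=60.0))
```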
Excited to unveil @nvidia's latest work on #Reasoning Vision–Language–Action (#VLA) models — Alpamayo-R1! Alpamayo-R1 is a new #reasoning VLA architecture featuring a diffusion-based action expert built on top of the #Cosmos-#Reason backbone. It represents one of the core
nvidianews.nvidia.com
NVIDIA today announced it is partnering with Uber to scale the world’s largest level 4-ready mobility network, using the company’s next-generation robotaxi and autonomous delivery fleets, the new...
Thrilled to share that 🤖🐒 RoboMonkey is accepted to #CoRL2025!! See you in Seoul 🇰🇷
Happy to share RoboMonkey, a framework for synthetic data generation + scaling test-time compute for VLAs: Turns out generation (via repeated sampling) and verification (via training a verifier on synthetic data) work well for robotics too! Training the verifier: we sample N
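As a rough illustration of the generate-then-verify recipe described here, a sketch of best-of-N action selection. The `policy.sample` and `verifier.score` interfaces below are hypothetical stand-ins, not the released RoboMonkey API:

```python
import numpy as np

# Minimal sketch: sample N candidate actions from a VLA policy and keep the one
# the learned verifier scores highest. Interfaces are assumed for illustration.

def best_of_n(policy, verifier, observation, instruction, n: int = 16) -> np.ndarray:
    """Sample N candidate actions and return the verifier's top-scoring one."""
    candidates = [policy.sample(observation, instruction) for _ in range(n)]
    scores = [verifier.score(observation, instruction, a) for a in candidates]
    return candidates[int(np.argmax(scores))]
```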
This work was an awesome collaboration between Stanford, UC Berkeley, and NVIDIA. It was made possible by an incredible team: @agiachris @RohanSinhaSU @MatthewFoutter @depetrol1 and amazing advisors: @drmapavone @Azaliamirh @istoica05
📋 Takeaways Rather than framing robot control as a generation problem, we suggest that viewing it through the lens of sampling and verification—generating diverse action candidates and verifying them—can be an effective path towards general-purpose robotics foundation models.
🧵(9 / N) To enable practical deployment for test-time scaling, we implemented a VLA serving engine on top of SGLang to speed up 🚀 repeated sampling of initial action candidates, and we employ Gaussian perturbation to efficiently construct an action proposal distribution.
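A sketch of the Gaussian-perturbation step described here, i.e., cheaply expanding a few VLA samples into a larger pool of action candidates. The noise scale and array shapes are illustrative assumptions; the SGLang-based serving engine itself is not shown:

```python
import numpy as np

def expand_with_gaussian_noise(initial_actions: np.ndarray,
                               num_candidates: int,
                               sigma: float = 0.01,
                               rng: np.random.Generator | None = None) -> np.ndarray:
    """Expand a small batch of sampled actions into a larger candidate pool.

    initial_actions: (k, action_dim) actions sampled from the VLA.
    Returns (num_candidates, action_dim) candidates obtained by perturbing the
    initial samples with isotropic Gaussian noise (sigma is an illustrative value).
    """
    rng = rng or np.random.default_rng()
    k, _ = initial_actions.shape
    base = initial_actions[rng.integers(0, k, size=num_candidates)]   # resample with replacement
    noise = rng.normal(0.0, sigma, size=base.shape)                   # cheap local perturbation
    return base + noise
```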
🧵(8 / N) Scaling the synthetic dataset size (number of action comparisons) 📈 consistently improves the performance of the RoboMonkey verifier, leading to higher closed-loop success rates on SIMPLER.
🧵(7 / N) We find that RoboMonkey effectively mitigates issues of imprecise grasping, task progression failures, and collisions at deployment. Detailed task breakdowns and failure analysis are provided on our project website: https://t.co/xFKjDxRcRD.
🧵(6 / N) Eval: We show that pairing existing VLAs with RoboMonkey yields significant performance gains 🦾 achieving a 25% absolute improvement on real-world out-of-distribution tasks, 9% on in-distribution SIMPLER environments, and 7% on the LIBERO-Long benchmark.
🧵(5 / N) Scaling: At deployment, we sample a small batch of actions from a policy. We use Gaussian perturbation and majority voting to efficiently generate more action candidates based on the initial samples. Finally, the VLM-based verifier is used to select the optimal action.
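Putting the deployment-time steps of this post together in a single sketch, under assumed interfaces. Where majority voting applies is an assumption here (shown only for a binary gripper dimension); the released implementation may handle it differently:

```python
import numpy as np

def propose_and_select(policy, verifier, obs, instruction,
                       k: int = 5, num_candidates: int = 32, sigma: float = 0.01) -> np.ndarray:
    """Sketch of the deployment loop: sample, expand, vote, verify (interfaces assumed).

    1. Sample a small batch of k actions from the VLA policy.
    2. Expand them into a larger candidate pool via Gaussian perturbation.
    3. Majority-vote the (assumed) binary gripper dimension, taken as the last index.
    4. Let the VLM-based verifier select the highest-scoring candidate.
    """
    rng = np.random.default_rng()
    initial = np.stack([policy.sample(obs, instruction) for _ in range(k)])  # (k, action_dim)

    # Expand: perturb randomly chosen initial samples with Gaussian noise.
    base = initial[rng.integers(0, k, size=num_candidates)]
    candidates = base + rng.normal(0.0, sigma, size=base.shape)

    # Majority vote on the assumed binary gripper dimension.
    gripper_open = np.round(initial[:, -1]).mean() >= 0.5
    candidates[:, -1] = float(gripper_open)

    # Verifier selects the best candidate.
    scores = [verifier.score(obs, instruction, a) for a in candidates]
    return candidates[int(np.argmax(scores))]
```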
🧵(4 / N) Training: Given a robotics dataset, we sample N actions per state from a policy. We construct synthetic action preferences based on the RMSE between each sampled action and the ground-truth action. This dataset is then used to fine-tune a VLM-based action verifier.
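A sketch of the synthetic-preference construction this post describes: sample N actions per state, rank them by RMSE to the ground-truth action, and emit (chosen, rejected) pairs for fine-tuning the verifier. The all-pairs strategy and function names are illustrative assumptions:

```python
import itertools
import numpy as np

def rmse(a: np.ndarray, b: np.ndarray) -> float:
    """Root-mean-square error between two action vectors."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def synthetic_preferences(sampled_actions: np.ndarray, gt_action: np.ndarray):
    """Build (chosen, rejected) action pairs from N sampled actions for one state.

    The action closer to the ground truth (lower RMSE) is preferred. Pairing every
    combination of samples is an illustrative choice; the actual dataset
    construction may subsample pairs.
    """
    errors = [rmse(a, gt_action) for a in sampled_actions]
    pairs = []
    for i, j in itertools.combinations(range(len(sampled_actions)), 2):
        if errors[i] == errors[j]:
            continue  # skip ties: no preference signal
        chosen, rejected = (i, j) if errors[i] < errors[j] else (j, i)
        pairs.append((sampled_actions[chosen], sampled_actions[rejected]))
    return pairs
```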
🧵(3 / N) Core Questions:
1️⃣ Can we capitalize on these scaling laws with a learned action verifier to improve policy robustness?
2️⃣ Can we scale synthetic data to improve verification and downstream tasks?
3️⃣ How do we enable practical deployment for test-time scaling?
🧵(2 / N) Test-time scaling law for VLAs: We observe that action error consistently decreases 📉 as we scale the number of generated actions. Repeatedly sampling actions from robot policies, applying Gaussian perturbation to a few sampled actions, and even random sampling of
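One way to reproduce this kind of scaling curve, assuming access to states with ground-truth actions: for each N, measure the best-achievable action RMSE among N sampled candidates. This is a sketch; the paper's exact evaluation protocol may differ:

```python
import numpy as np

def best_of_n_error(policy, states, gt_actions, ns=(1, 2, 4, 8, 16, 32)) -> dict[int, float]:
    """Estimate how the best-candidate action RMSE shrinks as N grows.

    For each state, sample max(ns) actions once and reuse prefixes of that sample
    set, so the whole curve comes from a single pass over the data.
    """
    max_n = max(ns)
    curve = {n: [] for n in ns}
    for state, gt in zip(states, gt_actions):
        samples = np.stack([policy.sample(state) for _ in range(max_n)])  # (max_n, action_dim)
        errors = np.sqrt(np.mean((samples - gt) ** 2, axis=1))            # RMSE per candidate
        for n in ns:
            curve[n].append(errors[:n].min())                             # best of the first n
    return {n: float(np.mean(v)) for n, v in curve.items()}
```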