Ajay Sridhar
@ajaysridhar0
Followers
298
Following
570
Media
6
Statuses
34
cs phd student @StanfordAILab
Joined June 2023
VLAs are great, but most lack the long-term memory humans use for everyday tasks. This is a critical gap for solving complex, long-horizon problems. Introducing MemER: Scaling Up Memory for Robot Control via Experience Retrieval. A thread 🧵 (1/8)
5
42
306
How can we create a single navigation policy that works for different robots in diverse environments AND can reach navigation goals with high precision? Happy to share our new paper, "VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable…
4
40
120
Had fun working on this project with my co-lead @jenpan_, @satviks107, and @chelseabfinn! Paper: https://t.co/2eRYdwzvbS Website: https://t.co/z8c6SeGaIy (8/8)
arxiv.org
Humans routinely rely on memory to perform tasks, yet most robot policies lack this capability; our goal is to endow robot policies with the same ability. Naively conditioning on long observation...
0
2
9
Robots need memory to handle complex, multi-step tasks. Can we design an effective method for this? We propose MemER, a hierarchical VLA policy that learns what visual frames to remember across multiple long-horizon tasks, enabling memory-aware manipulation. (1/5)
2
33
172
What about just using a massive proprietary VLM like GPT-5 as the high-level policy?
1. Latency: at 10-15 seconds per query, it is far too slow for real-time robot control.
2. Accuracy: even in an offline test, it was significantly less accurate than our finetuned model. (7/8)
1
0
9
Does memory have to be visual? We evaluated storing text (past subtasks) instead. Relying on text memory (Short History + Text) led to a significant performance drop. Even adding text to our visual memory (MemER + Text) hurt, as the model over-indexed on the text. (6/8)
1
0
7
Baselines with no history or short history (8 frames) perform poorly. Naively using a long history (32 frames) helps but scales poorly and still performs worse than our method. MemER, by selectively retrieving keyframes, matches the oracle human HL policy. (5/8)
1
0
7
We finetune Qwen2.5-VL-7B-Instruct as the high-level policy and pi0.5 as the low-level policy. With just 50 demos per task, MemER solves three real-world long-horizon tasks requiring minutes of memory: Object search, Counting, and Dust & Replace. (4/8)
1
0
9
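To make the setup above concrete, here is a hedged sketch of what one high-level training example might look like, inferred only from this thread (frames and memory in, keyframe choices and a subtask out). The field names, file names, and schema below are my assumptions, not the actual dataset format.

```python
# Hypothetical sketch of one finetuning example for the high-level VLM.
# Inputs: the task instruction, previously kept keyframes, and a short
# window of recent frames. Targets: which new frames to remember and the
# next language subtask for the low-level policy. Schema is assumed.
example = {
    "instruction": "find the red mug",               # long-horizon task
    "memory_frames": ["kf_012.jpg", "kf_047.jpg"],   # keyframes kept so far
    "recent_frames": [f"obs_{i:03d}.jpg" for i in range(100, 108)],
    "target": {
        "keep_keyframes": ["obs_105.jpg"],           # frames worth remembering
        "subtask": "open the middle drawer",         # next language subtask
    },
}
```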
MemER is hierarchical:
1️⃣ A High-Level Policy (VLM) manages all memory. It watches recent frames & selects task-relevant "keyframes" to store.
2️⃣ It uses this memory to predict language subtasks.
3️⃣ A Low-Level Policy (VLA) gets no memory & just executes the subtask. (3/8)
1
1
17
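A minimal Python sketch of the three-step loop described above. Every name here (Frame, high_level_vlm, low_level_vla) is a hypothetical stand-in so the sketch runs end to end; it is not MemER's released code or API.

```python
# Sketch of a MemER-style hierarchical control loop, per the thread's
# description. The stubs return dummy values so the loop actually runs.
from typing import List, Tuple

Frame = bytes  # placeholder type for an image observation

def high_level_vlm(memory: List[Frame], recent: List[Frame]) -> Tuple[List[Frame], str]:
    # Stand-in for the finetuned VLM: from memory + recent frames, choose
    # task-relevant keyframes to store and predict the next subtask.
    keyframes = recent[-1:]            # dummy choice: keep the newest frame
    return keyframes, "pick up the sponge"

def low_level_vla(subtask: str, observation: Frame) -> str:
    # Stand-in for the VLA: sees only the subtask and current observation,
    # never the memory, and outputs a motor action.
    return f"action for '{subtask}'"

memory: List[Frame] = []               # keyframes selected so far
recent: List[Frame] = []               # short rolling window (e.g. 8 frames)

for t in range(3):                      # stand-in control loop
    recent = (recent + [b"frame%d" % t])[-8:]
    new_keyframes, subtask = high_level_vlm(memory, recent)
    memory.extend(new_keyframes)        # 1) the VLM curates visual memory
    action = low_level_vla(subtask, recent[-1])  # 2)+3) subtask -> action
    print(t, subtask, action)
```

The design point worth noticing is the asymmetry: only the high-level VLM ever sees the memory, so the low-level policy stays cheap and reactive.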
Why not feed the policy a long video history? This is computationally expensive and brittle under covariate shift during deployment. Key idea: the most relevant information is often in just a few keyframes. (2/8)
1
0
14
Rollouts in the real world are slow and expensive. What if we could roll out trajectories entirely inside a world model (WM)? Introducing 🚀Ctrl-World🚀, a generative manipulation WM that can interact with advanced VLA policies in imagination. 🧵1/6
5
40
210
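For intuition, here is a hedged sketch of a policy-in-imagination rollout as the tweet describes it: the VLA acts, and a learned world model, not the real robot, produces the next observation. world_model_step and vla_policy are hypothetical stand-ins, not Ctrl-World's actual interface.

```python
# Sketch: evaluating a VLA policy entirely inside a generative world model.
# Both functions are dummy stand-ins so the rollout runs as written.

def world_model_step(obs, action):
    # Stand-in for the world model: predict the next observation
    # conditioned on the current one and the policy's action.
    return obs + [action]

def vla_policy(obs, instruction):
    # Stand-in for an off-the-shelf VLA policy under evaluation.
    return f"act@{len(obs)}"

def imagined_rollout(init_obs, instruction, horizon=50):
    obs, trajectory = init_obs, []
    for _ in range(horizon):
        action = vla_policy(obs, instruction)   # the policy acts...
        obs = world_model_step(obs, action)     # ...the WM "renders" the result
        trajectory.append((action, obs))
    return trajectory                           # zero real-robot time spent

traj = imagined_rollout(init_obs=[], instruction="stack the cups")
```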
We trained OmniVLA, a robotic foundation model for navigation conditioned on language, goal poses, and images. Initialized with OpenVLA, it leverages Internet-scale knowledge for strong OOD performance. Great collaboration with @CatGlossop, @shahdhruv_, and @svlevine.
6
64
328
Exploration is fundamental to RL. Yet policy gradient methods often collapse: during training they fail to explore broadly, and converge into narrow, easily exploitable behaviors. The result is poor generalization, limited gains from test-time scaling, and brittleness on tasks…
18
136
1K
Can we teach dexterous robot hands manipulation without human demos or hand-crafted rewards? Our key insight: Use Vision-Language Models (VLMs) to scaffold coarse motion plans, then train an RL agent to execute them with 3D keypoints as the interface. 1/7
1
14
64
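A hedged sketch of the interface the tweet describes: the VLM writes a coarse plan as target 3D keypoint positions, and the RL agent is rewarded for tracking it. The (T, K, 3) waypoint layout and the distance-based reward are my assumptions, not the paper's exact formulation.

```python
# Sketch: VLM-scaffolded plan over 3D keypoints, consumed as an RL reward.
import numpy as np

def vlm_coarse_plan(task: str) -> np.ndarray:
    # Stand-in for the VLM planner: (T, K, 3) waypoints for K keypoints.
    return np.zeros((10, 4, 3))

def keypoint_tracking_reward(tracked: np.ndarray, plan: np.ndarray, t: int) -> float:
    # Dense reward: negative mean distance from the tracked keypoints
    # (shape (K, 3)) to the plan's waypoint at step t. One simple choice.
    return -float(np.linalg.norm(tracked - plan[t], axis=-1).mean())

plan = vlm_coarse_plan("reorient the cube")
r = keypoint_tracking_reward(np.ones((4, 3)), plan, t=0)
```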
Meet ProVox: a proactive robot teammate that gets you 🤖❤️🔥 ProVox models your goals and expectations before a task starts — enabling personalized, proactive help for smoother, more natural collaboration. All powered by LLM commonsense. Recently accepted at @ieeeras R-AL! 🧵1/7
3
14
69
Can we collect robot dexterous hand data directly with a human hand? Introducing DexUMI: a 0-teleoperation, 0-re-targeting dexterous hand data collection system → autonomously completes precise, long-horizon, and contact-rich tasks. Project Page: https://t.co/z3CUS5wgMx
9
51
254
Robotic models are advancing rapidly, but how do we scale their improvement? 🤖 We propose a recipe for batch online RL (train offline with online rollouts) that enables policies to self-improve without the complications of online RL. More: https://t.co/vr1IZqbAiq (1/8)
1
16
110
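A hedged sketch of the batch online RL recipe as stated in the tweet: alternate between collecting a batch of rollouts with the current frozen policy and retraining offline on everything gathered so far. All names below are stand-ins; see the linked paper for the actual recipe.

```python
# Sketch: batch online RL ("train offline with online rollouts").
# Dummy stand-ins keep the loop runnable as written.

def collect_rollouts(policy, n_episodes):
    # Stand-in: run the frozen policy on the robot and record trajectories.
    return [("traj", policy, i) for i in range(n_episodes)]

def train_offline(dataset):
    # Stand-in: fit a new policy on the aggregated dataset
    # (e.g. offline RL or filtered behavior cloning).
    return f"policy_after_{len(dataset)}_trajs"

dataset, policy = [], "initial_policy"
for round_idx in range(3):                              # batch online iterations
    dataset += collect_rollouts(policy, n_episodes=20)  # online rollouts
    policy = train_offline(dataset)                     # offline training
```

This sidesteps the instability of fully online RL because the policy is never updated mid-deployment; each rollout batch uses a fixed snapshot.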
We released our most recent research on learning a generalized visual navigation policy from scalable but low-quality, action-free passive data sources. Our policy can navigate various robots in diverse environments. web: https://t.co/ghjmJp7f8K paper: https://t.co/LzaMKWMflA
7
9
36
Trained a robot policy and want to measure generalization? Generalization evals vary across studies, and this makes progress hard to track. Enter ★-Gen: a taxonomy of generalization for manipulation that guides evaluation design and fosters new benchmarks. 🧵⬇️ 1/8
3
15
104
Current RL finetuning methods are too inefficient to make autonomous real-world robot learning tractable. We propose Simulation-Guided Fine-Tuning (SGFT) - a simple, general sim2real framework that extracts structured exploration priors from sim to accelerate real-world RL. 🧵1/6
1
26
118
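The tweet only says SGFT extracts structured exploration priors from simulation, so as one plausible (assumed) instantiation, here is a sketch of potential-based reward shaping with a sim-trained value function, which steers real-world exploration toward states the sim rates highly without changing the optimal policy.

```python
# Sketch: sim-derived prior guiding real-world RL via potential-based
# shaping. Whether this matches SGFT's exact mechanism is an assumption.

def v_sim(state) -> float:
    # Stand-in for a value function learned in simulation.
    return -abs(float(state))

def shaped_reward(r_real: float, s, s_next, gamma: float = 0.99) -> float:
    # Potential-based shaping preserves optimal policies while adding a
    # dense signal that pulls exploration toward sim-promising states.
    return r_real + gamma * v_sim(s_next) - v_sim(s)

print(shaped_reward(0.0, s=2.0, s_next=1.0))  # positive: moved toward the goal
```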