Ajay Sridhar
@ajaysridhar0
Followers
298
Following
570
Media
6
Statuses
34
cs phd student @StanfordAILab
Joined June 2023
VLAs are great, but most lack the long-term memory humans use for everyday tasks. This is a critical gap for solving complex, long-horizon problems. Introducing MemER: Scaling Up Memory for Robot Control via Experience Retrieval. A thread 🧵 (1/8)
5
42
306
How can we create a single navigation policy that works for different robots in diverse environments AND can reach navigation goals with high precision? Happy to share our new paper, "VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable…
4
40
120
Had fun working on this project with my co-lead @jenpan_, @satviks107, and @chelseabfinn! Paper: https://t.co/2eRYdwzvbS Website: https://t.co/z8c6SeGaIy (8/8)
arxiv.org
Humans routinely rely on memory to perform tasks, yet most robot policies lack this capability; our goal is to endow robot policies with the same ability. Naively conditioning on long observation...
0
2
9
Robots need memory to handle complex, multi-step tasks. Can we design an effective method for this? We propose MemER, a hierarchical VLA policy that learns what visual frames to remember across multiple long-horizon tasks, enabling memory-aware manipulation. (1/5)
2
33
172
What about just using a massive proprietary VLM like GPT-5 as the high-level policy?
1. Latency: at 10-15 seconds per query, it is far too slow for real-time robot control.
2. Accuracy: even in an offline test, it was significantly less accurate than our finetuned model. (7/8)
1
0
9
Does memory have to be visual? We evaluated storing text (past subtasks) instead. Relying on text memory (Short History + Text) led to a significant performance drop. Even adding text to our visual memory (MemER + Text) hurt, as the model over-indexed on the text. (6/8)
1
0
7
Baselines with no history or short history (8 frames) perform poorly. Naively using a long history (32 frames) helps but scales poorly and still performs worse than our method. MemER, by selectively retrieving keyframes, matches the oracle human HL policy. (5/8)
1
0
7
We finetune Qwen2.5-VL-7B-Instruct as the high-level policy and pi0.5 as the low-level policy. With just 50 demos per task, MemER solves three real-world long-horizon tasks requiring minutes of memory: Object search, Counting, and Dust & Replace. (4/8)
1
0
9
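To make the setup above concrete, here is a hedged sketch of what one high-level training example might look like, inferred only from this thread (frames and memory in, keyframe choices and a subtask out). The field names, file names, and schema below are my assumptions, not the actual dataset format.

```python
# Hypothetical sketch of one finetuning example for the high-level VLM.
# Inputs: the task instruction, previously kept keyframes, and a short
# window of recent frames. Targets: which new frames to remember and the
# next language subtask for the low-level policy. Schema is assumed.
example = {
    "instruction": "find the red mug",               # long-horizon task
    "memory_frames": ["kf_012.jpg", "kf_047.jpg"],   # keyframes kept so far
    "recent_frames": [f"obs_{i:03d}.jpg" for i in range(100, 108)],
    "target": {
        "keep_keyframes": ["obs_105.jpg"],           # frames worth remembering
        "subtask": "open the middle drawer",         # next language subtask
    },
}
```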
MemER is hierarchical:
1️⃣ A High-Level Policy (VLM) manages all memory. It watches recent frames & selects task-relevant "keyframes" to store.
2️⃣ It uses this memory to predict language subtasks.
3️⃣ A Low-Level Policy (VLA) gets no memory & just executes the subtask. (3/8)
1
1
17
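A minimal Python sketch of the three-step loop described above. Every name here (Frame, high_level_vlm, low_level_vla) is a hypothetical stand-in so the sketch runs end to end; it is not MemER's released code or API.

```python
# Sketch of a MemER-style hierarchical control loop, per the thread's
# description. The stubs return dummy values so the loop actually runs.
from typing import List, Tuple

Frame = bytes  # placeholder type for an image observation

def high_level_vlm(memory: List[Frame], recent: List[Frame]) -> Tuple[List[Frame], str]:
    # Stand-in for the finetuned VLM: from memory + recent frames, choose
    # task-relevant keyframes to store and predict the next subtask.
    keyframes = recent[-1:]            # dummy choice: keep the newest frame
    return keyframes, "pick up the sponge"

def low_level_vla(subtask: str, observation: Frame) -> str:
    # Stand-in for the VLA: sees only the subtask and current observation,
    # never the memory, and outputs a motor action.
    return f"action for '{subtask}'"

memory: List[Frame] = []               # keyframes selected so far
recent: List[Frame] = []               # short rolling window (e.g. 8 frames)

for t in range(3):                      # stand-in control loop
    recent = (recent + [b"frame%d" % t])[-8:]
    new_keyframes, subtask = high_level_vlm(memory, recent)
    memory.extend(new_keyframes)        # 1) the VLM curates visual memory
    action = low_level_vla(subtask, recent[-1])  # 2)+3) subtask -> action
    print(t, subtask, action)
```

The design point worth noticing is the asymmetry: only the high-level VLM ever sees the memory, so the low-level policy stays cheap and reactive.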
Why not feed the policy a long video history? This is computationally expensive and brittle under covariate shift during deployment. Key idea: the most relevant information is often in just a few keyframes. (2/8)
1
0
14
Rollouts in the real world are slow and expensive. What if we could roll out trajectories entirely inside a world model (WM)? Introducing 🚀Ctrl-World🚀, a generative manipulation WM that can interact with advanced VLA policies in imagination. 🧵1/6
5
40
210
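For intuition, here is a hedged sketch of a policy-in-imagination rollout as the tweet describes it: the VLA acts, and a learned world model, not the real robot, produces the next observation. world_model_step and vla_policy are hypothetical stand-ins, not Ctrl-World's actual interface.

```python
# Sketch: evaluating a VLA policy entirely inside a generative world model.
# Both functions are dummy stand-ins so the rollout runs as written.

def world_model_step(obs, action):
    # Stand-in for the world model: predict the next observation
    # conditioned on the current one and the policy's action.
    return obs + [action]

def vla_policy(obs, instruction):
    # Stand-in for an off-the-shelf VLA policy under evaluation.
    return f"act@{len(obs)}"

def imagined_rollout(init_obs, instruction, horizon=50):
    obs, trajectory = init_obs, []
    for _ in range(horizon):
        action = vla_policy(obs, instruction)   # the policy acts...
        obs = world_model_step(obs, action)     # ...the WM "renders" the result
        trajectory.append((action, obs))
    return trajectory                           # zero real-robot time spent

traj = imagined_rollout(init_obs=[], instruction="stack the cups")
```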
We trained OmniVLA, a robotic foundation model for navigation conditioned on language, goal poses, and images. Initialized with OpenVLA, it leverages Internet-scale knowledge for strong OOD performance. Great collaboration with @CatGlossop, @shahdhruv_, and @svlevine.
6
64
328
Exploration is fundamental to RL. Yet policy gradient methods often collapse: during training they fail to explore broadly, and converge into narrow, easily exploitable behaviors. The result is poor generalization, limited gains from test-time scaling, and brittleness on tasks…
18
136
1K
Can we teach dexterous robot hands manipulation without human demos or hand-crafted rewards? Our key insight: Use Vision-Language Models (VLMs) to scaffold coarse motion plans, then train an RL agent to execute them with 3D keypoints as the interface. 1/7
1
14
64
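A hedged sketch of the interface the tweet describes: the VLM writes a coarse plan as target 3D keypoint positions, and the RL agent is rewarded for tracking it. The (T, K, 3) waypoint layout and the distance-based reward are my assumptions, not the paper's exact formulation.

```python
# Sketch: VLM-scaffolded plan over 3D keypoints, consumed as an RL reward.
import numpy as np

def vlm_coarse_plan(task: str) -> np.ndarray:
    # Stand-in for the VLM planner: (T, K, 3) waypoints for K keypoints.
    return np.zeros((10, 4, 3))

def keypoint_tracking_reward(tracked: np.ndarray, plan: np.ndarray, t: int) -> float:
    # Dense reward: negative mean distance from the tracked keypoints
    # (shape (K, 3)) to the plan's waypoint at step t. One simple choice.
    return -float(np.linalg.norm(tracked - plan[t], axis=-1).mean())

plan = vlm_coarse_plan("reorient the cube")
r = keypoint_tracking_reward(np.ones((4, 3)), plan, t=0)
```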
Meet ProVox: a proactive robot teammate that gets you 🤖❤️🔥 ProVox models your goals and expectations before a task starts — enabling personalized, proactive help for smoother, more natural collaboration. All powered by LLM commonsense. Recently accepted at @ieeeras R-AL! 🧵1/7
3
14
69
Can we collect robot dexterous hand data directly with a human hand? Introducing DexUMI: a 0-teleoperation, 0-re-targeting dexterous hand data collection system → autonomously completes precise, long-horizon, and contact-rich tasks. Project Page: https://t.co/z3CUS5wgMx
9
51
254
Robotic models are advancing rapidly, but how do we scale their improvement? 🤖 We propose a recipe for batch online RL (train offline with online rollouts) that enables policies to self-improve without the complications of online RL. More: https://t.co/vr1IZqbAiq (1/8)
1
16
110
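A hedged sketch of the batch online RL recipe as stated in the tweet: alternate between collecting a batch of rollouts with the current frozen policy and retraining offline on everything gathered so far. All names below are stand-ins; see the linked paper for the actual recipe.

```python
# Sketch: batch online RL ("train offline with online rollouts").
# Dummy stand-ins keep the loop runnable as written.

def collect_rollouts(policy, n_episodes):
    # Stand-in: run the frozen policy on the robot and record trajectories.
    return [("traj", policy, i) for i in range(n_episodes)]

def train_offline(dataset):
    # Stand-in: fit a new policy on the aggregated dataset
    # (e.g. offline RL or filtered behavior cloning).
    return f"policy_after_{len(dataset)}_trajs"

dataset, policy = [], "initial_policy"
for round_idx in range(3):                              # batch online iterations
    dataset += collect_rollouts(policy, n_episodes=20)  # online rollouts
    policy = train_offline(dataset)                     # offline training
```

This sidesteps the instability of fully online RL because the policy is never updated mid-deployment; each rollout batch uses a fixed snapshot.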
We released our most recent research on learning a generalized visual navigation policy from scalable but low-quality, action-free passive data sources. Our policy can navigate various robots in diverse environments. web: https://t.co/ghjmJp7f8K paper: https://t.co/LzaMKWMflA
7
9
36
Trained a robot policy and want to measure generalization? Generalization evals vary across studies, and this makes progress hard to track. Enter ★-Gen: a taxonomy of generalization for manipulation that guides evaluation design and fosters new benchmarks. 🧵⬇️ 1/8
3
15
104
Current RL finetuning methods are too inefficient to make autonomous real-world robot learning tractable. We propose Simulation-Guided Fine-Tuning (SGFT) - a simple, general sim2real framework that extracts structured exploration priors from sim to accelerate real-world RL. 🧵1/6
1
26
118
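The tweet only says SGFT extracts structured exploration priors from simulation, so as one plausible (assumed) instantiation, here is a sketch of potential-based reward shaping with a sim-trained value function, which steers real-world exploration toward states the sim rates highly without changing the optimal policy.

```python
# Sketch: sim-derived prior guiding real-world RL via potential-based
# shaping. Whether this matches SGFT's exact mechanism is an assumption.

def v_sim(state) -> float:
    # Stand-in for a value function learned in simulation.
    return -abs(float(state))

def shaped_reward(r_real: float, s, s_next, gamma: float = 0.99) -> float:
    # Potential-based shaping preserves optimal policies while adding a
    # dense signal that pulls exploration toward sim-promising states.
    return r_real + gamma * v_sim(s_next) - v_sim(s)

print(shaped_reward(0.0, s=2.0, s_next=1.0))  # positive: moved toward the goal
```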