Lots of great hype about the large multimodal models right now. But how do we get to trillions of tokens for embodied actions? My bet is VR teleoperation and shared autonomy.
@tom_doerr
Contact dynamics, preferred impedance, etc. are poorly represented in simulation and only indirectly in video. Physical interactions hide a lot more complexity than is obvious on the surface.
@BerntBornich
I think this is way tougher than it seems. For the ChatGPT approach to work, we need lots of varied data in different environments. Obvious privacy + legal concerns might stop any commercial teleop system from achieving this.
@chris_j_paxton
Not in any way straightforward. Huge challenges to overcome with respect to privacy and legal issues. Not to mention safety. But as we are increasingly able to deploy droids with base autonomy + teleop, I think it has a path to collecting the diversity of data needed.
@HDPbilly
As long as you don't lose money deploying the droids, with some base autonomy and good teleop covering the rest, you get cleaner data and a flywheel. Repeatedly deploying to the fleet to close the full evaluation loop requires a good fleet size.
@BerntBornich
When android manufacturing starts scaling up fast, relying only on VR teleoperation to reach trillions of sensorimotor tokens will be expensive and slow.
The simulator will help, but I think the key will be to amplify a tiny number of teleop demos with online RL for millions of…
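A minimal sketch of what that amplification loop could look like, assuming placeholder `env` and `policy` interfaces (nothing here is 1X's actual stack): seed an off-policy replay buffer with the handful of teleop demos, then let online rollouts generate the bulk of the experience.

```python
import random
from collections import deque

# Hedged sketch: env/policy interfaces are assumptions for illustration,
# not any specific robot stack.

class ReplayBuffer:
    """Off-policy buffer seeded with teleop demos, then filled online."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def amplify_demos(env, policy, demos, steps=1_000_000, batch_size=256):
    buffer = ReplayBuffer()
    for transition in demos:      # a tiny number of teleop demos seeds the buffer
        buffer.add(transition)
    obs = env.reset()
    for _ in range(steps):        # ...then online rollouts supply the volume
        action = policy.act(obs)
        next_obs, reward, done = env.step(action)
        buffer.add((obs, action, reward, next_obs, done))
        policy.update(buffer.sample(batch_size))  # off-policy RL update
        obs = env.reset() if done else next_obs
    return policy
```

The point of the design is the ratio: a few thousand demo transitions steer exploration, while millions of online transitions supply the sensorimotor token volume.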
@BerntBornich
@1x__tech
If you use deep learning to associate semantics with detected 3D environment features, you could plug those semantics into LLMs, e.g. "Approach the blue door with the weird handle."
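A rough sketch of that grounding step, with a hypothetical `Detection` type standing in for whatever the 3D detector emits: serialize the detections into text, hand them to the LLM alongside the instruction, and map its answer back to a 3D goal.

```python
from dataclasses import dataclass

# Hedged sketch: Detection and the prompt format are assumed interfaces,
# not a specific product API.

@dataclass
class Detection:
    label: str        # semantic class from the detector, e.g. "door"
    attributes: list  # e.g. ["blue", "weird handle"]
    position: tuple   # (x, y, z) in the robot's frame


def detections_to_prompt(detections, instruction):
    lines = ["Detected objects:"]
    for i, det in enumerate(detections):
        attrs = ", ".join(det.attributes)
        lines.append(f"{i}: {det.label} ({attrs}) at xyz={det.position}")
    lines.append(f"Instruction: {instruction}")
    lines.append("Which object index should the robot approach?")
    return "\n".join(lines)


prompt = detections_to_prompt(
    [Detection("door", ["blue", "weird handle"], (2.1, 0.4, 0.0)),
     Detection("door", ["white"], (5.0, -1.2, 0.0))],
    "Approach the blue door with the weird handle.",
)
# The LLM's answer (an object index) maps back to a 3D position
# the motion planner can use as a navigation goal.
```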