Lots of great hype about the large multimodal models right now. But how do we get to trillions of tokens for embodied actions? My bet is VR teleoperation and shared autonomy.
@tom_doerr
Contact dynamics, preferred impedance, etc. are poorly represented in simulation and only indirectly in video. Physical interactions hide a lot more complexity than is obvious on the surface.
@BerntBornich
I think this is way tougher than it seems. For the ChatGPT approach to work, we need lots of varied data in different environments. Obvious privacy + legal concerns might stop any commercial teleop system from achieving this.
@chris_j_paxton
Not in any way straightforward. Huge challenges to overcome with respect to privacy and legal issues. Not to mention safety. But as we are increasingly able to deploy droids with base autonomy + teleop, I think it has a path to collecting the diversity of data needed.
@HDPbilly
As long as you don't lose money deploying the droids, with some base autonomy and good teleop covering the rest, you get cleaner data and a flywheel. Repeatedly deploying to the fleet to close the full evaluation loop requires a good fleet size.
@BerntBornich
When android manufacturing starts scaling up fast, relying only on VR teleoperation to reach trillions of sensorimotor tokens will be expensive and slow.
The simulator will help, but I think the key will be to amplify a tiny number of teleop demos with online RL for millions of…
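A minimal sketch of what that amplification loop could look like, assuming placeholder `env` and `policy` interfaces (nothing here is 1X's actual stack): seed an off-policy replay buffer with the handful of teleop demos, then let online rollouts generate the bulk of the experience.

```python
import random
from collections import deque

# Hedged sketch: env/policy interfaces are assumptions for illustration,
# not any specific robot stack.

class ReplayBuffer:
    """Off-policy buffer seeded with teleop demos, then filled online."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def amplify_demos(env, policy, demos, steps=1_000_000, batch_size=256):
    buffer = ReplayBuffer()
    for transition in demos:      # a tiny number of teleop demos seeds the buffer
        buffer.add(transition)
    obs = env.reset()
    for _ in range(steps):        # ...then online rollouts supply the volume
        action = policy.act(obs)
        next_obs, reward, done = env.step(action)
        buffer.add((obs, action, reward, next_obs, done))
        policy.update(buffer.sample(batch_size))  # off-policy RL update
        obs = env.reset() if done else next_obs
    return policy
```

The point of the design is the ratio: a few thousand demo transitions steer exploration, while millions of online transitions supply the sensorimotor token volume.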
@BerntBornich
@1x__tech
If you use deep learning to associate semantics with detected 3D environment features, you could plug those semantics into LLMs, e.g. "Approach the blue door with the weird handle."
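A rough sketch of that grounding step, with a hypothetical `Detection` type standing in for whatever the 3D detector emits: serialize the detections into text, hand them to the LLM alongside the instruction, and map its answer back to a 3D goal.

```python
from dataclasses import dataclass

# Hedged sketch: Detection and the prompt format are assumed interfaces,
# not a specific product API.

@dataclass
class Detection:
    label: str        # semantic class from the detector, e.g. "door"
    attributes: list  # e.g. ["blue", "weird handle"]
    position: tuple   # (x, y, z) in the robot's frame


def detections_to_prompt(detections, instruction):
    lines = ["Detected objects:"]
    for i, det in enumerate(detections):
        attrs = ", ".join(det.attributes)
        lines.append(f"{i}: {det.label} ({attrs}) at xyz={det.position}")
    lines.append(f"Instruction: {instruction}")
    lines.append("Which object index should the robot approach?")
    return "\n".join(lines)


prompt = detections_to_prompt(
    [Detection("door", ["blue", "weird handle"], (2.1, 0.4, 0.0)),
     Detection("door", ["white"], (5.0, -1.2, 0.0))],
    "Approach the blue door with the weird handle.",
)
# The LLM's answer (an object index) maps back to a 3D position
# the motion planner can use as a navigation goal.
```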