Note that, for your "bandwidth" argument, it is possible to reduce complex visual scenes to segmented maps of all the objects, with metadata (position, rotation, graspable regions, etc.) of use to a robot.
Such segmentation models run in real time on current hardware.
These reduced scenes are much smaller and can fit in an LLM's input token context, if you want to do it that way.
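As a rough illustration of why this fits, here is a sketch of serializing a segmented scene into a compact text form. The schema, field names, and numbers are all made up for the example, not taken from any particular segmentation model; the point is just that each object costs a few dozen tokens rather than megapixels:

```python
# Hypothetical output of a segmentation pipeline: one record per detected
# object, with the pose and grasp metadata a robot planner would care about.
# All field names and values here are illustrative, not a real API.
scene = [
    {"id": 0, "label": "mug", "pos_m": [0.42, -0.10, 0.81],
     "rot_quat": [0.0, 0.0, 0.71, 0.71], "graspable": ["handle", "rim"]},
    {"id": 1, "label": "table", "pos_m": [0.50, 0.00, 0.75],
     "rot_quat": [0.0, 0.0, 0.0, 1.0], "graspable": []},
]

def scene_to_prompt(scene):
    """Flatten the segmented scene into compact lines for an LLM context.

    A raw camera frame is megabytes of pixels; this reduction costs a few
    dozen tokens per object, so even a cluttered scene fits comfortably
    in a context window.
    """
    lines = []
    for obj in scene:
        grasp = ",".join(obj["graspable"]) or "none"
        lines.append(
            f'obj{obj["id"]} {obj["label"]} '
            f'pos={obj["pos_m"]} rot={obj["rot_quat"]} grasp={grasp}'
        )
    return "\n".join(lines)

print(scene_to_prompt(scene))
```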
The same applies to sound. You can also use narrow AI agents to handle robotic tasks: the high-level agent issues commands, and the narrow one is much more efficient because it consumes the robot's proprioception and planner frames directly to carry out the task.
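A minimal sketch of that split, with every class and method name hypothetical: the high-level (LLM) agent emits an occasional symbolic command, while the narrow controller runs a tight loop over proprioception and planner frames and never puts the LLM in the loop:

```python
from dataclasses import dataclass

@dataclass
class Command:
    """Low-rate symbolic command from the high-level (LLM) agent."""
    verb: str        # e.g. "grasp"
    target_id: int   # object id from the segmented scene

class FakeRobot:
    """Stand-in for a real robot interface; these methods and their
    semantics are assumptions made for this sketch."""
    def __init__(self):
        self.progress = 0.0
    def read_proprioception(self):
        return {"progress": self.progress}   # joint angles, forces, etc.
    def plan_step(self, cmd, state):
        return {"delta": 0.01}               # next planner frame
    def apply(self, frame):
        self.progress += frame["delta"]
    def done(self, cmd, state):
        return state["progress"] >= 1.0

def narrow_execute(cmd: Command, robot, max_steps: int = 1000) -> bool:
    """Narrow agent: a fast control loop over proprioception and planner
    frames. The high-level agent only ever sees the final result, which
    is what makes this arrangement cheap in LLM bandwidth."""
    for _ in range(max_steps):
        state = robot.read_proprioception()
        frame = robot.plan_step(cmd, state)
        robot.apply(frame)
        if robot.done(cmd, state):
            return True
    return False  # report failure back up to the high-level agent

print(narrow_execute(Command("grasp", target_id=0), FakeRobot()))
```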
Waymo claims to be on version 5 of its driver stack, that the new version relies heavily on ML, and that it is general enough that the company is starting driverless service in LA right now. This is consistent with the expectation that self-driving is solvable and profitable.