Wild. If I’m reading the paper right, this uses the same dataset as RT-1 to ground the fine-tuning of the robot-commanding tokens; they just get to fine-tune an off-the-shelf multimodal transformer rather than having to build a custom solution (as in RT-1), and it works better.