It seems to do something similar to Gato, where everything is just serialized into tokens, which is pretty cool.
I wonder if they are just using a standard transformer for everything, or some sort of diffusion model for the images inside the model?
What does it mean for perception to compress a frame of video to 1k tokens? What kind of information gets lost when you do this?
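For a sense of the arithmetic, here's a minimal sketch of how a frame could end up as roughly 1k discrete tokens via patch-based quantization. The frame resolution (512x512), patch size (16x16), and codebook size (8192) are assumptions chosen to make the numbers concrete, not details from the paper, and the "encoder" is a placeholder rather than a learned tokenizer.

```python
import numpy as np

FRAME_H, FRAME_W = 512, 512          # assumed input resolution
PATCH = 16                           # assumed patch size (ViT/VQ-style)
CODEBOOK_SIZE = 8192                 # assumed number of discrete codes

def tokenize_frame(frame: np.ndarray) -> np.ndarray:
    """Map an (H, W, 3) uint8 frame to a grid of discrete token ids.

    The quantizer here is a stand-in (patch mean hashed to a code id);
    a real tokenizer would use a learned VQ-VAE / ViT encoder.
    """
    h_tokens = FRAME_H // PATCH      # 32
    w_tokens = FRAME_W // PATCH      # 32
    tokens = np.zeros((h_tokens, w_tokens), dtype=np.int64)
    for i in range(h_tokens):
        for j in range(w_tokens):
            patch = frame[i*PATCH:(i+1)*PATCH, j*PATCH:(j+1)*PATCH]
            # Placeholder quantization: map the patch mean into the codebook.
            tokens[i, j] = int(patch.mean() * 31) % CODEBOOK_SIZE
    return tokens

frame = np.random.randint(0, 256, (FRAME_H, FRAME_W, 3), dtype=np.uint8)
tokens = tokenize_frame(frame)
print(tokens.shape, tokens.size)     # (32, 32) -> 1024 tokens per frame

# What gets lost: each 16x16x3 patch (768 bytes of pixels) collapses to one
# id (~13 bits with an 8192-entry codebook), so fine texture, small text,
# and sub-patch detail are discarded by the compression.
```

Under these assumptions, each token summarizes a 16x16 pixel region, so the answer to "what gets lost" is mostly anything smaller than a patch: fine texture, small on-screen text, and subtle per-pixel differences between frames.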