It seems to do something similar to Gato, where everything is just serialized into tokens, which is pretty cool.
I wonder if they are just using a standard transformer for everything, or some sort of diffusion model for the images inside the model?
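For concreteness, here's a minimal sketch of what Gato-style serialization roughly looks like: every modality gets mapped to integer tokens in a shared vocabulary and concatenated into one flat sequence for a single autoregressive transformer. The tokenizers, vocab sizes, and offsets below are all made up for illustration, not anything from the paper.

```python
# Hypothetical sketch of Gato-style serialization. All token IDs and vocab
# layout here are invented; real systems use learned tokenizers/codebooks.

def tokenize_text(text: str) -> list[int]:
    # Stand-in for a real subword tokenizer (e.g. SentencePiece).
    return [hash(w) % 32_000 for w in text.split()]

def tokenize_image_patches(num_patches: int = 256) -> list[int]:
    # Stand-in for discrete image tokens (e.g. indices into a learned VQ
    # codebook), offset so they occupy their own slice of the shared vocab.
    return [32_000 + i % 1_024 for i in range(num_patches)]

def tokenize_actions(actions: list[float]) -> list[int]:
    # Continuous actions in [-1, 1] binned into 1,024 discrete buckets.
    return [33_024 + min(int((a + 1) / 2 * 1_024), 1_023) for a in actions]

# One training example: text prompt, one frame's worth of image tokens, actions.
sequence = (
    tokenize_text("stack the red block")
    + tokenize_image_patches()
    + tokenize_actions([0.1, -0.4, 0.9])
)
print(len(sequence), sequence[:8])
```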
What does it mean for perception to compress a frame of video to 1k tokens? What kind of information gets lost when you do this?
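On the 1k-tokens question, one common way to get a frame down to that size is ViT-style patching: chop the image into a grid of patches and emit one token (or one codebook index) per patch, so the token count is set by the patch size rather than the pixel count, and everything below the patch scale has to be squeezed into that single code. A rough sketch of the arithmetic (the resolution and patch size here are my assumptions, not anything stated about the model):

```python
import numpy as np

# Assumed numbers: a 256x256 RGB frame cut into 8x8 patches gives
# 32 * 32 = 1024 patches, i.e. roughly 1k tokens per frame.
frame = np.random.rand(256, 256, 3)
patch = 8

h, w, c = frame.shape
patches = (
    frame.reshape(h // patch, patch, w // patch, patch, c)
         .transpose(0, 2, 1, 3, 4)
         .reshape(-1, patch * patch * c)  # (1024, 192): one row per patch
)
print(patches.shape)  # (1024, 192)

# In a real tokenizer each 192-dim patch would typically be mapped to a single
# learned code (e.g. nearest entry in a VQ codebook), so whatever distinguishes
# two patches that land on the same code -- fine texture, exact pixel values,
# sub-patch motion -- is what gets lost.
```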