How hard do you think it would be to do Image GPT but for video? That sounds like it could be pretty cool to see. Probably could be used to create some pretty trippy shit. Once it gets really good it could be used in robotics. Come to think of it, isn’t that sorta what self-driving cars need? Something that looks at a video of the various things happening around the car and predicts what’s going to happen next?
Video is just a very large image (n times bigger). So as a quick heuristic, you can say that whatever you can do with images, you can do with video, just n times more expensive… Since iGPT is pretty expensive, I don’t expect iGPT for video any more than I expect it for 512px images. With efficient attention mechanisms and hierarchy, it seems a lot more plausible. There are already RNNs for 64px video out to 25 frames, for example. I’m not sure directly modeling video is all that useful for self-driving cars. Working at the pixel level is useful pretraining, but it’s not necessarily where you want to be for planning. (Would MuZero play Go better if we forced it to emit, from the latent space it uses for planning, a 1024px RGB image of a photorealistic Go board at every step in a rollout? Most attempts to do planning while forcing reconstruction of hypothetical states haven’t shown good results.)
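To put rough numbers on that “n times more expensive” heuristic, here’s a back-of-the-envelope sketch (the resolutions and frame counts below are illustrative assumptions for this thread, not figures from any paper beyond iGPT’s one-token-per-pixel setup) of how sequence length, and with it the roughly quadratic self-attention cost, grows going from iGPT-scale images to 512px images and short video clips:

```python
# Back-of-the-envelope: sequence length and quadratic attention cost for
# iGPT-style pixel modeling of images vs. video. iGPT clustered RGB into a
# 9-bit palette, i.e. one token per pixel; everything else here is an
# illustrative assumption, not a published figure.

def seq_len(px, frames=1):
    """Token count for a px-by-px clip of `frames` frames, one token/pixel."""
    return px * px * frames

def rel_attention_cost(n_tokens, baseline):
    """Self-attention FLOPs scale ~O(n^2); report cost relative to baseline."""
    return (n_tokens / baseline) ** 2

base = seq_len(64)  # 64px image: 4,096 tokens (roughly the largest iGPT resolution)
for label, n in [
    ("64px image",             seq_len(64)),
    ("512px image",            seq_len(512)),
    ("64px video, 25 frames",  seq_len(64, frames=25)),
]:
    print(f"{label:24s} {n:>9,d} tokens  ~{rel_attention_cost(n, base):>8,.0f}x attention cost")
```

Note that under this crude estimate the 512px image (262,144 tokens, ~4,096x) actually comes out worse than the 25-frame 64px clip (102,400 tokens, ~625x), which is the point: dense pixel-level attention is what’s prohibitive, whether the extra pixels come from resolution or from time.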
Right. The use case I had in mind for self-driving cars was the standard “You see someone walking by the edge of the street; are they going to step out into the street or not? It depends on e.g. which way they are facing, whether they just dropped something into the street, … etc.” That seems like something where pixel-based video prediction would be superior to e.g. classifying the entity as a pedestrian and then adding a pedestrian token to your 3D model of your environment.