Video is just a very large image (n times bigger). So as a quick heuristic, you can say that whatever you can do with images, you can do with video, just n times more expensive… Since iGPT is pretty expensive, I don’t expect iGPT for video any more than I expect it for 512px images. With efficient attention mechanisms and hierarchy, it seems a lot more plausible. There are already RNNs for 64px video out to 25 frames, for example. I’m not sure directly modeling video is all that useful for self-driving cars. Working at the pixel level is useful pretraining, but it’s not necessarily where you want to be for planning. (Would MuZero play Go better if we forced it to emit, at every step of a rollout, a 1024px RGB image of a photorealistic Go board from the latent space it uses for planning? Most attempts to do planning while forcing reconstruction of hypothetical states don’t show good results.)
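To make the MuZero point concrete, here is a minimal sketch (assuming PyTorch; `LatentDynamics`, `ValueHead`, and `rollout_value` are hypothetical names for illustration, not MuZero’s actual code). Planning only ever consumes the latent state and a scalar value estimate; the inline comment marks where a forced pixel decoder would be bolted on, adding per-step compute and a reconstruction constraint the planner never uses.

```python
# Hypothetical sketch of a MuZero-style latent rollout (not MuZero's real code).
import torch
import torch.nn as nn

LATENT = 64  # size of the learned latent state

class LatentDynamics(nn.Module):
    """g(s, a) -> s': steps the latent state forward given an action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.step = nn.Sequential(
            nn.Linear(LATENT + n_actions, 128), nn.ReLU(),
            nn.Linear(128, LATENT),
        )

    def forward(self, state, action_onehot):
        return self.step(torch.cat([state, action_onehot], dim=-1))

class ValueHead(nn.Module):
    """v(s): evaluates a latent state; this scalar is all planning needs."""
    def __init__(self):
        super().__init__()
        self.v = nn.Linear(LATENT, 1)

    def forward(self, state):
        return self.v(state)

def rollout_value(state, actions, dynamics, value, n_actions):
    """Unroll a candidate action sequence purely in latent space."""
    for a in actions:
        onehot = nn.functional.one_hot(torch.tensor(a), n_actions).float()
        state = dynamics(state, onehot)
        # Forcing reconstruction of hypothetical states would mean adding:
        #   pixels = decoder(state)  # e.g. a 1024x1024x3 image per step
        # and training `state` to support it -- extra compute and an extra
        # constraint on the latent, neither of which the planner uses.
    return value(state)

n_actions = 4
dynamics, value = LatentDynamics(n_actions), ValueHead()
s0 = torch.zeros(LATENT)
print(rollout_value(s0, [0, 2, 1], dynamics, value, n_actions).item())
```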
Right. The use case I had in mind for self-driving cars was the standard “You see someone walking by the edge of the street; are they going to step out into the street or not? It depends on which way they are facing, whether they just dropped something into the street, etc.” That seems like something where pixel-based image prediction would be superior to, e.g., classifying the entity as a pedestrian and then adding a pedestrian token to your 3D model of your environment.