This is all generally true, but it also suffers from a key performance problem: the various bits/variables in the high-level scene description are not all equally useful.
For example, consider an agent that competes in something like a Quake world, where it receives only a raw visual pixel feed. A very detailed graphics pipeline relies on noise (literally, as in Perlin-style noise functions) to create huge amounts of micro-detail in local texturing, displacements, etc.
If you use a pure compression criterion, the encoder/vision system has to learn to essentially invert the noise functions, which as we know is computationally intractable. This ends up wasting a lot of computational effort chasing small gains in noise modelling, even when those details are irrelevant for high-level goals. You could actually just turn off the texture details completely and still get all of the key information you need to play the game.
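To make that concrete, here is a toy sketch, not anything from a real engine. The 1-D "scene", `render`, `locate_enemy`, and `mse` are all hypothetical names made up for illustration, and plain uniform random noise stands in for Perlin-style texture detail. It compares what a reconstruction (compression-style) loss sees with what the task actually needs.

```python
import random

def render(enemy_pos, width=64, texture_amp=0.3, seed=0):
    """Hypothetical 1-D 'frame': a bright 5-pixel enemy plus texture noise."""
    rng = random.Random(seed)
    # Task-relevant structure: 1.0 where the enemy is, 0.0 elsewhere.
    structure = [1.0 if abs(x - enemy_pos) <= 2 else 0.0 for x in range(width)]
    # Task-irrelevant micro-detail (a stand-in for procedural texture noise).
    texture = [rng.uniform(-texture_amp, texture_amp) for _ in range(width)]
    pixels = [s + t for s, t in zip(structure, texture)]
    return structure, pixels

def locate_enemy(pixels):
    """Task readout: center of the brightest 5-pixel window."""
    centers = range(2, len(pixels) - 2)
    return max(centers, key=lambda c: sum(pixels[c - 2:c + 3]))

def mse(a, b):
    """Mean squared error between two equal-length pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

structure, pixels = render(enemy_pos=20)

# A model that captures only the high-level structure still pays this
# reconstruction penalty, and all of it comes from the texture noise.
residual = mse(structure, pixels)

# The task answer is the same with texture detail on or off.
pos_with_texture = locate_enemy(pixels)
pos_without_texture = locate_enemy(structure)

print("reconstruction residual from texture:", residual)
print("enemy position (textured, untextured):", pos_with_texture, pos_without_texture)
```

The residual stays nonzero however well the structure is modelled, so a pure compression objective keeps pushing capacity toward the noise, while the readout that matters for play is unchanged when the texture is switched off.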