Compression itself is not the goal of computer vision.
Vision-as-compression really is isomorphic to the commonly cited definition of computer vision as the inverse problem of computer graphics. Graphics starts with an abstract scene description and applies a transformation to obtain an image; vision attempts to back-infer the scene description from the raw image pixels. This process can be interpreted as a form of image compression, because the scene description is a far more parsimonious description of the image than the raw pixels. See section 3.4.1 of my book for more details (the equivalent interpretation of vision-as-Bayesian-inference may also be of interest to some).
This is all broadly true, but it suffers from a key performance problem: the various bits/variables in the high-level scene description are not all equally useful.
For example, consider an agent that competes in something like a Quake world, where it receives only a raw visual pixel feed. A detailed graphics pipeline relies on noise (literally, Perlin-style noise functions) to create huge amounts of micro-detail in local texturing, displacements, etc.
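To make the asymmetry concrete, here is a toy sketch of hash-based value noise (a simplified cousin of Perlin noise; the function name and constants are my own, not from any particular engine). A single integer seed, a few bytes of "scene description", deterministically expands into arbitrarily many pixels of micro-detail:

```python
import math
import random

def value_noise_1d(x, seed=0):
    """Toy 1D value noise: pseudo-random values at integer lattice
    points, smoothly interpolated between them."""
    def lattice(i):
        # Deterministic per-lattice-point value derived from the seed.
        return random.Random(i * 1_000_003 + seed).random()
    i0 = math.floor(x)
    t = x - i0
    t = t * t * (3 - 2 * t)  # smoothstep fade, as in classic Perlin noise
    return lattice(i0) * (1 - t) + lattice(i0 + 1) * t

# A few bytes of description (the seed) generate thousands of
# detailed "pixels"; recovering the seed from the pixels is the
# hard inverse problem the text describes.
pixels = [value_noise_1d(i * 0.1, seed=42) for i in range(1000)]
```

The forward direction is trivial to compute; the inverse (recovering the seed from output samples) amounts to inverting a hash, which is why a compression-driven vision system gets stuck on it.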
If you use a pure compression criterion, the encoder/vision system has to learn to essentially invert those noise functions, which is computationally intractable. It thus wastes a great deal of computational effort chasing small gains in noise modelling, even when those details are irrelevant to high-level goals. You could turn off the texture details completely and still get all of the key information you need to play the game.
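You can see where a pure compression objective spends its bits with a two-line experiment (a hedged illustration using `zlib` as a stand-in for any lossless codec; the image sizes and noise range are arbitrary):

```python
import random
import zlib

random.seed(0)
N = 64 * 64  # one 64x64 grayscale "render" of a flat wall

# Textures turned off: every pixel the same mid-gray.
flat = bytes([128] * N)

# Textures on: the same wall plus per-pixel noise micro-detail.
textured = bytes(128 + random.randrange(-40, 41) for _ in range(N))

flat_size = len(zlib.compress(flat))
textured_size = len(zlib.compress(textured))
# The flat wall compresses to a handful of bytes; the noisy one is
# near-incompressible, so the codec pays for every irrelevant noise bit.
```

Nearly all of the compressed size of the textured image is noise that carries no gameplay-relevant information, which is exactly the effort a goal-aware encoder should skip.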