That may be true in general, but ILSVRC is much better about it. It’s run like a Kaggle competition: there is a secret test set that no one can train their algorithms on, and the number of evaluations you can run against it is limited, which is the rule at issue here. I also believe the public test set is different from the private one, which is only used at the end of the competition, and no one can see how well they are doing on that.
Doing compression is not the goal of computer vision. Compression is only the goal of (some forms of) unsupervised learning, which has fallen out of favor in the last few years. Karpathy discusses some of the issues with it here:
I couldn’t see how Unsupervised Learning based solely on images could work. To an unsupervised algorithm, a patch of pixels with a face on it is exactly as exciting as a patch that contains some weird edge/corner/grass/tree noise stuff. The algorithm shouldn’t worry about the latter but it should spend extra effort worrying about the former. But you would never know this if all you had was a billion patches! It all comes down to this question: if all you have are pixels and nothing else, what distinguishes images of a face, or objects from a random bush, or a corner in the ceiling of a room?...
I struggled with this question for a long time and the ironic answer I’m slowly converging on is: nothing. In the absence of labels, there is no difference. So unless we want our algorithms to develop powerful features for faces (and things we care about a lot) alongside powerful features for a sea of background garbage, we may have to pay in labels.
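To make the quoted point concrete, here is a tiny sketch (my own toy example, not Karpathy’s) of how a pixel-level reconstruction objective scores a “face” patch and a “background noise” patch by exactly the same rule, so it has no reason to spend extra effort on the face:

    import numpy as np

    rng = np.random.default_rng(0)
    face_patch = rng.random((16, 16))   # stand-in for a patch containing a face
    grass_patch = rng.random((16, 16))  # stand-in for edge/corner/grass/tree noise

    def reconstruction_loss(patch, reconstruction):
        # The objective only sees pixel differences; semantics never enter into it.
        return np.mean((patch - reconstruction) ** 2)

    # Equally sloppy reconstructions of both patches get (roughly) equal loss:
    blur = lambda p: p.mean() * np.ones_like(p)
    print(reconstruction_loss(face_patch, blur(face_patch)))
    print(reconstruction_loss(grass_patch, blur(grass_patch)))
    # With pixels alone, nothing tells the learner that the first error matters more.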
Doing compression is not the goal of computer vision.
Compression really is isomorphic to the commonly stated definition of computer vision as the inverse problem of computer graphics. Graphics starts with an abstract scene description and applies a transformation to obtain an image; vision attempts to back-infer the scene description from the raw image pixels. This process can be interpreted as a form of image compression, because the scene description is a far more parsimonious description of the image than the raw pixels. Read section 3.4.1 of my book for more details (the equivalent interpretation of vision-as-Bayesian-inference may also be of interest to some).
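As a toy illustration of the inverse-graphics/compression framing (my own sketch, not taken from the book; the circle scene and render function are made up purely for illustration), the forward graphics model maps a handful of scene parameters to tens of thousands of pixels, and vision tries to run that map backwards:

    # Forward (graphics) direction: a tiny scene description rendered into pixels.
    # Vision is the inverse: recover something like `scene` from `image` alone.
    import numpy as np

    def render(scene, size=256):
        cx, cy, r, intensity = scene
        ys, xs = np.mgrid[0:size, 0:size]
        image = np.zeros((size, size))
        image[(xs - cx) ** 2 + (ys - cy) ** 2 <= r ** 2] = intensity
        return image

    scene = (120.0, 90.0, 40.0, 0.8)   # 4 numbers
    image = render(scene)              # 256 * 256 = 65,536 numbers

    print("scene parameters:", len(scene))
    print("pixels:", image.size)
    # Successfully inverting the renderer is, in effect, extreme compression:
    # the description is orders of magnitude smaller than the pixel array.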
This is all generally true, but it suffers from a key practical problem: the various bits/variables in the high-level scene description are not all equally useful.
For example, consider an agent that competes in something like a Quake world, where it receives only a raw pixel feed. A very detailed graphics pipeline relies on noise (literally, Perlin-style noise functions) to create huge amounts of micro-detail in local texturing, displacements, etc.
If you use a pure compression criterion, the encoder/vision system has to learn to essentially invert those noise functions, which as we know is computationally intractable. It ends up wasting a lot of computational effort on small gains in noise modelling, even though those details are irrelevant to the high-level goals. You could turn off the texture detail completely and still get all of the key information you need to play the game.
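Here is a rough numerical sketch of that argument (the frame decomposition and the numbers are my own stand-ins, not anything from a real engine): an encoder that recovers the task-relevant structure perfectly but ignores the procedural texture still looks bad under a pure reconstruction/compression objective, because the residual is exactly the texture noise.

    import numpy as np

    rng = np.random.default_rng(0)
    size = 128

    # "Task-relevant" structure: a few flat-shaded regions (wall, floor, opponent)
    # that could be summarized by a handful of variables.
    structure = np.zeros((size, size))
    structure[:, :64] = 0.3        # wall
    structure[96:, :] = 0.6        # floor
    structure[40:70, 80:95] = 1.0  # opponent

    # Procedural micro-texture, standing in for Perlin-style detail.
    texture = 0.1 * rng.standard_normal((size, size))
    frame = structure + texture

    # An encoder that reproduces the structure and ignores the texture:
    reconstruction = structure
    mse = np.mean((frame - reconstruction) ** 2)
    print(f"reconstruction MSE: {mse:.4f}")  # ~0.01, i.e. the texture variance

    # A pure compression criterion treats that residual as failure, so the model
    # spends capacity modelling the noise, effort that buys nothing for the game.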