The problem I see here is that the mainstream AI / machine learning community measures progress mainly by this kind of contest.
Yup, two big chapters of my book are about how terrible the evaluation systems of mainstream CV and NLP are. Instead of image classification (or whatever), researchers should write programs to do lossless compression of large image databases. This metric is absolutely ungameable, and also more meaningful.
That may be true in general, but LSVRC is much better about it. It’s run like a Kaggle competition. They have a secret test set which no one can look at to train their algorithms on. They limit the number of evaluations you can run against the test set, which is the rule at issue here. I also believe that the public test set is different from the private one, which is only used at the end of the competition, and no one can see how well they are doing on that.
Doing compression is not the goal of computer vision. Compression is only the goal of (some forms of) unsupervised learning, which has fallen out of favor in the last few years. Karpathy discusses some of the issues with it here:
I couldn’t see how Unsupervised Learning based solely on images could work. To an unsupervised algorithm, a patch of pixels with a face on it is exactly as exciting as a patch that contains some weird edge/corner/grass/tree noise stuff. The algorithm shouldn’t worry about the latter but it should spend extra effort worrying about the former. But you would never know this if all you had was a billion patches! It all comes down to this question: if all you have are pixels and nothing else, what distinguishes images of a face, or objects from a random bush, or a corner in the ceiling of a room?...
I struggled with this question for a long time and the ironic answer I’m slowly converging on is: nothing. In the absence of labels, there is no difference. So unless we want our algorithms to develop powerful features for faces (and things we care about a lot) alongside powerful features for a sea of background garbage, we may have to pay in labels.
Doing compression is not the goal of computer vision.
It really is isomorphic to the commonly stated definition of computer vision as the inverse problem of computer graphics. Graphics starts with an abstract scene description and applies a transformation to obtain an image; vision attempts to back-infer the scene description from the raw image pixels. This process can be interpreted as a form of image compression, because the scene description is a far more parsimonious description of the image than the raw pixels. Read section 3.4.1 of my book for more details (the equivalent interpretation of vision-as-Bayesian-inference may also be of interest to some).
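To make that interpretation concrete, here is the generic two-part (MDL-style) formulation it gestures at; this is a standard identity rather than a quote from the book, and the symbols s (scene description) and x (image) are mine:

```latex
% Two-part code length for an image x, if we commit to a particular scene
% description s: pay for the description, then for the pixels given it.
\[
  L(x; s) \;=\; \underbrace{L(s)}_{\text{scene description}}
          \;+\; \underbrace{L(x \mid s)}_{\text{residual pixel detail}}
\]
% With Shannon-optimal codes, L(s) = -log2 p(s) and L(x|s) = -log2 p(x|s),
% so choosing the s that minimizes the total length is MAP inference: the
% "vision as Bayesian inference" reading of the same idea.
```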
This is all generally true, but it suffers from a key performance problem: the various bits/variables in the high-level scene description are not all equally useful.
For example, consider an agent that competes in something like a Quake-style game world, where it just receives a raw visual pixel feed. A very detailed graphics pipeline relies on noise (literally, Perlin-style noise functions) to create huge amounts of micro-detail in local texturing, displacements, etc.
If you use a pure compression criterion, the encoder/vision system has to learn to essentially invert the noise functions, which as we know is computationally intractable. This ends up wasting a lot of computational effort on small gains in noise modelling, even when those details are irrelevant to high-level goals. You could actually just turn off the texture details completely and still get all of the key information you need to play the game.
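A toy sketch of the mismatch being described (my own illustration, not the commenter's setup; plain Gaussian noise stands in for Perlin noise, and it only needs numpy):

```python
# A procedurally textured surface is generated from a couple of parameters plus
# a PRNG seed, but a pixel-level coder that cannot invert the noise function
# has to pay for every noise sample it sees.
import numpy as np

rng = np.random.default_rng(seed=42)      # the "noise function" state
base_color = 0.5                          # scene-level parameter
noise_amplitude = 0.1                     # scene-level parameter

# 256x256 grayscale texture: flat base color plus i.i.d. noise detail.
texture = base_color + noise_amplitude * rng.standard_normal((256, 256))

# Empirical entropy of the 8-bit quantized texture, in bits per pixel.
quantized = np.clip((texture * 255).round(), 0, 255).astype(np.uint8)
counts = np.bincount(quantized.ravel(), minlength=256)
p = counts[counts > 0] / counts.sum()
bits_per_pixel = -(p * np.log2(p)).sum()

print(f"bits a pixel-level coder must spend: ~{bits_per_pixel * 256 * 256:,.0f}")
print("parameters that actually generated the texture: 2 values + 1 seed")
```

For the agent's purposes the texture carries essentially no decision-relevant information, yet it accounts for nearly all of the bits a pure compression objective rewards.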
Is it important that it be lossless compression?
I can look at a picture of a face and know that it’s a face. If you switched a bunch of pixels around, or blurred parts of the image a little bit, I’d still know it was a face. To me it seems relevant that it’s a picture of a face, but not as relevant what all the pixels are. Does AI need to be able to do lossless compression to have understanding?
I suppose the response might be that if you have a bunch of pictures of faces, and know that they’re faces, then you ought to be able to get some mileage out of that. And even if you’re trying to remember all the pixels, there’s less information to store if you’re just diff-ing from what your face-understanding algorithm predicts is most likely. Is that it?
Well, lossless compression implies understanding. Lossy compression may or may not imply understanding.
Also, usually you can get a lossy compression algorithm from a lossless one. In image compression, the lossless method would typically be to send a scene description plus a low-entropy correction image; you can easily save bits by just skipping the correction image.
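As a minimal sketch of that recipe (my example, assuming numpy; the "scene description" here is just a fitted plane rather than a real scene model):

```python
# Lossless = send the scene description plus a correction image;
# lossy = send only the scene description and skip the correction.
import numpy as np

side = 64
rows, cols = np.meshgrid(np.arange(side), np.arange(side), indexing="ij")
image = (2 * rows + cols) // 3                    # a smooth, ramp-like "photograph"

# "Scene description": three plane coefficients fitted to the image.
A = np.stack([rows.ravel(), cols.ravel(), np.ones(side * side)], axis=1)
coeffs, *_ = np.linalg.lstsq(A, image.ravel().astype(float), rcond=None)
prediction = (A @ coeffs).round().astype(image.dtype).reshape(side, side)

residual = image - prediction                     # low-entropy correction image
assert np.array_equal(prediction + residual, image)   # lossless, by construction
lossy = prediction                                     # just drop the correction
print("correction image values range over:", residual.min(), "to", residual.max())
```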
I emphasize lossless compression because it enables strong comparisons between competing methods.
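The comparison point is easy to demonstrate with off-the-shelf lossless compressors: since decoding must be bit-exact, output size on a shared corpus is directly comparable. A quick illustration using Python's standard library (my example, not anything from the thread):

```python
# Two lossless compressors on the same data: the smaller output wins, full stop.
import bz2
import zlib

data = ("the quick brown fox jumps over the lazy dog " * 1000).encode()

for name, compress, decompress in [("zlib", zlib.compress, zlib.decompress),
                                   ("bz2", bz2.compress, bz2.decompress)]:
    out = compress(data)
    assert decompress(out) == data       # losslessness is what makes this fair
    print(f"{name}: {len(data)} bytes -> {len(out)} bytes")
```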
Not really, at least not until you start to approach Kolmogorov complexity.
In a natural image, most of the information is low-level detail that has little or no human-relevant meaning: stuff like textures, background, lighting properties, minuscule shape details, lens artifacts, lossy compression artifacts (if the image was crawled from the Internet it was probably a JPEG originally), and so on. Lots of this detail is highly redundant and/or can be well modeled by priors, so a lossless compression algorithm could be very good at finding an efficient encoding of it.
A typical image used in machine learning contests is 256 × 256 × 3 × 8 ≈ 1.57 million bits. How many bits of meaningful information (*) could it possibly contain? 10? 100? 1000? Whatever the number is, the amount of non-meaningful information certainly dominates, so an efficient lossless compression algorithm could obtain an extremely good compression ratio without modeling, and thus without understanding, any of the meaningful information. (A quick numeric check follows the footnote below.)
(* Consider the meaningful information of an image to be the number of yes-or-no questions about the image that a human would normally be interested in and could answer by looking at it, where for each question the probability of the answer being true is approximately 50% over the data set, and the set of questions is designed so that a human learns as much as possible by asking as few questions as possible, e.g. something like a game of 20 questions.)
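A quick check of the numbers in that comment (my arithmetic, taking 1000 as the generous upper end of the guesses above):

```python
# Total raw bits in a 256x256 RGB image at 8 bits per channel, versus a
# generous 1000-bit budget of "meaningful" information.
total_bits = 256 * 256 * 3 * 8
meaningful_bits = 1000
print(total_bits)                       # 1572864, i.e. ~1.57 million bits
print(meaningful_bits / total_bits)     # ~0.0006: well under 0.1% of the image
```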
I agree with your general point that working on lossless compression requires the researcher to pay attention to details that most people would consider meaningless or irrelevant. In my own text compression work, I have to pay a lot of attention to things like capitalization, comma placement, the difference between Unicode quote characters, and so on. However, I have three responses to this as a critique of the research program:
The first response is to say that nothing is truly irrelevant. Or, equivalently, the vision system should not attempt to make the relevance distinction. Details that are irrelevant in everyday tasks might suddenly become very relevant in a crime scene investigation (where did this shadow at the edge of the image come from...?). Also, even if a detail is irrelevant at the top level, it might be relevant in the interpretation process; certainly shadowing is very important in the human visual system.
The second response is that while it is difficult and time-consuming to worry about details, this is a small price to pay for the overall goal of objectivity and methodological rigor. Human science has always required a large amount of tedious lab work and unglamorous experimentation.
The third response is to say that even if some phenomenon is considered irrelevant by “end users”, scientists are interested in understanding reality for its own sake, not for the sake of applications. So pure vision scientists should be very interested in, say, categorizing textures, modeling shadows and lighting, and characterizing lens artifacts (and in fact, in my interactions with computer graphics people, I have found exactly this tendency).
By your definition of meaningful information, it’s not actually clear that a strong lossless compressor wouldn’t discover and encode that meaningful information.
For example, the presence of a face in an image is presumably meaningful information. From a compression point of view, the presence of a face and its approximate pose is also information that has a very large impact on lower-level feature coding, in that spending, say, 100 bits to represent the face and its pose could save 10x as many bits at the lowest levels. Some purely unsupervised learning systems (such as sparse coding or RBMs) do tend to find high-level features that correspond to objects (meaningful information).
Of course that does not imply that training with an unsupervised compression criterion is the best way to recognize any particular features/objects.
By your definition of meaningful information, it’s not actually clear that a strong lossless compressor wouldn’t discover and encode that meaningful information.
It could, but then again it might not. My point is that compression ratio (that is, average log-likelihood of the data under the model) is not a good proxy for “understanding”, since it can be optimized to a very large extent without modeling “meaningful” information.
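For readers less familiar with the equivalence being invoked in passing here, the standard identity (general information theory, not specific to this thread) is:

```latex
% Under a probabilistic model p, an optimal lossless code assigns the data x a
% code length of
\[
  L_p(x) \;=\; -\log_2 p(x) \ \text{bits},
\]
% so minimizing average code length over a dataset is the same as maximizing
% the average log-likelihood of that dataset under the model.
```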
Yes, good compression can be achieved without deep understanding. But a compressor with deep understanding will ultimately achieve better compression. For example, you can get good text compression results with a simple bigram or trigram model, but eventually a sophisticated grammar-based model will outperform the n-gram approach.
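A toy version of the first half of that claim, showing that a richer statistical model of the same text yields a shorter code (my sketch; it uses in-sample maximum-likelihood estimates and ignores the cost of transmitting the model itself):

```python
# Code length under a model is -log2(probability); a bigram model assigns the
# text higher probability than a unigram model, hence fewer bits per character.
import math
from collections import Counter

text = "the cat sat on the mat and the cat ate the rat " * 50

# Unigram model: each character costs -log2 p(c).
uni = Counter(text)
unigram_bits = -sum(math.log2(uni[c] / len(text)) for c in text)

# Bigram model: each character costs -log2 p(c | previous character).
pairs = Counter(zip(text, text[1:]))
bigram_bits = -sum(math.log2(pairs[a, b] / uni[a]) for a, b in zip(text, text[1:]))

print(f"unigram model: {unigram_bits / len(text):.2f} bits/char")
print(f"bigram model:  {bigram_bits / (len(text) - 1):.2f} bits/char")
```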
Huh? Understanding by whom? What exactly does the zip compressor understand?
It seems like, if anything, we should encourage researchers to focus on gameable metrics in order to slow progress on AGI?
If you really believe that slowing progress on AGI is a good thing, you should do it by encouraging young people to go into different fields, not by encouraging people to waste their careers.