I’m not saying “I think humans will always get scores better than computers on this task”. I’m saying:
Score on this task is clearly related to actual object-recognition ability, but as the error rates get low and we start looking at the more difficult examples, the relationship gets more complicated, and it starts to be important to look at what kind of failures we’re seeing on each side.
What humans find difficult here is fine-grained identification of a zillion different breeds of dog, coping with an objectively inadequate training set (kept small, presumably, to avoid intolerable boredom), and keeping track of exactly which categories the test is concerned with.
What computers find difficult here is identifying small or thin things, things whose colours and contrast are unexpected, things at unexpected angles, things represented “indirectly” (paintings, models, shadows, …), objects when a bunch of other objects are also in the frame, objects partly obscured by other things, objects identifiable only by the labels on them, …
To put it differently, it seems to me that almost none of the problems a skilled human has here are actually vision failures in any useful sense, whereas most of the problems the best computers have are. And while it’s nice that images eliciting these failures are fairly rare in the ILSVRC dataset, it’s highly plausible that difficulty handling such images is a much more serious handicap in “everyday vision tasks” than being unable to distinguish between dozens of breeds of dog, or finding it difficult to remember the hundreds of specific categories one’s expected to classify things into.
For the avoidance of doubt, I think identifying ILSVRC images with ~95% accuracy (in the sense relevant here) is really impressive. Doing it in milliseconds, even more so. There is no question that in some respects computer vision is already way ahead of human vision. But this is not at all the same thing as saying computers are better overall at “any kind of everyday vision task”, and I think the evidence from the ILSVRC results is that there are some quite fundamental ways in which computers are still much worse at vision than humans, and it’s not obvious to me that their advantages will make up for those deficiencies in the next few years.
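(For concreteness, here’s a minimal sketch of what “in the sense relevant here” means: the ILSVRC classification score is top-5 error, so a prediction counts as correct if the true label is among the classifier’s five highest-scoring guesses, and ~95% accuracy means ~5% top-5 error. The function name and toy data below are mine, purely for illustration.)

    import numpy as np

    def top5_error(scores, true_labels):
        # scores: (n_images, n_classes) array of classifier confidences
        # true_labels: (n_images,) array of correct class indices
        top5 = np.argsort(scores, axis=1)[:, -5:]          # five best guesses per image
        hits = (top5 == true_labels[:, None]).any(axis=1)  # true label among them?
        return 1.0 - hits.mean()                           # ~0.05 here means ~95% accuracy

    # Toy example: 3 images, 10 classes, random scores
    rng = np.random.default_rng(0)
    print(top5_error(rng.random((3, 10)), np.array([2, 7, 4])))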
They might. The best computers are now much better at chess than the best humans overall, even though there are (I think) still some quite fundamental things they do worse than humans do. Perhaps vision is like chess in this respect. But I don’t yet see the evidence that it is.
You’ve been making very confident pronouncements in this discussion and telling other people they don’t know what they’re talking about. May I ask what your expertise is in this area? E.g., are you a computer vision researcher yourself? (I am not. I’m a mathematician working in industry; I’ve spent much of my career working with computer input devices, and I’ve seen many times how something can (1) work well 99% of the time and yet (2) be almost completely unusable because of that last 1%. But there’s no AI in those devices, and the rare failures of something like GoogLeNet may be less harmful.)
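(A back-of-the-envelope illustration of why that last 1% bites, assuming, purely for the sake of arithmetic, that each individual action fails independently with probability 1%: the chance of getting through n actions without any failure collapses quickly as n grows.)

    # Chance of at least one failure in n independent actions,
    # each failing with probability p = 0.01 (illustrative numbers only)
    p = 0.01
    for n in (10, 100, 1000):
        print(n, 1 - (1 - p) ** n)
    # 10 -> ~0.10, 100 -> ~0.63, 1000 -> ~0.99996

So at a few hundred actions per day, a “99% reliable” device is expected to fail on you several times a day, which is exactly what makes it feel unusable.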