I have anti-predictions:
We won’t have robot butlers or maids in the next ten years.
Academic CV researchers will write a lot of papers, but there won’t be any big commercial successes that are based on dramatic improvements in CV science. This is a subtle point: there may be big CV successes, but they will be based on figuring out ways to use CV-like technology that avoids grappling with the real hardness of the problem. For example, the main big current uses of CV are in industrial applications where you can control precisely things like lighting, clutter, camera position, and so on.
Assistant and intent-based technology will continue to be annoying and not very useful.
Similar to CV, robotics will work okay when you can control precisely the nature of the task and environment. We won’t have, for example, robot construction workers.
Do driverless cars that drive on normal streets count?
Driverless cars are actually a good illustration of my point. These cars use CV at some level, but they depend fundamentally on laser range finders, GPS, and precompiled road maps. There’s no way modern CV alone could work reliably enough in such a potentially dangerous and legally fraught situation.
(for what it’s worth, I work on this robot for a living)
And how is it going?
Okay, though we’re still far from a true robot butler. I’m not sure we’re a full ten years away, though, especially if you’re tolerant about what you expect a butler to be able to do (welcome guests, take their names, point them in the right direction, answer basic questions? We can already do that. Go up a flight of stairs? Not yet.)
You can always just weld the butler on top of Spot https://www.youtube.com/watch?v=M8YjvHYbZ9w (this does not seem to be a significant blocker)
Cool project! Do you think those robots are going to be a big commercial success?
There are already quite a few of them deployed in stores in Japan, interacting with customers, so for now it’s going okay :)
The prediction about CV doesn’t seem to have aged that well in my view. Others are going fairly well!
I would be surprised if any of these predictions come true. There have already been huge advances in machine vision, and machines are starting to beat humans at many tasks. Obviously it takes time for new technology to reach the market, but 10 years is plenty. Right now there are a number of startups working on it, and the big tech companies have hired all the researchers.
The idea that computers are better than humans at any kind of everyday vision task is just not true. Papers that report “better than human” performance typically just mean that their algorithms do better than cross-annotator agreement. The field should actually regard the fact that people can write papers reporting such things as more of an embarrassment than a success, since they are really illustrating a (profound) failure of the evaluation paradigm, not deep conceptual or technical achievements.
You don’t know what you are talking about. In last year’s ImageNet Large Scale Visual Recognition Challenge, the top competitor got a 6.66% classification error rate at guessing the correct label within 5 guesses.
A human tried this challenge and estimated his error rate at 5.1%, and even that required extensive time practicing and looking up reference images.
Just recently a paper came out reporting 4.94% error. And for the last few years, the best competitor has consistently halved the best error from the year before. So by the time this year’s competition comes out it should be down to 3%!
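For readers unfamiliar with the metric being quoted, here is a minimal sketch of how top-5 classification error is computed: a prediction counts as a hit if the true label appears among the model’s five highest-scoring classes. This is my own illustration with synthetic scores and labels, not anything from the ILSVRC toolkit; `top5_error` is an assumed helper name.

```python
# Minimal sketch of the "top-5" error metric quoted above: a prediction
# counts as correct if the true label is among the model's five
# highest-scoring classes. Synthetic data only; not ILSVRC tooling.
import numpy as np

def top5_error(scores: np.ndarray, true_labels: np.ndarray) -> float:
    """scores: (n_images, n_classes); true_labels: (n_images,) class ids."""
    top5 = np.argsort(scores, axis=1)[:, -5:]           # five best guesses per image
    hits = np.any(top5 == true_labels[:, None], axis=1)
    return 1.0 - float(hits.mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_images, n_classes = 1000, 1000                     # ILSVRC uses 1000 classes
    scores = rng.random((n_images, n_classes))           # stand-in for model outputs
    labels = rng.integers(0, n_classes, size=n_images)
    print(f"top-5 error with random scores: {top5_error(scores, labels):.2%}")
    # The extrapolation above is just arithmetic: halving 6.66% gives roughly 3.3%.
```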
I’m not sure ImageNet is of sufficiently high quality that a 3% error rate is meaningful. There’s no point in overfitting to noise in the supposed right labels. I think the take-away is that image recognition has gotten really good and now we need a new benchmark/corpus, possibly focused on the special cases where humans still seem better.
You are only actually disagreeing with Daniel in so far as the recognition task in the ILSVRC is actually an “everyday vision task”, which is far from clear to me.
Well, the algorithms used are fairly general. If you can classify an image, you can detect the objects in it and where they are.
The tasks are highly interrelated. In classification, the systems search different parts of the image at different scales to try to find a match, and in localization they run a general classifier across the image and find where it detects objects.
In fact the classifier is now being used to actually describe images in natural language.
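As a rough illustration of the multi-scale sliding-window idea described above, here is a minimal localization loop built on top of an arbitrary image classifier. `classify_crop` is a hypothetical stand-in for whatever classifier one has (the actual ILSVRC entries use convolutional networks and far more sophisticated pipelines); this is a sketch of the general technique, not anyone’s real system.

```python
# Sketch: reuse a whole-image classifier for localization by scanning
# windows of several sizes across the image and keeping confident hits.
import numpy as np

def classify_crop(crop: np.ndarray) -> tuple[int, float]:
    """Hypothetical classifier: returns (class_id, confidence) for a crop."""
    return 0, float(crop.mean()) / 255.0  # dummy score, for illustration only

def sliding_window_detect(image: np.ndarray,
                          window_sizes=(64, 128, 256),
                          stride_frac=0.5,
                          threshold=0.5):
    """Return a list of (x, y, size, class_id, score) candidate detections."""
    detections = []
    h, w = image.shape[:2]
    for size in window_sizes:                          # try several scales
        stride = max(1, int(size * stride_frac))
        for y in range(0, h - size + 1, stride):
            for x in range(0, w - size + 1, stride):
                crop = image[y:y + size, x:x + size]
                cls, score = classify_crop(crop)
                if score >= threshold:                 # keep confident windows
                    detections.append((x, y, size, cls, score))
    return detections

if __name__ == "__main__":
    fake_image = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)
    print(len(sliding_window_detect(fake_image)), "candidate windows")
```

In practice the overlapping candidates would then be merged (non-maximum suppression or similar), but the point the comment is making survives: a good enough classifier can be repurposed for detection and localization.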
None of that has much to do with whether the task in question is an “everyday vision task”.
(And: How closely did you read the article about a human trying the challenge? Something like 2⁄3 of his errors were (1) a matter of not being able to identify specific varieties of dog etc. reliably, (2) not being familiar with the specific set of 1000 labels used by the ILSVRC, and (3) not having seen enough examples (typically, again, of particular varieties of dog) in the training set to be able to make a good call. I think the comparison of error rates gives a poor indication of relative performance, unless what you’re mostly interested in is classifying breeds of dog, I guess.)
He estimates an ensemble of humans could get the error down to around 3%, under extremely idealistic and totally hypothetical conditions, and with lots of hindsight bias over the mistakes he made the first time.
I did mention that even getting 5% error requires an extreme amount of effort sorting through reference images and so on, while the machine can spit out answers in milliseconds.
In the next few years computers will mop up humans on all vision tasks. Machine vision is quite nearly a solved problem.
I’m not saying “I think humans will always get scores better than computers on this task”. I’m saying:
Score on this task is clearly related to actual object recognition ability, but as the error rates get low and we start looking at the more difficult examples the relationship gets more complicated and it starts to be important to look at what kind of failures we’re seeing on each side.
What humans find difficult here is fine-grained identification of a zillion different breeds of dog, coping with an objectively inadequate training set (presumably to avoid intolerable boredom), and keeping track of the details of what categories the test is concerned with.
What computers find difficult here is identifying small or thin things, identifying things whose colours and contrast are unexpected, identifying things that are at unexpected angles, identifying things represented “indirectly” (paintings, models, shadows, …), identifying objects when there are a bunch of other objects also in the frame, identifying objects parts of which are obscured by other things, identifying objects by labels on them, …
To put it differently, it seems to me that almost none of the problems that a skilled human has here are actually vision failures in any useful sense, whereas most of the problems the best computers have are. And that while it’s nice that images that elicit these failures are fairly rare in the ILSVRC dataset, it’s highly plausible that difficulty in handling such images might be a much more serious handicap in “everyday vision tasks” than not being able to distinguish between dozens of species of dog, or finding it difficult to remember hundreds of specific categories that one’s expected to classify things into.
For the avoidance of doubt, I think identifying ILSVRC images with ~95% accuracy (in the sense relevant here) is really impressive. Doing it in milliseconds, even more so. There is no question that in some respects computer vision is already way ahead of human vision. But this is not at all the same thing as saying computers are better overall at “any kind of everyday vision task” and I think the evidence from ILSVRC results is that there are some quite fundamental ways in which computers are still much worse at vision than humans, and it’s not obvious to me that their advantages are going to make up for those deficiencies in the next few years.
They might. The best computers are now much better at chess than the best humans overall, even though there are (I think) still some quite fundamental things they do worse than humans. Perhaps vision is like chess in this respect. But I don’t see that the evidence is there yet that it is.
You’ve been making very confident pronouncements in this discussion, and telling other people they don’t know what they’re talking about. May I ask what your expertise is in this area? E.g., are you a computer vision researcher yourself? (I am not. I’m a mathematician working in industry, I’ve spent much of my career working with computer input devices, and have seen many times how something can (1) work well 99% of the time and (2) be almost completely unusable because of that last 1%. But there’s no AI in these devices and the rare failures of something like GoogLeNet may be less harmful.)