Humans are probably less reliable than deep learning systems at this point in terms of their ability to classify images and understand scenes, at least given < 1 second of response time.
Another way to frame this point is that humans are always doing multi-modal processing in the background, even for tasks which require only considering one sensory modality. Doing this sort of multi-modal cross checking by default offers better edge case performance at the cost of lower efficiency in the average case.
Another way to frame this point is that humans are always doing multi-modal processing in the background, even for tasks which require only considering one sensory modality. Doing this sort of multi-modal cross checking by default offers better edge case performance at the cost of lower efficiency in the average case.