Image recognition, courtesy of the deep learning revolution & Moore’s Law for GPUs, seems to be nearing human parity. The latest paper is “Deep Image: Scaling up Image Recognition”, Wu et al 2015 (Baidu):
We present a state-of-the-art image recognition system, Deep Image, developed using end-to-end deep learning. The key components are a custom-built supercomputer dedicated to deep learning, a highly optimized parallel algorithm using new strategies for data partitioning and communication, larger deep neural network models, novel data augmentation approaches, and usage of multi-scale high-resolution images. On one of the most challenging computer vision benchmarks, the ImageNet classification challenge, our system has achieved the best result to date, with a top-5 error rate of 5.98% - a relative 10.2% improvement over the previous best result.
...The result is the custom-built supercomputer, which we call Minwa. It is comprised of 36 server nodes, each with 2 six-core Intel Xeon E5-2620 processors. Each server contains 4 Nvidia Tesla K40m GPUs and one FDR InfiniBand (56Gb/s) which is a high-performance low-latency interconnection and supports RDMA. The peak single precision floating point performance of each GPU is 4.29TFlops and each GPU has 12GB of memory. Thanks to the GPUDirect RDMA, the InfiniBand network interface can access the remote GPU memory without involvement from the CPU. All the server nodes are connected to the InfiniBand switch. Figure 1 shows the system architecture. The system runs Linux with CUDA 6.0 and MPI MVAPICH2, which also enables GPUDirect RDMA. In total, Minwa has 6.9TB host memory, 1.7TB device memory, and about 0.6PFlops theoretical single precision peak performance...We are now capable of building very large deep neural networks up to hundreds of billions of parameters thanks to dedicated supercomputers such as Minwa.
...As shown in Table 3, the accuracy has been optimized a lot during the last three years. The best result of ILSVRC 2014, top-5 error rate of 6.66%, is not far from human recognition performance of 5.1% [18]. Our work marks yet another exciting milestone with the top-5 error rate of 5.98%, not just setting the new record but also closing the gap between computers and humans by almost half.
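As a sanity check on the Minwa numbers quoted above, here is my own back-of-the-envelope arithmetic (not from the paper) confirming that the per-node specs add up to the quoted cluster totals:

```python
# Aggregate the per-node Minwa specs quoted above and compare with the
# paper's cluster-level totals.
nodes = 36
gpus_per_node = 4
tflops_per_gpu = 4.29        # single-precision peak of one Tesla K40m
gpu_mem_gb = 12              # device memory per GPU

total_gpus = nodes * gpus_per_node                  # 144 GPUs
peak_pflops = total_gpus * tflops_per_gpu / 1000    # ~0.62 PFlops
device_mem_tb = total_gpus * gpu_mem_gb / 1000      # ~1.7 TB
host_mem_per_node_gb = 6900 / nodes                 # ~192 GB host RAM per node

print(total_gpus, round(peak_pflops, 2), round(device_mem_tb, 2),
      round(host_mem_per_node_gb))
# 144 0.62 1.73 192 -- consistent with the quoted ~0.6 PFlops, 1.7 TB device
# memory, and 6.9 TB host memory
```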
For another comparison, Table 3 on pg9 shows past performance: in 2012 the best performer reached 16.42%; 2013 knocked it down to 11.74%; and 2014 to 6.66%, or 5.98% depending on how much of a stickler you want to be, leaving a gap of ~0.8% to the 5.1% human error rate.
EDIT: Google may have already beaten 5.98% with a 5.5% error rate (and thus halved the remaining difference, to ~0.4%), according to a commenter on HN, “smhx”:
googlenet already has 5.5%, they published it at a bay area meetup, but did not officially publish the numbers yet!
On the other hand… “Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images”

From the abstract:
… A recent study revealed that changing an image (e.g. of a lion) in a way imperceptible to humans can cause a DNN to label the image as something else entirely (e.g. mislabeling a lion a library). Here we show a related result: it is easy to produce images that are completely unrecognizable to humans, but that state-of-the-art DNNs believe to be recognizable objects with 99.99% confidence (e.g. labeling with certainty that white noise static is a lion). Specifically, we take convolutional neural networks trained to perform well on either the ImageNet or MNIST datasets and then find images with evolutionary algorithms or gradient ascent that DNNs label with high confidence as belonging to each dataset class. It is possible to produce images totally unrecognizable to human eyes that DNNs believe with near certainty are familiar objects. Our results shed light on interesting differences between human vision and current DNNs, and raise questions about the generality of DNN computer vision.
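To make the gradient-ascent route concrete: starting from random noise, you simply push the input pixels in whatever direction raises the score of a chosen class. A minimal sketch of the idea, using a modern pretrained torchvision network rather than the authors’ own setup (the class index and step count are illustrative assumptions):

```python
# Sketch of generating a "fooling image": optimize the *input* pixels so a
# pretrained classifier assigns high confidence to an arbitrary class, even
# though the result still looks like noise to a human. Not the authors' code.
import torch
import torchvision.models as models

model = models.alexnet(pretrained=True).eval()
target_class = 291                 # assumed here to be "lion" in the ImageNet index
x = torch.rand(1, 3, 224, 224, requires_grad=True)   # start from random noise

optimizer = torch.optim.SGD([x], lr=0.5)
for _ in range(200):
    optimizer.zero_grad()
    score = model(x)[0, target_class]
    (-score).backward()            # gradient *ascent* on the class score
    optimizer.step()
    x.data.clamp_(0, 1)            # stay in a valid pixel range
    # (standard ImageNet mean/std normalization is omitted for brevity)

confidence = torch.softmax(model(x), dim=1)[0, target_class].item()
print(f"confidence in target class: {confidence:.4f}")   # typically close to 1.0
```

The paper’s evolutionary-algorithm route reaches the same kind of unrecognizable-but-confidently-labeled images without using gradients at all.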
I’m not sure what those or earlier results mean, practically speaking. And the increased use of data augmentation may mean that the newer neural networks don’t show that behavior, pace those papers showing it’s useful to add the adversarial examples to the training sets.
It seems like the workaround for that is to fuzz the images slightly before feeding them to the neural net?
‘Fuzzing’ and other forms of modification (I think the general term is ‘data augmentation’, and there can be quite a few different ways to modify images to increase your sample size—the paper I discuss in the grandparent spends two pages or so listing all the methods it uses) aren’t a fix.
In this case, they say they are using AlexNet, which already does some data augmentation (pg5-6).
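For concreteness, the standard training-time augmentations look something like the following; this is a generic torchvision sketch of the usual tricks (crops, flips, color jitter), not the exact pipeline of AlexNet or the Baidu paper:

```python
# Typical training-time data augmentation: every epoch the network sees a
# freshly perturbed variant of each image, multiplying the effective sample
# size without storing extra copies.
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomResizedCrop(224),                 # random crop + rescale
    T.RandomHorizontalFlip(),                 # mirror half the time
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])
# Hand train_transform to an ImageFolder/DataLoader; the transforms are applied
# on the fly during training.
```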
Further, if you treat the adversarial examples as just another data-augmentation trick and retrain the networks on them alongside the original examples, you can still generate fresh adversarial examples against the retrained networks.
Huh. That’s surprising. So what are humans doing differently? Are we doing anything differently? Should we wonder if someone given total knowledge of my optical processing could show me a picture that I was convinced was a lion even though it was essentially random?
Those rather are the questions, aren’t they? My thought when the original paper showed up on HN was that we can’t do anything remotely similar to constructing adversarial examples for a human visual cortex, and we already know of a lot of visual illusions (I’m particularly thinking of the Magic Eye autostereograms)… “Perhaps there are thoughts we cannot think”.
Hard to see how we could test it without solving AI, though.
I don’t think we’d need to solve AI to test this. If we could get a detailed enough understanding of how the visual cortex functions, it might be doable. Alternatively, we could try it on a very basic uploaded mouse or similar creature. On the other hand, if we can upload mice then we’re pretty close to uploading people, and if we can upload people we’ve got AI.
I’m not sure if NNs already do this, but perhaps using augmentation on the runtime input might help? Similar to how humans can look at things in different lights or at different angles if needed.
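Something like this is done in practice under the name “test-time augmentation” (the multi-crop evaluation used by many ImageNet entries is a form of it): run several perturbed copies of the input through the model and average the predictions. A minimal sketch, assuming a pretrained torchvision classifier (the image path is hypothetical):

```python
# Test-time augmentation: classify several jittered copies of one image and
# average the softmax outputs instead of trusting a single forward pass.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.alexnet(pretrained=True).eval()
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),   # small random crops
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def predict_with_tta(image: Image.Image, n_views: int = 10) -> torch.Tensor:
    """Class probabilities averaged over n_views randomly augmented views."""
    with torch.no_grad():
        views = torch.stack([augment(image) for _ in range(n_views)])
        return torch.softmax(model(views), dim=1).mean(dim=0)

# probs = predict_with_tta(Image.open("example.jpg"))   # hypothetical image path
# print(probs.argmax().item())
```

Whether this helps against deliberately constructed fooling images is a separate question; averaging over random crops blunts some perturbations, but it is not known to be a general defense.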
To update: the latest version of the Baidu paper now claims to have gone from the 5.98% above to 4.58%.
EDIT: on 2 June, a notification (Reddit discussion) was posted; apparently the Baidu team made far more than the usual number of submissions to test how their neural network was performing on the held-out ImageNet sample. This is problematic because it means that some amount of their performance gain is probably due to overfitting the test set (tweak a setting, submit, see if performance improves, repeat). The Google team is not accused of doing this, so probably the true state-of-the-art error rate is somewhere between the 3rd Baidu version and the last Google rate.
That is shocking and somewhat disturbing.

Human performance on image-recognition surpassed by MSR? “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, He et al 2015 (Reddit; emphasis added):
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.
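The activation itself is a one-line change to ReLU: the slope on the negative side becomes a learned parameter instead of 0 (ReLU) or a small fixed constant (leaky ReLU). A minimal sketch of the idea, not the authors’ implementation (PyTorch also ships a built-in nn.PReLU):

```python
# PReLU: f(y) = y for y > 0, a*y for y <= 0, where the slope a is learned
# (here one slope per channel, initialized to 0.25 as in the paper).
import torch
import torch.nn as nn

class PReLU(nn.Module):
    def __init__(self, num_channels: int, init_slope: float = 0.25):
        super().__init__()
        self.a = nn.Parameter(torch.full((num_channels,), init_slope))

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, channels, ...); broadcast the per-channel slope a
        a = self.a.view(1, -1, *([1] * (y.dim() - 2)))
        return torch.clamp(y, min=0) + a * torch.clamp(y, max=0)

x = torch.randn(8, 64, 32, 32)
print(PReLU(64)(x).shape)          # torch.Size([8, 64, 32, 32])
```

The other half of the paper is the initialization: drawing weights with a standard deviation on the order of sqrt(2/fan_in), adjusted for the rectifier’s negative slope, which is what lets very deep rectified networks be trained from scratch.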
(Surprised it wasn’t a Baidu team who won.) I suppose now we’ll need even harder problem sets for deep learning… Maybe video? Doesn’t seem like a lot of work on that yet compared to static image recognition.
The record has apparently been broken again: “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” (HN, Reddit), Ioffe & Szegedy 2015:
Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
...The current reported best result on the ImageNet Large Scale Visual Recognition Competition is reached by the Deep Image ensemble of traditional models Wu et al. (2015). Here we report a top-5 validation error of 4.9% (and 4.82% on the test set), which improves upon the previous best result despite using 15X fewer parameters and lower resolution receptive field. Our system exceeds the estimated accuracy of human raters according to Russakovsky et al. (2014).
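The normalization step itself is short: for every mini-batch, each activation is standardized with the batch’s own mean and variance and then rescaled by two learned parameters, so the layer can still represent the identity if it wants to. A simplified training-time sketch (inference swaps in running averages; nn.BatchNorm1d/2d is the production version):

```python
# Batch Normalization, training-time, fully-connected case: normalize each
# feature across the mini-batch, then apply a learned scale (gamma) and
# shift (beta).
import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: (batch, features); gamma, beta: (features,)
    mean = x.mean(dim=0)                    # per-feature mean over the batch
    var = x.var(dim=0, unbiased=False)      # per-feature variance over the batch
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta

x = torch.randn(32, 100) * 3 + 5            # poorly scaled activations
y = batch_norm_train(x, torch.ones(100), torch.zeros(100))
print(round(y.mean().item(), 3), round(y.std().item(), 3))   # ~0.0 and ~1.0
```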
On the human-level accuracy rate:

… About ~3% is an optimistic estimate without my “silly errors”.
...I don’t at all intend this post to somehow take away from any of the recent results: I’m very impressed with how quickly multiple groups have improved from 6.6% down to ~5% and now also below! I did not expect to see such rapid progress. It seems that we’re now surpassing a dedicated human labeler. And imo, when we are down to 3%, we’d be matching the performance of a hypothetical super-dedicated fine-grained expert human ensemble of labelers.