ImageNet classification with deep convolutional neural networks

Why this mattered

This paper marked the point at which deep convolutional neural networks became the dominant approach to large-scale visual recognition. Its importance was not just that it won ILSVRC-2012, but that it won by an unusually large margin: a 15.3% top-5 error rate versus 26.2% for the next best system. That result showed that learned hierarchical visual features, trained end-to-end on a very large labeled dataset, could outperform pipelines built around hand-engineered descriptors and task-specific classifiers. ImageNet’s scale was central: the paper demonstrated that deep models could exploit millions of labeled natural images rather than being limited by smaller academic benchmarks.

The work also made clear which ingredients were beginning to make deep learning practically viable: GPUs for tractable training, rectified nonlinearities for faster optimization, data augmentation and dropout for generalization, and sufficiently large networks to learn rich visual representations. None of these components was entirely new in isolation, but their combination produced a system whose empirical performance changed the field’s expectations. After this result, computer vision rapidly reorganized around deep convolutional architectures, and benchmarks that had previously advanced incrementally began to see large gains from deeper, larger, and better-regularized neural networks.
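As a rough illustration of two of these ingredients, the following is a minimal pure-Python sketch of a rectified linear unit and of dropout in the form the paper describes: each hidden unit is zeroed with probability 0.5 during training, and at test time all units are used with their outputs halved. The function names `relu` and `dropout`, the `rng` parameter, and the list-based representation are assumptions made for illustration; the original GPU implementation is not reproduced here.

```python
import random


def relu(values):
    """Rectified linear unit: max(0, x) applied element-wise."""
    return [max(0.0, v) for v in values]


def dropout(values, p_drop=0.5, training=True, rng=None):
    """Dropout without train-time rescaling (illustrative sketch).

    Training: each unit is zeroed independently with probability p_drop.
    Test: every unit is kept and scaled by the keep probability (1 - p_drop).
    """
    if not training:
        return [v * (1.0 - p_drop) for v in values]
    rng = rng or random.Random()
    return [0.0 if rng.random() < p_drop else v for v in values]


hidden = relu([-2.0, -0.5, 0.0, 1.5, 3.0])
print(hidden)                                 # negative inputs are clamped to zero
print(dropout(hidden, rng=random.Random(0)))  # a random subset of units is zeroed
print(dropout(hidden, training=False))        # all units kept, outputs halved
```

Many current libraries instead rescale the surviving units during training ("inverted" dropout) so that nothing changes at test time; the version above follows the test-time scaling described in the paper.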

Its influence extended beyond image classification. The success of this model helped establish the broader recipe of large datasets, high-capacity neural networks, specialized hardware, and end-to-end training that later powered advances in detection, segmentation, speech recognition, machine translation, reinforcement learning, and eventually foundation models. In that sense, the paper was paradigm-shifting because it converted deep learning from a promising but contested approach into the default experimental starting point for perception problems, and it helped launch the modern era in which progress is often driven by scaling model capacity, data, and computation together.

Abstract

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, respectively, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully connected layers we employed a recently developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
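To make the 60-million-parameter figure concrete, the sketch below tallies weights and biases layer by layer. The layer shapes are not stated in the abstract; they are assumed here from the architecture section of the full paper, including the two-GPU split that limits the second, fourth, and fifth convolutional layers to half of the previous layer's channels. The names `CONV_LAYERS`, `FC_LAYERS`, and `count_parameters` are hypothetical.

```python
# Convolutional layers as (kernel_height, kernel_width, input_channels, filters).
# Assumed shapes; the input-channel counts of 48 and 192 reflect layers that
# only see the feature maps held on their own GPU.
CONV_LAYERS = [
    (11, 11, 3, 96),
    (5, 5, 48, 256),
    (3, 3, 256, 384),
    (3, 3, 192, 384),
    (3, 3, 192, 256),
]

# Fully connected layers as (inputs, outputs). The first is assumed to take
# the flattened 6 x 6 x 256 output of the last pooling layer.
FC_LAYERS = [
    (6 * 6 * 256, 4096),
    (4096, 4096),
    (4096, 1000),
]


def count_parameters(conv_layers, fc_layers):
    """Return (conv_total, fc_total), counting weights plus one bias per output."""
    conv_total = sum(kh * kw * cin * n + n for kh, kw, cin, n in conv_layers)
    fc_total = sum(n_in * n_out + n_out for n_in, n_out in fc_layers)
    return conv_total, fc_total


conv_total, fc_total = count_parameters(CONV_LAYERS, FC_LAYERS)
total = conv_total + fc_total
print(f"convolutional: {conv_total:,}")
print(f"fully connected: {fc_total:,}")
print(f"total: {total:,}")
print(f"fully connected share: {fc_total / total:.1%}")
```

Under these assumptions the total comes to about 61 million, in line with the abstract's figure of 60 million, and the fully connected layers hold the large majority of the parameters, which fits the abstract's choice to apply dropout there.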
