Going deeper with convolutions
Why this mattered
“Going deeper with convolutions” mattered because it showed that better vision models need not come from brute-force scaling alone. The Inception design made greater depth and width practical under a fixed compute budget by running parallel convolutional paths and by using inexpensive 1x1 convolutions to reduce channel dimensionality ahead of the costlier 3x3 and 5x5 convolutions. This turned architectural efficiency into a central object of research: the paper’s contribution was not merely a high-scoring ImageNet entry, but a template for extracting more representational power per operation.
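To make the saving from 1x1 dimensionality reduction concrete, the sketch below counts multiply-adds for a 5x5 convolution applied directly versus after a 1x1 reduction. The feature-map size and channel widths are illustrative assumptions chosen for this example, and the helper `conv_mult_adds` is a hypothetical name, not something defined in the paper. Biases and activations are ignored.

```python
def conv_mult_adds(height, width, kernel, c_in, c_out):
    """Multiply-adds for a stride-1, 'same'-padded kernel x kernel convolution."""
    return height * width * kernel * kernel * c_in * c_out


# Illustrative (assumed) sizes: a 28x28 feature map with 192 input channels,
# reduced to 16 channels before a 5x5 convolution that outputs 32 channels.
H = W = 28
C_IN, C_REDUCED, C_OUT = 192, 16, 32

# Option A: apply the 5x5 convolution directly to all 192 channels.
direct = conv_mult_adds(H, W, 5, C_IN, C_OUT)

# Option B: 1x1 reduction to 16 channels, then the 5x5 convolution.
reduced = (conv_mult_adds(H, W, 1, C_IN, C_REDUCED)
           + conv_mult_adds(H, W, 5, C_REDUCED, C_OUT))

print(f"direct 5x5:        {direct:,} multiply-adds")
print(f"1x1 then 5x5:      {reduced:,} multiply-adds")
print(f"cost ratio:        {direct / reduced:.1f}x")
```

Under these assumed sizes the reduced path needs roughly a tenth of the multiply-adds, which is the kind of headroom that lets a design spend its budget on more layers and more branches instead.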
GoogLeNet also helped shift convolutional network design away from simple sequential stacks toward modular, multi-branch architectures. Its Inception modules encoded the idea that visual features should be processed at multiple spatial scales inside the same layer, while still remaining computationally tractable. That made it newly plausible to train much deeper models for large-scale recognition and detection without requiring a proportional increase in memory and computation.
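The multi-branch idea can be sketched as a forward pass in NumPy: four parallel paths (1x1, 1x1 then 3x3, 1x1 then 5x5, and 3x3 max-pool then 1x1) whose outputs are concatenated along the channel axis. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the random weights, the omission of biases, and the branch widths are all choices made for this example.

```python
import numpy as np


def conv2d_same(x, w):
    """Stride-1, zero-padded convolution. x: (H, W, C_in); w: (k, k, C_in, C_out), k odd."""
    k = w.shape[0]
    pad = k // 2
    H, W, _ = x.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((H, W, w.shape[3]))
    for i in range(k):
        for j in range(k):
            # Each kernel offset contributes a shifted view times a (C_in, C_out) matrix.
            out += xp[i:i + H, j:j + W, :] @ w[i, j]
    return out


def max_pool3_same(x):
    """3x3 max pooling with stride 1; padding with -inf keeps the spatial size."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)), constant_values=-np.inf)
    shifted = [xp[i:i + H, j:j + W, :] for i in range(3) for j in range(3)]
    return np.max(shifted, axis=0)


def relu(x):
    return np.maximum(x, 0.0)


def inception_module(x, w):
    """Run four parallel branches and concatenate them on the channel axis."""
    b1 = relu(conv2d_same(x, w["b1"]))                                   # 1x1
    b2 = relu(conv2d_same(relu(conv2d_same(x, w["b2_reduce"])), w["b2"]))  # 1x1 -> 3x3
    b3 = relu(conv2d_same(relu(conv2d_same(x, w["b3_reduce"])), w["b3"]))  # 1x1 -> 5x5
    b4 = relu(conv2d_same(max_pool3_same(x), w["b4_proj"]))              # pool -> 1x1
    return np.concatenate([b1, b2, b3, b4], axis=-1)


rng = np.random.default_rng(0)


def make_weights(k, c_in, c_out):
    # Random weights for illustration only; a real network would learn these.
    return rng.standard_normal((k, k, c_in, c_out)) * 0.1


C_IN = 192  # assumed input width
w = {
    "b1": make_weights(1, C_IN, 64),
    "b2_reduce": make_weights(1, C_IN, 96),
    "b2": make_weights(3, 96, 128),
    "b3_reduce": make_weights(1, C_IN, 16),
    "b3": make_weights(5, 16, 32),
    "b4_proj": make_weights(1, C_IN, 32),
}

x = rng.standard_normal((8, 8, C_IN))  # small spatial size to keep the demo fast
out = inception_module(x, w)
print(out.shape)  # spatial size preserved; channels = 64 + 128 + 32 + 32
```

Because every branch preserves the spatial size, the outputs can be stacked channel-wise, and the next module sees features computed at several receptive-field sizes at once.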
The paper sits between AlexNet’s demonstration that deep CNNs could dominate ImageNet and later breakthroughs such as residual networks, densely connected networks, and modern efficient architectures. Inception did not solve the optimization problems of very deep networks as directly as ResNet later would, but it established that hand-designed architectural topology could be a decisive source of progress. Its influence is visible in subsequent work on factorized convolutions, bottleneck layers, neural architecture search, and the broader practice of treating model architecture as a primary lever for scaling accuracy under real hardware constraints.
Abstract
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
Related
- cite → Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation — GoogLeNet builds on R-CNN's demonstration that convolutional feature hierarchies improve object detection and segmentation.
- cite → Gradient-based learning applied to document recognition — GoogLeNet extends the convolutional neural network paradigm established by LeNet for gradient-based visual recognition.
- cite → Backpropagation Applied to Handwritten Zip Code Recognition — GoogLeNet relies on the backpropagation-trained convolutional networks introduced for handwritten zip-code recognition.
- cite → ImageNet classification with deep convolutional neural networks — GoogLeNet follows AlexNet in using deep convolutional networks for ImageNet-scale visual classification.
- cite ← Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network — SRGAN cites GoogLeNet as evidence that deeper convolutional networks can learn strong hierarchical image representations.
- cite ← Dermatologist-level classification of skin cancer with deep neural networks — The skin-cancer classifier uses the Inception convolutional architecture introduced by Going Deeper with Convolutions.
- cite ← Fully Convolutional Networks for Semantic Segmentation — FCN cites GoogLeNet as evidence that deep convolutional architectures can learn strong hierarchical visual features.
- cite ← Learning Deep Features for Discriminative Localization — Class activation mapping is demonstrated on GoogLeNet-style convolutional networks with global average pooling.
- cite ← Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification — The rectifier-network paper compares its very deep PReLU networks against GoogLeNet's Inception architecture on ImageNet classification.
- cite ← Deep Residual Learning for Image Recognition — ResNet contrasts residual learning with GoogLeNet's Inception modules as another route to deeper convolutional networks.
- cite ← TensorFlow: a system for large-scale machine learning — TensorFlow cites Inception networks from GoogLeNet as a large convolutional model class implemented and scaled with the system.
- cite ← Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks — Faster R-CNN cites GoogLeNet/Inception as a deep convolutional backbone architecture for object detection experiments.
- cite ← ImageNet classification with deep convolutional neural networks — AlexNet's convolutional-network success is a direct precursor to GoogLeNet's deeper Inception architecture for ImageNet classification.
- cite ← Squeeze-and-Excitation Networks — Squeeze-and-Excitation Networks cite GoogLeNet as an influential convolutional architecture showing that network modules can improve deep visual recognition.
- cite ← Deep Learning with Differential Privacy — Deep Learning with Differential Privacy evaluates private optimization on deep convolutional architectures related to GoogLeNet.
- enables ← Gradient-based learning applied to document recognition — LeCun's convolutional document-recognition system enables GoogLeNet by demonstrating end-to-end gradient-trained convolutional networks for visual recognition.
- enables ← Backpropagation Applied to Handwritten Zip Code Recognition — Backpropagation for zip-code recognition enables GoogLeNet by proving that multilayer convolutional networks can be trained effectively for image classification.