
Going deeper with convolutions

Why this mattered

“Going deeper with convolutions” mattered because it showed that better vision models need not come from brute-force scaling alone. The Inception design made greater depth and width practical under a fixed compute budget by running parallel convolutional paths and by using inexpensive 1×1 convolutions to reduce the number of channels before the costlier 3×3 and 5×5 convolutions. This made architectural efficiency a central object of research: the paper’s contribution was not merely a high-scoring ImageNet entry, but a template for extracting more representational power per operation.
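To make the savings from 1×1 dimensionality reduction concrete, here is a minimal sketch that counts multiply-adds for a 5×5 convolution applied directly versus one preceded by a 1×1 reduction. The helper name `conv_mult_adds` and the layer sizes (a 28×28 feature map, 192 input channels reduced to 16, 32 output channels) are assumptions chosen for illustration; they are not stated in the text above.

```python
def conv_mult_adds(height, width, kernel, in_channels, out_channels):
    """Multiply-adds for a stride-1, same-padded convolution (bias ignored)."""
    return height * width * kernel * kernel * in_channels * out_channels


# Illustrative sizes (assumed for this sketch, not given in the text).
H = W = 28
C_IN, C_REDUCED, C_OUT = 192, 16, 32

# 5x5 convolution applied directly to all input channels.
direct = conv_mult_adds(H, W, 5, C_IN, C_OUT)

# 1x1 reduction to fewer channels, followed by the same 5x5 convolution.
reduced = (conv_mult_adds(H, W, 1, C_IN, C_REDUCED)
           + conv_mult_adds(H, W, 5, C_REDUCED, C_OUT))

print(f"direct 5x5:        {direct:,} multiply-adds")
print(f"1x1 then 5x5:      {reduced:,} multiply-adds")
print(f"cost ratio:        {direct / reduced:.2f}x")
```

Under these assumed sizes the reduced path needs roughly a tenth of the multiply-adds, which is the sense in which the 1×1 layers free up budget for extra depth and width.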

GoogLeNet, the 22-layer Inception network submitted to ILSVRC14, also helped shift convolutional network design away from simple sequential stacks toward modular, multi-branch architectures. Its Inception modules encode the idea that visual features should be processed at several spatial scales within the same layer while remaining computationally tractable: each module applies filters of different sizes in parallel and concatenates their outputs. This made it plausible to train much deeper models for large-scale recognition and detection without a proportional increase in memory and computation.
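The multi-branch structure can be sketched as simple bookkeeping: four parallel branches (1×1; 1×1 then 3×3; 1×1 then 5×5; 3×3 max-pool then 1×1) whose outputs are concatenated along the channel axis. The class `InceptionSpec`, its field names, and the example sizes are hypothetical, introduced only for this illustration; the sketch tracks channel counts and multiply-adds and is not a trainable implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InceptionSpec:
    """Channel sizes for one four-branch module (hypothetical field names)."""
    in_channels: int
    n1x1: int          # branch 1: 1x1 conv
    n3x3_reduce: int   # branch 2: 1x1 reduction ...
    n3x3: int          # ... followed by 3x3 conv
    n5x5_reduce: int   # branch 3: 1x1 reduction ...
    n5x5: int          # ... followed by 5x5 conv
    pool_proj: int     # branch 4: 3x3 max-pool, then 1x1 projection

    def out_channels(self) -> int:
        # Branch outputs share spatial size and are concatenated channel-wise.
        return self.n1x1 + self.n3x3 + self.n5x5 + self.pool_proj

    def mult_adds(self, height: int, width: int) -> int:
        # Convolution multiply-adds only; pooling and biases are ignored.
        c = self.in_channels
        per_position = (
            c * self.n1x1
            + c * self.n3x3_reduce + 9 * self.n3x3_reduce * self.n3x3
            + c * self.n5x5_reduce + 25 * self.n5x5_reduce * self.n5x5
            + c * self.pool_proj
        )
        return height * width * per_position


# Example sizes assumed for illustration.
module = InceptionSpec(in_channels=192, n1x1=64, n3x3_reduce=96, n3x3=128,
                       n5x5_reduce=16, n5x5=32, pool_proj=32)
print("output channels:", module.out_channels())
print(f"multiply-adds on 28x28 input: {module.mult_adds(28, 28):,}")
```

Because every branch preserves the spatial size, the module's output can feed directly into another module whose `in_channels` equals this module's `out_channels()`, which is how such blocks stack into a deep network.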

The paper sits between AlexNet’s demonstration that deep CNNs could dominate ImageNet and later breakthroughs such as residual networks, densely connected networks, and modern efficient architectures. Inception did not solve the optimization problems of very deep networks as directly as ResNet later would, but it established that hand-designed architectural topology could be a decisive source of progress. Its influence is visible in subsequent work on factorized convolutions, bottleneck layers, neural architecture search, and the broader practice of treating model architecture as a primary lever for scaling accuracy under real hardware constraints.

Abstract

We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
