Squeeze-and-Excitation Networks
Why this mattered
The Squeeze-and-Excitation Networks paper mattered because it shifted attention in CNN design away from making spatial processing deeper, wider, or more carefully factorized, and toward explicitly modeling relationships between feature channels. The SE block made “channel attention” a simple architectural primitive: squeeze global spatial information into a per-channel descriptor, then excite or suppress each channel through learned gates. This showed that a convolutional network did not have to treat all channels produced by a layer as equally useful for every input; it could recalibrate them dynamically, per input, with only modest extra computation.
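The squeeze, excite, and rescale steps described above can be written compactly. The notation here is an assumption chosen for illustration (it follows common usage for SE blocks rather than anything stated on this page): $\mathbf{u}_c$ is the $c$-th channel of an $H \times W$ feature map with $C$ channels, $\delta$ is ReLU, $\sigma$ is the sigmoid, and $r$ is a reduction ratio that sets the bottleneck width.

```latex
\begin{aligned}
z_c &= \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)
  && \text{(squeeze: global average pooling)} \\
\mathbf{s} &= \sigma\!\left(\mathbf{W}_2 \, \delta(\mathbf{W}_1 \mathbf{z})\right),
  \quad \mathbf{W}_1 \in \mathbb{R}^{\frac{C}{r} \times C},\;
        \mathbf{W}_2 \in \mathbb{R}^{C \times \frac{C}{r}}
  && \text{(excitation: learned gates)} \\
\tilde{\mathbf{x}}_c &= s_c \, \mathbf{u}_c
  && \text{(scale: channel-wise recalibration)}
\end{aligned}
```

Because each gate $s_c$ lies in $(0, 1)$ and depends on the pooled statistics of all channels, a channel's response can be attenuated or largely preserved depending on the input. Ignoring biases, the two matrices add $2C^2/r$ parameters per block under these assumptions.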
The practical impact was unusually immediate. SE blocks could be inserted into existing backbones such as ResNet and Inception-style models, improving accuracy without requiring a wholesale redesign of the network. Their role in the winning ILSVRC 2017 classification system, with a reported 2.251% top-5 error, made the point visible at ImageNet scale: small, modular attention mechanisms could deliver state-of-the-art gains in mature CNN architectures.
Historically, SE helped normalize attention as a general-purpose component of vision models before transformer-based vision systems became dominant. Later architectures and variants extended the same principle in many directions: channel attention, spatial attention, efficient attention modules, and hybrid CNN-attention blocks. Its lasting contribution was not just a better ImageNet model, but a reusable idea: representation quality could be improved by learning which internal features to emphasize, making adaptive feature selection a standard part of modern visual architecture design.
Abstract
The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission which won first place and reduced the top-5 error to 2.251 percent, surpassing the winning entry of 2016 by a relative improvement of ∼25 percent. Models and code are available at https://github.com/hujie-frank/SENet.
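To make the channel recalibration described in the abstract concrete, here is a minimal sketch of an SE block's forward pass. It is not the authors' implementation (their code is at the repository linked above). It assumes the commonly described layout of global average pooling followed by a two-layer bottleneck with ReLU and sigmoid, and it uses plain Python lists rather than a tensor library so that it has no dependencies. The names `se_block`, `matvec`, and `random_matrix`, the toy sizes, and the reduction ratio `r` are all hypothetical choices made for illustration.

```python
import math
import random


def relu(x):
    return x if x > 0.0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v, b):
    """Return W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]


def se_block(feature_map, W1, b1, W2, b2):
    """Recalibrate a feature map shaped [C][H][W] (illustrative sketch).

    W1 has shape (C // r) x C and W2 has shape C x (C // r), where r is
    an assumed reduction ratio. Returns the rescaled feature map and the
    per-channel gates.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
         for ch in feature_map]
    # Excitation: bottleneck FC -> ReLU -> FC -> sigmoid gate per channel.
    hidden = [relu(h) for h in matvec(W1, z, b1)]
    gates = [sigmoid(g) for g in matvec(W2, hidden, b2)]
    # Scale: multiply every spatial position of channel c by its gate.
    scaled = [[[g * v for v in row] for row in ch]
              for ch, g in zip(feature_map, gates)]
    return scaled, gates


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
            for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    C, H, W, r = 8, 4, 4, 4  # toy sizes, chosen only for the demo
    x = [[[rng.random() for _ in range(W)] for _ in range(H)]
         for _ in range(C)]
    W1, b1 = random_matrix(C // r, C, rng), [0.0] * (C // r)
    W2, b2 = random_matrix(C, C // r, rng), [0.0] * C
    y, gates = se_block(x, W1, b1, W2, b2)
    print("output shape:", len(y), len(y[0]), len(y[0][0]))
    print("gates:", [round(g, 3) for g in gates])
```

The output has the same shape as the input, which is why a block of this kind can be placed after an existing convolutional stage without changing the surrounding architecture. In a trained network the weights would be learned jointly with the rest of the model; the random weights here only exercise the data flow.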
Related
- cite → Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks — Squeeze-and-Excitation Networks cite Faster R-CNN as a downstream object-detection framework where SE feature recalibration can improve visual recognition.
- cite → Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification — Squeeze-and-Excitation Networks build on PReLU/rectifier advances as part of the deep CNN design space for improving ImageNet accuracy.
- cite → Long Short-Term Memory — Squeeze-and-Excitation Networks relate to LSTM through the shared gating mechanism that adaptively modulates information flow.
- cite → Going deeper with convolutions — Squeeze-and-Excitation Networks cite GoogLeNet as an influential convolutional architecture showing that network modules can improve deep visual recognition.
- cite → ImageNet Large Scale Visual Recognition Challenge — Squeeze-and-Excitation Networks use the ImageNet Large Scale Visual Recognition Challenge as the standard benchmark for classification performance.
- cite → A model of saliency-based visual attention for rapid scene analysis — Squeeze-and-Excitation Networks connect to saliency-based visual attention through the idea of selectively emphasizing informative visual features.
- cite → ImageNet classification with deep convolutional neural networks — Squeeze-and-Excitation Networks cite AlexNet as the landmark deep CNN that established large-scale ImageNet classification as a core benchmark.
- cite → Deep Residual Learning for Image Recognition — Squeeze-and-Excitation Networks insert channel-wise feature recalibration blocks into residual architectures introduced by ResNet.
- enables ← Long Short-Term Memory — LSTM introduced gated multiplicative control of information flow, a mechanism echoed by squeeze-and-excitation blocks that gate convolutional feature channels.
- enables ← A model of saliency-based visual attention for rapid scene analysis — Itti, Koch, and Niebur's saliency model formalized attention as selective feature weighting, enabling the channel-attention idea used in squeeze-and-excitation networks.