Show and tell: A neural image caption generator
Why this mattered
“Show and Tell” mattered because it helped make image captioning a canonical example of end-to-end neural multimodal learning. Rather than treating vision and language as separate pipelines of object detection, attribute prediction, template filling, and sentence ranking, the paper framed caption generation as conditional sequence modeling: encode an image with a convolutional neural network and decode a sentence with a recurrent language model. This was a direct transfer of ideas from neural machine translation into computer vision, and it showed that a single trainable model could learn both visual grounding and fluent description from paired image-caption data.
The shift was not merely architectural. The paper demonstrated that natural-language output could become a measurable target for visual understanding systems, reporting large BLEU-1 gains on established captioning benchmarks (Pascal from 25 to 59, Flickr30k from 56 to 66, SBU from 19 to 28) and a then state-of-the-art BLEU-4 of 27.7 on the newly released COCO dataset. Afterward, image captioning became a central testbed for multimodal representation learning: attention mechanisms, bottom-up region features, transformer-based vision-language pretraining, visual question answering, text-to-image retrieval, and later general-purpose multimodal assistants all inherited part of this framing. The important precedent was that images could be embedded into the same generative sequence-modeling paradigm that had begun to transform language tasks.
In retrospect, the model was limited: it relied on global image features and recurrent decoding, and it was evaluated with n-gram metrics such as BLEU that only imperfectly capture semantic correctness. But its influence came from showing a practical route from recognition to description. It made it newly plausible that systems could move beyond labeling what is in an image toward producing open-ended linguistic accounts of visual scenes, a step that became foundational for the broader vision-language models that followed.
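To illustrate why n-gram metrics only imperfectly track meaning, here is a minimal Python sketch of clipped unigram precision, the core ingredient of BLEU-1. It is a simplification, not the evaluation code used in the paper: it assumes a single reference caption, whitespace tokenization, and no brevity penalty. The function name `clipped_unigram_precision` and the example captions are hypothetical, chosen only for illustration.

```python
from collections import Counter


def clipped_unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words found in the reference.

    Each reference word can be matched at most as many times as it
    occurs in the reference (the "clipping" used in BLEU).
    Simplifying assumptions: one reference, whitespace tokens,
    no brevity penalty.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(
        min(count, ref_counts[word]) for word, count in cand_counts.items()
    )
    return matched / total


# Hypothetical captions, not taken from any dataset.
reference = "a dog runs on the beach"
scrambled = "a beach runs on the dog"            # wrong meaning, same words
paraphrase = "a puppy sprints along the shore"   # right meaning, new words

scrambled_score = clipped_unigram_precision(scrambled, reference)
paraphrase_score = clipped_unigram_precision(paraphrase, reference)

print(f"scrambled:  {scrambled_score:.2f}")
print(f"paraphrase: {paraphrase_score:.2f}")
```

Under these assumptions the scrambled caption scores 1.00 while the faithful paraphrase scores about 0.33, because only "a" and "the" overlap with the reference. Higher-order variants such as BLEU-4 penalize the scrambled word order, but they still cannot credit a correct paraphrase that shares few words with the reference.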
Abstract
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state-of-the-art.
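As a reading aid, the maximum-likelihood training objective mentioned in the abstract can be written out with the chain rule. The notation here is an assumption made for illustration: $I$ is an image, $S = (S_0, \ldots, S_N)$ is its caption as a sequence of words, and $\theta$ stands for all model parameters.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta),
\qquad
\log p(S \mid I; \theta) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1}; \theta)
```

The sum over $(I, S)$ runs over the paired training examples. Each factor on the right is the probability of the next word given the image and the words generated so far, which is the quantity a recurrent decoder conditioned on CNN image features is meant to model.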
Related
- cite → Long Short-Term Memory — Show and Tell uses an LSTM decoder to generate image-caption word sequences from CNN image features.
- cite → BLEU — Show and Tell evaluates generated captions using the BLEU machine-translation precision metric.
- cite → ImageNet Large Scale Visual Recognition Challenge — Show and Tell uses ImageNet-trained convolutional networks as the visual feature extractor for image captioning.
- cite ← Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization — Grad-CAM applies gradient-based visual localization to image-captioning models such as Show-and-Tell to explain generated captions.
- enables ← Long Short-Term Memory — LSTM supplied the recurrent sequence decoder that Show and Tell used to generate captions from image features.
- enables ← BLEU — BLEU provided an automatic n-gram translation metric that Show and Tell adapted to evaluate generated image captions.