Effective Approaches to Attention-based Neural Machine Translation
Why this mattered
Luong, Pham, and Manning’s paper mattered because it helped turn attention from a promising add-on for neural machine translation into an engineering pattern that could be systematically designed, compared, and improved. Earlier sequence-to-sequence NMT compressed the source sentence into a single fixed-length vector, which made long sentences difficult to translate; attention relaxed that bottleneck by letting the decoder consult source-side hidden states at every generation step. This paper showed that several simple scoring functions and alignment strategies could yield large gains, and that attention could be made either global, attending over the whole source sentence, or local, attending over a window around a predicted source position. In doing so, it framed attention not as a single trick but as a family of architectures with measurable tradeoffs in accuracy and efficiency.
The immediate practical consequence was that NMT became more competitive with phrase-based and hybrid statistical systems on major WMT benchmarks. The reported 5.0 BLEU improvement for local attention over a strong non-attentional neural baseline, and the WMT 2015 English-to-German state-of-the-art ensemble result, gave evidence that neural systems could win through better alignment and conditioning rather than through external reranking alone. The paper also made attention easier to analyze: learned alignments could be visualized and compared with linguistic intuitions about translation, helping researchers understand how neural models handled reordering, fertility, and long-distance dependencies.
Historically, this work sits on the path from recurrent encoder-decoder translation to the later Transformer paradigm. It did not introduce attention itself, and it still used recurrent networks, but it clarified design choices that became standard vocabulary: dot, general, and concat-style attention scores; global versus local context; and attention as the mechanism by which sequence models dynamically retrieve relevant information. Subsequent breakthroughs, especially self-attention and the Transformer, generalized this idea further: instead of using attention mainly between decoder states and source annotations, attention became the central computation for representing sequences themselves.
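The scoring functions and the global/local distinction named above can be made concrete with a small sketch. The formulas in the comments are the commonly cited forms of the dot, general, and concat scores; they are not spelled out on this page, so treat them as an assumption to check against the paper. All function names are hypothetical, and the local variant is deliberately simplified: it takes the window center as an argument and omits the paper's position prediction and Gaussian weighting.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

# Three score functions between a decoder state h_t and a source state h_s.
def score_dot(h_t, h_s):
    return dot(h_t, h_s)                      # h_t^T h_s

def score_general(h_t, h_s, W_a):
    return dot(h_t, matvec(W_a, h_s))         # h_t^T W_a h_s

def score_concat(h_t, h_s, W_a, v_a):
    # v_a^T tanh(W_a [h_t; h_s]); list addition concatenates the two states
    hidden = [math.tanh(x) for x in matvec(W_a, h_t + h_s)]
    return dot(v_a, hidden)

def softmax(xs):
    m = max(xs)                               # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def global_attention(h_t, source_states, score=score_dot):
    """Attend over every source state; return (weights, context vector)."""
    weights = softmax([score(h_t, h_s) for h_s in source_states])
    dim = len(source_states[0])
    context = [sum(w * h_s[i] for w, h_s in zip(weights, source_states))
               for i in range(dim)]
    return weights, context

def local_attention(h_t, source_states, center, half_width, score=score_dot):
    """Simplified local attention: softmax restricted to a fixed window.

    Assumption: `center` is supplied by the caller rather than predicted,
    and no Gaussian weighting is applied inside the window.
    """
    lo = max(0, center - half_width)
    hi = min(len(source_states), center + half_width + 1)
    weights, context = global_attention(h_t, source_states[lo:hi], score)
    full = [0.0] * len(source_states)         # zero weight outside the window
    full[lo:hi] = weights
    return full, context

if __name__ == "__main__":
    h_t = [1.0, 0.0]
    source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(global_attention(h_t, source))
    print(local_attention(h_t, source, center=1, half_width=0))
```

The tradeoff the sketch illustrates is cost: the global variant scores every source position at each decoding step, while the local variant scores at most `2 * half_width + 1` positions regardless of sentence length.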
Abstract
An attentional mechanism has lately been used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation. However, there has been little work exploring useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. We demonstrate the effectiveness of both approaches on the WMT translation tasks between English and German in both directions. With local attention, we achieve a significant gain of 5.0 BLEU points over non-attentional systems that already incorporate known techniques such as dropout. Our ensemble model using different attention architectures yields a new state-of-the-art result in the WMT'15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.
Related
- cite → BLEU — The attention-based NMT paper reports translation quality using the BLEU automatic machine-translation evaluation metric.
- enables ← BLEU — BLEU provided the automatic machine-translation metric used to evaluate the attention-based neural translation models.