SQuAD: 100,000+ Questions for Machine Comprehension of Text

Why this mattered

SQuAD mattered because it turned machine reading comprehension from a small, heterogeneous evaluation niche into a large-scale, standardized benchmark with a simple operational target: given a passage, extract the span that answers a natural-language question. Earlier question-answering datasets often relied on cloze-style prediction, synthetic questions, limited domains, or answer selection; SQuAD’s crowdsourced questions over Wikipedia passages made it practical to train and compare supervised systems on human-written questions at scale. Its design also made progress measurable with relatively transparent metrics, especially exact match and token-level F1, while the gap between the 2016 logistic-regression baseline (51.0% F1) and human performance (86.8% F1) framed reading comprehension as both tractable and unsolved.
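To make the two metrics concrete, here is a minimal Python sketch of exact match and token-level F1 for a single prediction. It is an illustration, not the official evaluation script: the function names are hypothetical, and the normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) are an assumption modeled on how SQuAD-style evaluation is commonly implemented.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: this mirrors the usual SQuAD-style normalization.
    """
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction: str, golds: list[str]) -> float:
    """Score against several reference answers and keep the maximum."""
    return max(metric(prediction, gold) for gold in golds)
```

For example, predicting "Denver Broncos team" against the reference "Denver Broncos" gives an exact match of 0 but a token F1 of 0.8 (precision 2/3, recall 1), which is why F1 is the more forgiving of the two metrics for near-miss spans.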

The paper helped shift NLP toward dataset-driven progress on contextual understanding. Because every answer was grounded in a passage span, researchers could build neural models that learned alignment between question words, passage words, and local context without needing an external knowledge base. This created fertile ground for attention-based architectures, bidirectional encoders, and pretrained language models: many influential systems in the years after SQuAD, including BiDAF-style span predictors and later BERT-like models, used SQuAD as a central proving ground for whether representations could support precise, context-sensitive extraction rather than only sentence classification or language modeling.

Its longer-term importance is also visible in how quickly the benchmark was surpassed and then revised. SQuAD 1.1 exposed that large annotated benchmarks could accelerate progress dramatically, but also that high scores on extractive span selection did not equal general comprehension. The later SQuAD 2.0 extension, which added unanswerable questions, reflected this lesson by forcing systems to decide when a passage did not contain an answer. In that sense, SQuAD was paradigm-shifting not because it solved machine comprehension, but because it made the problem concrete, scalable, and competitive enough to drive the next wave of neural and pretrained-language-model breakthroughs.
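The answer-or-abstain decision that unanswerable questions introduce can be sketched as a simple score comparison. This is one possible approach under stated assumptions, not a method described in the paper: the function name, the two scores, and the threshold are hypothetical, and a real system would produce the scores from a trained model and tune the threshold on development data. The empty string stands for "no answer in the passage".

```python
def predict_or_abstain(
    best_span_text: str,
    best_span_score: float,
    no_answer_score: float,
    threshold: float = 0.0,
) -> str:
    """Return the best span, or "" when abstaining looks better.

    Assumption: both scores come from the same model and are comparable;
    `threshold` is a hypothetical tunable margin.
    """
    # Abstain when the no-answer score beats the best span by more
    # than the chosen margin.
    if no_answer_score - best_span_score > threshold:
        return ""
    return best_span_text
```

Raising the threshold makes the system answer more often, and lowering it makes the system abstain more often, so the value trades off missed answers against spurious ones.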

Abstract

We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at https://stanford-qa.com
