A Coefficient of Agreement for Nominal Scales

Why this mattered

Cohen’s paper mattered because it turned a common but weak practice, reporting simple percent agreement between judges, into a chance-corrected measurement problem for nominal categories. The coefficient, later called Cohen’s kappa, estimates how much two raters agree beyond the agreement expected by chance given each rater’s marginal category frequencies. That made reliability assessable for diagnoses, content codes, behavioral observations, and other classifications where “correlation” was not the right tool. The shift was conceptual as much as technical: categorical judgment could now be audited with a single interpretable statistic rather than treated as informal consensus or raw concordance. See Cohen’s original article: SAGE.
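To make the chance correction concrete, here is a minimal Python sketch of the standard two-rater computation, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected from the raters' marginal proportions. The function name `cohen_kappa` and the sample labels are hypothetical and chosen only for illustration; they are not taken from Cohen's article.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items (illustrative helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: for each category, multiply the two raters' marginal
    # proportions, then sum over categories.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings of ten items by two judges.
rater_a = ["yes", "yes", "yes", "no", "no", "no", "yes", "no", "yes", "no"]
rater_b = ["yes", "yes", "no", "no", "no", "yes", "yes", "no", "yes", "no"]

kappa = cohen_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the judges agree on 8 of 10 items (p_o = 0.8), but each uses "yes" and "no" half the time, so half of that agreement is expected by chance (p_e = 0.5) and kappa comes out at about 0.6.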

That framework became a foundation for later agreement statistics. Cohen himself extended the idea to weighted kappa for ordered or partially creditable disagreements in 1968, and Fleiss generalized the chance-corrected agreement program to many raters in 1971. The same lineage underlies much later work in clinical reliability, survey coding, content analysis, remote sensing accuracy assessment, and machine-learning annotation quality. Kappa also became important because its limitations were visible: dependence on marginal distributions, prevalence effects, and the distinction between agreement and accuracy all became topics of methodological research. In that sense, the paper did not merely introduce a statistic; it helped define inter-rater reliability as a quantitative prerequisite for trustworthy categorical data.
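The dependence on marginal distributions can be illustrated with a small sketch. The two 2x2 agreement tables below are invented for illustration, and `kappa_from_table` is a hypothetical helper name: both tables show 90% raw agreement, yet kappa drops sharply when one category dominates.

```python
def kappa_from_table(table):
    """Cohen's kappa from a square table of counts (rows: rater A, columns: rater B)."""
    n = sum(sum(row) for row in table)
    # Observed agreement is the share of counts on the diagonal.
    p_o = sum(table[i][i] for i in range(len(table))) / n
    # Chance agreement uses the row and column totals (the raters' marginals).
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Invented counts: both tables have 90 of 100 items on the diagonal.
balanced = [[45, 5],
            [5, 45]]   # each category used about half the time
skewed = [[85, 5],
          [5, 5]]      # one category used about 90% of the time

balanced_kappa = kappa_from_table(balanced)
skewed_kappa = kappa_from_table(skewed)
print(f"balanced: {balanced_kappa:.2f}, skewed: {skewed_kappa:.2f}")
```

With balanced marginals, chance agreement is 0.5 and kappa is about 0.80. With skewed marginals, chance agreement rises to 0.82 and kappa falls to about 0.44, even though the raters agree on the same share of items.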

Sources