Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach

Why this mattered

DeLong, DeLong, and Clarke-Pearson made it practical to compare ROC curves formally. Before this paper, the area under the ROC curve (AUC) was already understood as a useful summary of diagnostic discrimination, but comparing AUCs from tests applied to the same patients was statistically awkward because the resulting ROC curves are correlated. The paper’s central contribution was a distribution-free way to estimate the covariance matrix of those correlated AUCs using generalized U-statistics. That turned a common clinical question, “is this new diagnostic test actually better than the old one on the same subjects?”, into a standard inferential problem rather than an ad hoc comparison of point estimates.
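
As a rough guide to what "estimating the covariance matrix with generalized U-statistics" means in practice, the estimator can be written out as follows. The notation is chosen here for illustration and is not quoted from the text above: m diseased subjects with scores X_i^r on test r, n non-diseased subjects with scores Y_j^r, and higher scores assumed to indicate disease.

```latex
% Empirical AUC of test r as a Mann-Whitney U-statistic
\[
\hat{\theta}^{r} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\psi\!\left(X_i^{r}, Y_j^{r}\right),
\qquad
\psi(X, Y) =
\begin{cases}
1 & \text{if } Y < X,\\
\tfrac{1}{2} & \text{if } Y = X,\\
0 & \text{if } Y > X.
\end{cases}
\]

% Structural components: one value per diseased and per non-diseased subject
\[
V_{10}^{r}(X_i) = \frac{1}{n}\sum_{j=1}^{n}\psi\!\left(X_i^{r}, Y_j^{r}\right),
\qquad
V_{01}^{r}(Y_j) = \frac{1}{m}\sum_{i=1}^{m}\psi\!\left(X_i^{r}, Y_j^{r}\right)
\]

% Entries of the two component covariance matrices S_{10} and S_{01}
\[
s_{10}^{r,s} = \frac{1}{m-1}\sum_{i=1}^{m}
\bigl[V_{10}^{r}(X_i)-\hat{\theta}^{r}\bigr]\bigl[V_{10}^{s}(X_i)-\hat{\theta}^{s}\bigr],
\qquad
s_{01}^{r,s} = \frac{1}{n-1}\sum_{j=1}^{n}
\bigl[V_{01}^{r}(Y_j)-\hat{\theta}^{r}\bigr]\bigl[V_{01}^{s}(Y_j)-\hat{\theta}^{s}\bigr]
\]

% Estimated covariance matrix of the AUC vector, and the two-curve test statistic
\[
S = \frac{1}{m}S_{10} + \frac{1}{n}S_{01},
\qquad
z = \frac{\hat{\theta}^{1}-\hat{\theta}^{2}}{\sqrt{s^{1,1}+s^{2,2}-2\,s^{1,2}}}
\]
```

Here s^{r,s} denotes the (r,s) entry of S. The off-diagonal term s^{1,2} is what carries the correlation induced by measuring both tests on the same subjects. Under the null hypothesis of equal AUCs, z is referred to a standard normal distribution.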

The paradigm shift was methodological portability. The approach did not require assuming a parametric form for test-score distributions, and it applied to two or more ROC curves measured on the same cases and controls. This made ROC comparison usable across medicine, epidemiology, psychology, machine learning evaluation, and later biomarker studies, where paired designs are common and independence assumptions would be wrong. After this paper, AUC was not just a descriptive performance number; it became a quantity that could be compared with valid standard errors, confidence intervals, and hypothesis tests in correlated settings.

Its influence is visible in the way later diagnostic and predictive modeling studies treat ROC analysis as routine statistical infrastructure. Modern software implementations of “DeLong’s test” became a default tool for comparing classifiers, risk scores, imaging markers, genomic signatures, and clinical prediction models. Later work in biomarker discovery and machine-learning model evaluation relies on exactly this kind of paired comparison: showing not merely that a new model has a higher AUC, but that the improvement is statistically supportable under the dependence structure created by evaluating models on the same individuals.

Abstract

Methods of evaluating and comparing the performance of diagnostic tests are of increasing importance as new tests are developed and marketed. When a test is based on an observed variable that lies on a continuous or graded scale, an assessment of the overall value of the test can be made through the use of a receiver operating characteristic (ROC) curve. The curve is constructed by varying the cutpoint used to determine which values of the observed variable will be considered abnormal and then plotting the resulting sensitivities against the corresponding false positive rates. When two or more empirical curves are constructed based on tests performed on the same individuals, statistical analysis on differences between curves must take into account the correlated nature of the data. This paper presents a nonparametric approach to the analysis of areas under correlated ROC curves, by using the theory on generalized U-statistics to generate an estimated covariance matrix.
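
The approach described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code. It assumes that higher scores indicate disease, that every test is measured on the same diseased and non-diseased subjects in the same order, and that the function names (`psi`, `structural_components`, `delong_covariance`, `delong_two_test`) are hypothetical choices made for this example. It uses the fact that the empirical AUC equals the Mann-Whitney statistic, and it runs in O(mn) time per test, which is fine for small samples.

```python
import math


def psi(x, y):
    """Mann-Whitney kernel: 1 if x > y, 0.5 on a tie, 0 otherwise."""
    if x > y:
        return 1.0
    if x == y:
        return 0.5
    return 0.0


def structural_components(pos, neg):
    """Return (AUC, V10, V01) for one test.

    pos: scores of the m diseased subjects; neg: scores of the n non-diseased.
    """
    m, n = len(pos), len(neg)
    v10 = [sum(psi(x, y) for y in neg) / n for x in pos]  # one value per diseased subject
    v01 = [sum(psi(x, y) for x in pos) / m for y in neg]  # one value per non-diseased subject
    auc = sum(v10) / m  # the mean of V10 (and of V01) is the empirical AUC
    return auc, v10, v01


def sample_cov(a, b):
    """Unbiased sample covariance of two equal-length lists."""
    k = len(a)
    mean_a, mean_b = sum(a) / k, sum(b) / k
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (k - 1)


def delong_covariance(pos_by_test, neg_by_test):
    """AUCs and their estimated covariance matrix S = S10/m + S01/n."""
    comps = [structural_components(p, q) for p, q in zip(pos_by_test, neg_by_test)]
    m, n = len(pos_by_test[0]), len(neg_by_test[0])
    k = len(comps)
    cov = [
        [
            sample_cov(comps[r][1], comps[s][1]) / m
            + sample_cov(comps[r][2], comps[s][2]) / n
            for s in range(k)
        ]
        for r in range(k)
    ]
    return [c[0] for c in comps], cov


def delong_two_test(pos_a, neg_a, pos_b, neg_b):
    """Two-sided z-test for equality of two correlated AUCs."""
    aucs, cov = delong_covariance([pos_a, pos_b], [neg_a, neg_b])
    var_diff = cov[0][0] + cov[1][1] - 2.0 * cov[0][1]
    z = (aucs[0] - aucs[1]) / math.sqrt(var_diff)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return aucs, z, p_value


# Toy data (made up for illustration): two tests scored on the same 5 diseased
# and 4 non-diseased subjects.
pos_a, neg_a = [0.9, 0.8, 0.7, 0.6, 0.4], [0.5, 0.3, 0.2, 0.1]
pos_b, neg_b = [0.6, 0.9, 0.3, 0.7, 0.5], [0.4, 0.6, 0.2, 0.8]
aucs, z, p_value = delong_two_test(pos_a, neg_a, pos_b, neg_b)
print(aucs, z, p_value)
```

For more than two curves, the same covariance matrix supports a chi-square test on any linear contrast of the AUC vector. The sketch stops at the two-curve case because that is the comparison most often reported. Note that `delong_two_test` fails with a division by zero if the two tests produce identical structural components, since the estimated variance of the difference is then zero.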
