by Bob Horton, Microsoft Data Scientist

ROC curves are commonly used to characterize the sensitivity/specificity tradeoffs for a binary classifier. Most machine learning classifiers produce real-valued scores that correspond with the strength of the prediction that a given case is positive. Turning these real-valued scores into yes or no predictions requires setting a threshold; cases with scores above the threshold are classified as positive, and cases with scores below the threshold are predicted to be negative. Different threshold values give different levels of sensitivity and specificity. A high threshold is more conservative about labelling a case as positive; this makes it less likely to produce false positive results but more likely to miss cases that are in fact positive (lower rate of true positives). A low threshold produces positive labels more liberally, so it is less specific (more false positives) but also more sensitive (more true positives). The ROC curve plots true positive rate against false positive rate, giving a picture of the whole spectrum of such tradeoffs.
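
To make the thresholding step concrete, here is a minimal sketch (the scores and threshold value are made up for illustration):

```r
# Turning real-valued classifier scores into yes/no predictions
scores <- c(0.1, 0.4, 0.35, 0.8)
threshold <- 0.5

# Cases with scores above the threshold are classified as positive
predicted <- scores > threshold
```

Sliding `threshold` down from 1 to 0 flips predictions from negative to positive one by one, tracing out the tradeoffs the ROC curve summarizes.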

There are commonly used packages to plot these curves and to compute metrics from them, but it can still be worthwhile to contemplate how these curves are calculated to try to understand better what they show us. Here I present a simple function to compute an ROC curve from a set of outcomes and associated scores.

The calculation has two steps:

- Sort the observed outcomes by their predicted scores with the highest scores first.
- Calculate the cumulative True Positive Rate (TPR) and False Positive Rate (FPR) for the ordered observed outcomes.
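
On a toy dataset (the outcomes and scores below are invented for illustration), the two steps look like this:

```r
# Toy data: five cases with known outcomes and classifier scores
labels <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
scores <- c(0.9, 0.8, 0.7, 0.4, 0.2)

# Step 1: sort the observed outcomes by score, highest first
ordered <- labels[order(scores, decreasing = TRUE)]

# Step 2: cumulative TPR and FPR along the ordered outcomes
TPR <- cumsum(ordered) / sum(ordered)
FPR <- cumsum(!ordered) / sum(!ordered)
```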

The function takes two inputs: `labels` is a boolean vector with the actual classification of each case, and `scores` is a vector of real-valued prediction scores assigned by some classifier.

Since this is a binary outcome, the labels vector is a series of TRUE and FALSE values (or ones and zeros if you prefer). You can think of this series of binary values as a sequence of instructions for turtle graphics, only in this case the turtle has a compass and takes instructions in terms of absolute plot directions (North or East) instead of relative left or right. The turtle starts at the origin (as turtles do) and it traces a path across the page dictated by the sequence of instructions. When it sees a one (TRUE) it takes a step Northward (in the positive y direction); when it sees a zero (FALSE) it takes a step to the East (the positive x direction). The sizes of the steps along each axis are scaled so that once the turtle has seen all of the ones it will be at 1.0 on the y-axis, and once it has seen all of the zeros it will be at 1.0 on the x-axis. The path across the page is determined by the order of the ones and zeros, and it always finishes in the upper right corner.
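
The turtle's path can be computed directly; here is a sketch with an invented instruction string, where the scaled cumulative sums give the turtle's position after each step:

```r
# A short instruction string for the turtle: TRUE = step North, FALSE = step East
labels <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Steps are scaled so all the ones reach 1.0 on the y-axis
# and all the zeros reach 1.0 on the x-axis
y_path <- cumsum(labels) / sum(labels)
x_path <- cumsum(!labels) / sum(!labels)

# Whatever the order of ones and zeros, the path ends at (1, 1)
```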

The progress of the turtle along the bits of the instruction string represents adjusting the classification threshold to be less and less stringent. Once the turtle has passed a bit, it has decided to classify that bit as positive. If the bit was in fact positive, it is a true positive; otherwise it is a false positive. The y-axis shows the true positive rate (TPR), which is the number of true positives encountered so far divided by the total number of actual positives. The x-axis shows the false positive rate (FPR), which is the number of false positives encountered up to that point divided by the total number of actual negatives. The vectorized implementation of this logic uses cumulative sums (the `cumsum` function) instead of walking through the values one at a time, though that is what the computer is doing at a lower level.
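
Putting the sort and the cumulative sums together, the whole calculation fits in a short function along these lines (the name `simple_roc` is just for illustration):

```r
simple_roc <- function(labels, scores){
  # Sort the observed outcomes by decreasing score,
  # then accumulate true and false positive rates
  labels <- labels[order(scores, decreasing = TRUE)]
  data.frame(TPR = cumsum(labels) / sum(labels),
             FPR = cumsum(!labels) / sum(!labels),
             labels)
}
```

Each row of the resulting data frame is one point on the ROC curve, corresponding to the threshold that just admits that row's case as a positive prediction.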

An ROC “curve” computed in this way is actually a step function. If you had very large numbers of positive and negative cases, these steps would be very small and the curve would appear smooth. (If you actually want to plot ROC curves for large numbers of cases, it could be problematic to plot every point; this is one reason that production-grade ROC functions take more than two lines of code.)
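
Base R can draw the step function directly with `type = "s"`; here is a sketch, assuming a small invented label sequence already sorted by decreasing score (the origin is prepended so the path starts at (0, 0)):

```r
# Labels assumed to be pre-sorted by decreasing classifier score
labels <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
TPR <- c(0, cumsum(labels) / sum(labels))
FPR <- c(0, cumsum(!labels) / sum(!labels))

# type = "s" draws a stair-step path through the points
plot(FPR, TPR, type = "s")
```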

As an example, we will simulate data about widgets. Here we have an input feature `x` that is linearly related to a latent outcome `y`, which also includes some randomness. The value of `y` determines whether the widget exceeds tolerance requirements; if it does, it is a bad widget.
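
A simulation along these lines might look like the following sketch; the slope, noise level, and tolerance cutoff are all arbitrary assumptions for illustration:

```r
set.seed(42)                  # for reproducibility

n <- 200
x <- rnorm(n)                 # input feature
y <- 2 * x + rnorm(n)         # latent outcome: linear in x plus randomness
bad_widget <- y > 1.5         # exceeds the tolerance cutoff -> bad widget
```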