By Syed Sadat Nazrul, Analytic Scientist
```python
import numpy as np
import matplotlib.pyplot as plt

def pdf(x, std, mean):
    # Gaussian probability density function
    const = 1.0 / np.sqrt(2 * np.pi * (std**2))
    pdf_normal_dist = const * np.exp(-((x - mean)**2) / (2.0 * (std**2)))
    return pdf_normal_dist

x = np.linspace(0, 1, num=100)
good_pdf = pdf(x, 0.1, 0.4)
bad_pdf = pdf(x, 0.1, 0.6)
```
Now that we have the distribution, let’s create a function to plot the distributions.
Now let’s use this plot_pdf function to generate the plot:
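The plot_pdf listing appears to have been lost in this copy. A minimal sketch of such a function and the call that generates the figure — the name `plot_pdf` comes from the text, while the signature `plot_pdf(good_pdf, bad_pdf, ax)`, colors, and labels are assumptions:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt

def plot_pdf(good_pdf, bad_pdf, ax):
    # Shade each class-conditional distribution on a shared axis
    x = np.linspace(0, 1, num=100)
    ax.fill_between(x, good_pdf, alpha=0.5, color="green", label="good")
    ax.fill_between(x, bad_pdf, alpha=0.5, color="red", label="bad")
    ax.set_xlim([0, 1])
    ax.set_xlabel("P(X='bad')")
    ax.set_ylabel("Counts")
    ax.set_title("Probability Distribution")
    ax.legend()

# Recompute the distributions here so the sketch runs standalone
def pdf(x, std, mean):
    const = 1.0 / np.sqrt(2 * np.pi * std**2)
    return const * np.exp(-((x - mean)**2) / (2.0 * std**2))

x = np.linspace(0, 1, num=100)
good_pdf = pdf(x, 0.1, 0.4)
bad_pdf = pdf(x, 0.1, 0.6)

fig, ax = plt.subplots(figsize=(7, 5))
plot_pdf(good_pdf, bad_pdf, ax)
```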
Now that we have the probability distributions of the two classes, we can use them to derive the ROC curve.
Deriving ROC Curve
To derive the ROC curve from the probability distribution, we need to calculate the True Positive Rate (TPR) and False Positive Rate (FPR). For a simple example, let's assume the threshold is at P(X='bad') = 0.6.
True positive is the area designated as "bad" to the right of the threshold, while false positive is the area designated as "good" to the right of the threshold. Total positive is the total area under the "bad" curve, and total negative is the total area under the "good" curve. We divide these values as shown in the diagram to derive TPR and FPR. We then derive the TPR and FPR at different threshold values to trace out the ROC curve. Using this knowledge, we create the ROC plot function:
Now let’s use this plot_roc function to generate the plot:
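The plot_roc listing is also missing from this copy. Below is a minimal sketch consistent with the derivation above — TPR and FPR are the areas to the right of each candidate threshold under the "bad" and "good" curves, divided by the respective total areas. The name `plot_roc` comes from the text; the signature, the trapezoidal AUC estimate, and the styling are assumptions:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt

def plot_roc(good_pdf, bad_pdf, ax):
    x = np.linspace(0, 1, num=100)
    # Area to the right of each threshold, normalized by total area
    TPR = np.cumsum(bad_pdf[::-1])[::-1] / np.sum(bad_pdf)
    FPR = np.cumsum(good_pdf[::-1])[::-1] / np.sum(good_pdf)
    # Trapezoidal AUC (reverse so FPR increases left to right)
    fpr, tpr = FPR[::-1], TPR[::-1]
    auc = np.sum(np.diff(fpr) * (tpr[:-1] + tpr[1:]) / 2)
    ax.plot(FPR, TPR)
    ax.plot(x, x, "--")  # random-classifier diagonal
    ax.set_xlim([0, 1])
    ax.set_ylim([0, 1])
    ax.set_xlabel("FPR")
    ax.set_ylabel("TPR")
    ax.set_title("ROC Curve [AUC=%.3f]" % auc)
    return auc

# Recompute the distributions here so the sketch runs standalone
def pdf(x, std, mean):
    const = 1.0 / np.sqrt(2 * np.pi * std**2)
    return const * np.exp(-((x - mean)**2) / (2.0 * std**2))

x = np.linspace(0, 1, num=100)
good_pdf = pdf(x, 0.1, 0.4)
bad_pdf = pdf(x, 0.1, 0.6)
fig, ax = plt.subplots(figsize=(5, 5))
auc = plot_roc(good_pdf, bad_pdf, ax)
```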
Now we plot the probability distribution and the ROC curve next to each other for visual comparison:
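The code for the combined figure is missing here as well; one self-contained way to sketch it is to compute the distributions and ROC inline and place them on side-by-side axes (figure size and styling are assumptions):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt

x = np.linspace(0, 1, num=100)
std = 0.1
good_pdf = np.exp(-((x - 0.4)**2) / (2 * std**2)) / np.sqrt(2 * np.pi * std**2)
bad_pdf = np.exp(-((x - 0.6)**2) / (2 * std**2)) / np.sqrt(2 * np.pi * std**2)

# Sweep the threshold over x: area to the right of each threshold
TPR = np.cumsum(bad_pdf[::-1])[::-1] / np.sum(bad_pdf)
FPR = np.cumsum(good_pdf[::-1])[::-1] / np.sum(good_pdf)

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].fill_between(x, good_pdf, alpha=0.5, label="good")
axes[0].fill_between(x, bad_pdf, alpha=0.5, label="bad")
axes[0].set_xlabel("P(X='bad')")
axes[0].set_title("Probability Distribution")
axes[0].legend()
axes[1].plot(FPR, TPR)
axes[1].plot(x, x, "--")  # random-classifier diagonal
axes[1].set_xlabel("FPR")
axes[1].set_ylabel("TPR")
axes[1].set_title("ROC Curve")
plt.tight_layout()
```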
Effect of Class Separation
Now that we can derive both plots, let’s see how the ROC curve changes as the class separation (i.e. the model performance) improves. We do this by altering the mean value of the Gaussian in the probability distributions.
As you can see, the AUC increases as we increase the separation between the classes.
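A short numeric check of this claim, sweeping the mean of the "bad" Gaussian away from the "good" one and recomputing the AUC each time (the helper `auc_from_pdfs` and the chosen means are illustrative assumptions, not the article's code):

```python
import numpy as np

def pdf(x, std, mean):
    # Gaussian probability density function, as defined earlier
    const = 1.0 / np.sqrt(2 * np.pi * std**2)
    return const * np.exp(-((x - mean)**2) / (2.0 * std**2))

def auc_from_pdfs(good_pdf, bad_pdf):
    # TPR/FPR swept over all thresholds, then trapezoidal area under the ROC
    TPR = np.cumsum(bad_pdf[::-1])[::-1] / np.sum(bad_pdf)
    FPR = np.cumsum(good_pdf[::-1])[::-1] / np.sum(good_pdf)
    fpr, tpr = FPR[::-1], TPR[::-1]
    return np.sum(np.diff(fpr) * (tpr[:-1] + tpr[1:]) / 2)

x = np.linspace(0, 1, num=100)
good_pdf = pdf(x, 0.1, 0.4)
# Larger separation between the class means -> larger AUC
aucs = [auc_from_pdfs(good_pdf, pdf(x, 0.1, m)) for m in [0.5, 0.6, 0.7]]
print(aucs)
```

With equal standard deviations, the AUC grows monotonically with the distance between the means, matching the plots above.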
Looking Beyond The AUC
Beyond AUC, the ROC curve can also help debug a model. By looking at the shape of the ROC curve, we can evaluate what the model is misclassifying. If the bottom left corner of the curve is close to the random line, the model is misclassifying at X=0; if the top right corner is close to the random line, the errors are occurring at X=1. Also, spikes on the curve (as opposed to a smooth shape) imply the model is not stable.