Explaining the 68-95-99.7 rule for a Normal Distribution

This post explains how those numbers were derived in the hope that they can be more interpretable for your future endeavors.

By Michael Galarnyk
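For a normal distribution, about 68% of the data is within 1 standard deviation (σ) of the mean (μ), 95% is within 2, and 99.7% is within 3. To see where these numbers come from, start with the probability density function (PDF) of a normal distribution:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

PDF for a Normal Distribution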
 

Let’s simplify it by assuming we have a mean (μ) of 0 and a standard deviation (σ) of 1.


$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}$$

PDF for a Normal Distribution with μ = 0 and σ = 1
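As a quick sanity check on that simplification, here is a minimal sketch that compares the general PDF with the simplified one. The helper names `normal_pdf` and `standard_normal_pdf` are made up for this illustration and are not part of the original post.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """General normal PDF (hypothetical helper for illustration)."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

def standard_normal_pdf(x):
    """Simplified PDF, assuming a mean of 0 and a standard deviation of 1."""
    return math.exp(-(x ** 2) / 2.0) / math.sqrt(2.0 * math.pi)

# With mu = 0 and sigma = 1, the two forms should agree at every point
for point in (-2.0, 0.0, 1.5):
    assert math.isclose(normal_pdf(point, 0.0, 1.0), standard_normal_pdf(point))

# Height of the curve at the mean: 1 / sqrt(2 * pi), roughly 0.3989
peak = normal_pdf(0.0)
print(peak)
```

Setting μ = 0 removes the shift inside the exponent, and setting σ = 1 removes the scaling in both the exponent and the leading constant.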

 

Now that the function is simpler, let’s graph this function with a range from -3 to 3.

# Import all libraries for the rest of the blog post
from scipy.integrate import quad
import numpy as np
import matplotlib.pyplot as plt

# Evaluate the PDF (mean 0, standard deviation 1) at 100 points in [-3, 3]
x = np.linspace(-3, 3, num=100)
constant = 1.0 / np.sqrt(2 * np.pi)
pdf_normal_distribution = constant * np.exp((-x**2) / 2.0)

# Plot the density curve, with the y-axis starting at 0
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x, pdf_normal_distribution)
ax.set_ylim(bottom=0)
ax.set_title('Normal Distribution', size=20)
ax.set_ylabel('Probability Density', size=20)
plt.show()


The graph above shows probability density, not probability. To get the probability that a value falls within a given range, we need to integrate the density over that range. For example, to find the probability of a random data point landing within 1 standard deviation of the mean, we integrate from -1 to 1. This can be done with SciPy.

# Define the PDF of the normal distribution (mean 0, standard deviation 1) as a function
def normalProbabilityDensity(x):
    constant = 1.0 / np.sqrt(2*np.pi)
    return constant * np.exp((-x**2) / 2.0)

# Integrate the PDF from -1 to 1; quad returns the integral and an error estimate
result, _ = quad(normalProbabilityDensity, -1, 1, limit=1000)
print(result)  # approximately 0.6827



[Figure: code to integrate the PDF of a normal distribution from -1 to 1, and a visualization of the integral.]

 

68% of the data is within 1 standard deviation (σ) of the mean (μ).
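To picture that 68%, here is a minimal sketch that shades the area under the curve between -1 and 1. It assumes a mean of 0 and a standard deviation of 1, as above; the helper name `standard_normal_pdf`, the colors, and the title are choices made for this illustration, not taken from the original figure.

```python
import numpy as np
import matplotlib.pyplot as plt

def standard_normal_pdf(x):
    """PDF of a normal distribution with mean 0 and standard deviation 1."""
    return np.exp(-x**2 / 2.0) / np.sqrt(2 * np.pi)

# Full curve on [-3, 3] and the sub-range within 1 standard deviation
x = np.linspace(-3, 3, num=300)
x_inside = np.linspace(-1, 1, num=100)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x, standard_normal_pdf(x), color='black')
# The shaded region is the integral from -1 to 1 (about 0.68)
shaded = ax.fill_between(x_inside, standard_normal_pdf(x_inside), alpha=0.4)
ax.set_ylim(bottom=0)
ax.set_title('Area within 1 standard deviation of the mean', size=20)
ax.set_ylabel('Probability Density', size=20)
```

In a Jupyter notebook the figure renders inline; when running this as a script, add `plt.show()` at the end.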

If you are interested in finding the probability of a random data point landing within 2 standard deviations of the mean, you need to integrate from -2 to 2.
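The integral described above can be sketched the same way as the 1 standard deviation case, changing only the limits. The function name `standard_normal_pdf` is a stand-in chosen for this example, and the mean of 0 and standard deviation of 1 are the same assumptions used earlier.

```python
import numpy as np
from scipy.integrate import quad

def standard_normal_pdf(x):
    """PDF of a normal distribution with mean 0 and standard deviation 1."""
    return np.exp(-x**2 / 2.0) / np.sqrt(2 * np.pi)

# Integrate the PDF from -2 to 2; quad also returns an estimate of the absolute error
within_two_sigma, abs_error_estimate = quad(standard_normal_pdf, -2, 2)
print(round(within_two_sigma, 4))  # approximately 0.9545
```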


[Figure: code to integrate the PDF of a normal distribution from -2 to 2, and a visualization of the integral.]

 

95% of the data is within 2 standard deviations (σ) of the mean (μ).

If you are interested in finding the probability of a random data point landing within 3 standard deviations of the mean, you need to integrate from -3 to 3.
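As a sketch, the loop below runs the same integration for 1, 2, and 3 standard deviations. As a cross-check that is not part of the original post, it also compares each result with the closed form erf(k / √2) from Python's `math` module, a standard identity for a normal distribution with mean 0 and standard deviation 1. The names `standard_normal_pdf` and `results` are illustrative.

```python
import math

import numpy as np
from scipy.integrate import quad

def standard_normal_pdf(x):
    """PDF of a normal distribution with mean 0 and standard deviation 1."""
    return np.exp(-x**2 / 2.0) / np.sqrt(2 * np.pi)

results = {}
for k in (1, 2, 3):
    # Numerical integral of the PDF from -k to k
    numeric, _ = quad(standard_normal_pdf, -k, k)
    # Closed form for the same probability (cross-check, an addition to the post)
    closed_form = math.erf(k / math.sqrt(2))
    results[k] = (numeric, closed_form)
    print(k, round(numeric, 4), round(closed_form, 4))
```

The two columns should agree to many decimal places, rounding to roughly 0.6827, 0.9545, and 0.9973.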


[Figure: code to integrate the PDF of a normal distribution from -3 to 3, and a visualization of the integral.]

 

99.7% of the data is within 3 standard deviations (σ) of the mean (μ).

It is important to note that for any PDF, the total area under the curve must be 1, because the probability of drawing some value from the set of all possible values is always 1.
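A minimal sketch of this check uses `quad` with infinite limits. The exponential PDF is included only as a second, non-normal example and is an addition for illustration; both function names are made up for this snippet.

```python
import numpy as np
from scipy.integrate import quad

def standard_normal_pdf(x):
    """PDF of a normal distribution with mean 0 and standard deviation 1."""
    return np.exp(-x**2 / 2.0) / np.sqrt(2 * np.pi)

def exponential_pdf(x):
    """PDF of an exponential distribution with rate 1, defined for x >= 0."""
    return np.exp(-x)

# Integrate each PDF over every value it can take
normal_total, _ = quad(standard_normal_pdf, -np.inf, np.inf)
exponential_total, _ = quad(exponential_pdf, 0, np.inf)
print(normal_total, exponential_total)  # both should be very close to 1
```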

You will also find that observations can fall 4, 5, or even more standard deviations from the mean, but this is very rare if you have a normal or nearly normal distribution.
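To put a number on "very rare", here is a small sketch that computes the probability of landing more than k standard deviations from the mean. It assumes an exactly normal distribution and uses the complementary error function from Python's `math` module, a standard closed form that the original post does not use. The helper name `probability_beyond` is hypothetical.

```python
import math

def probability_beyond(k):
    """P(|Z| > k) for a normal distribution with mean 0 and standard deviation 1."""
    return math.erfc(k / math.sqrt(2))

tail_probabilities = {k: probability_beyond(k) for k in (3, 4, 5)}
for k, probability in tail_probabilities.items():
    # Beyond 3: roughly 2.7e-03; beyond 4: roughly 6.3e-05; beyond 5: roughly 5.7e-07
    print(k, f"{probability:.2e}")
```

Under these assumptions, a point more than 4 standard deviations out shows up about once in every 15,800 draws, and one more than 5 out about once in every 1.7 million.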

Future tutorials will cover how to apply this knowledge to box plots and confidence intervals. If you have any questions or thoughts on the tutorial, feel free to reach out in the comments below or through Twitter.

 
Bio: Michael Galarnyk is a Data Scientist and Corporate Trainer. He currently works at Scripps Translational Research Institute. You can find him on Twitter (https://twitter.com/GalarnykMichael), Medium (https://medium.com/@GalarnykMichael), and GitHub (https://github.com/mGalarnyk).

Original. Reposted with permission.
