Working with Confidence Intervals

 

In data science and statistics, confidence intervals are a useful way to quantify uncertainty in a dataset. For approximately normally distributed data, the 68% confidence interval covers the values that fall within one standard deviation of the mean, and the 95% confidence interval covers the values that fall within two standard deviations of the mean. The spread can also be estimated with the interquartile range, which covers the values between the 25th percentile and the 75th percentile; the 50th percentile is the median, which coincides with the mean when the distribution is symmetric.
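The one- and two-standard-deviation figures quoted above come from the normal distribution. As a quick sanity check, the sketch below computes the fraction of a normal distribution that lies within k standard deviations of the mean using the error function. The helper name `normal_coverage` is ours, chosen for illustration, and the calculation assumes the data are exactly normal, which real height data only approximate.

```python
import math

def normal_coverage(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2):
    print(f"within {k} standard deviation(s): {normal_coverage(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
```

Two standard deviations cover slightly more than 95% of a normal distribution; the exact 95% multiplier is about 1.96, and 2 is the usual rounded value.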

In this article, we illustrate how the confidence interval can be calculated using the heights dataset. The heights dataset contains male and female height data.

First, we generate the probability distribution of the male and female heights.

# import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# obtain dataset
df = pd.read_csv('https://raw.githubusercontent.com/bot13956/Bayes_theorem/master/heights.csv')

# plot probability distribution of heights
sns.kdeplot(df[df.sex=='Female']['height'], label='Female')
sns.kdeplot(df[df.sex=='Male']['height'], label = 'Male')
plt.xlabel('height (inch)')
plt.title('probability distribution of Male and Female heights')
plt.legend()
plt.show()

 

Probability distribution of male and female heights | Image by Author.
 

From the figure above, we observe that males are on average taller than females.

The code below illustrates how the 95% confidence intervals for the male and female heights can be calculated.

# calculate confidence intervals for male heights
mu_male = np.mean(df[df.sex=='Male']['height'])
mu_male
>>> 69.31475494143555

std_male = np.std(df[df.sex=='Male']['height'])
std_male
>>> 3.608799452913512

# mean +/- 2 standard deviations (values below rounded to two decimals)
conf_int_male = [mu_male - 2*std_male, mu_male + 2*std_male]
conf_int_male
>>> [62.10, 76.53]

# calculate confidence intervals for female heights
mu_female = np.mean(df[df.sex=='Female']['height'])
mu_female
>>> 64.93942425064515

std_female = np.std(df[df.sex=='Female']['height'])
std_female
>>> 3.752747269853828

conf_int_female = [mu_female - 2*std_female, mu_female + 2*std_female]
conf_int_female
>>> [57.43392971093749, 72.4449187903528]
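The same calculation can be wrapped in a small reusable function and checked empirically. The sketch below is not part of the original analysis: the helpers `std_interval` and `coverage` are hypothetical names, and the data are synthetic normal samples (mean 69, standard deviation 3.6, chosen only to resemble the male heights) so that the example runs without downloading the dataset.

```python
import numpy as np

def std_interval(values, k=2.0):
    """Return (mean - k*std, mean + k*std) using the population std (ddof=0)."""
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    sigma = values.std()  # ddof=0, the NumPy default
    return mu - k * sigma, mu + k * sigma

def coverage(values, low, high):
    """Fraction of values that fall inside [low, high]."""
    values = np.asarray(values, dtype=float)
    return float(np.mean((values >= low) & (values <= high)))

# Synthetic stand-in for the heights column (an assumption, not the real data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=69.0, scale=3.6, size=10_000)

low, high = std_interval(sample, k=2)
print(f"interval: [{low:.2f}, {high:.2f}]")
print(f"fraction of values inside: {coverage(sample, low, high):.3f}")
```

For data that are close to normal, the printed fraction should be close to 0.95. With the heights dataset loaded as `df`, the same function could be called as `std_interval(df[df.sex=='Male']['height'])`.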

 

Another method to estimate the confidence interval is to use the interquartile range. A boxplot can be used to visualize the interquartile range as illustrated below.
 

# generate boxplot
data = [df[df.sex=='Male']['height'], df[df.sex=='Female']['height']]
fig, ax = plt.subplots()
ax.boxplot(data)
ax.set_ylabel('height (inch)')
xticklabels=['Male', 'Female']
ax.set_xticklabels(xticklabels)
ax.yaxis.grid(True)
plt.show()

 

 

Box plot showing the interquartile range.| Image by Author.
 

The box shows the interquartile range, and the whiskers indicate the minimum and maximum values of the data, excluding outliers. The circles indicate the outliers. The orange line is the median. From the figure, the interquartile range for male heights is approximately [67 inches, 72 inches], and the interquartile range for female heights is approximately [63 inches, 67 inches]. The median height is about 68 inches for males and about 65 inches for females.
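Quartiles read off a figure are only approximate, so it can help to compute them directly. The sketch below uses `np.percentile` on a small toy list (not the heights data) and also derives the 1.5 × IQR fences that, to our understanding, Matplotlib's `boxplot` uses by default to decide which points are drawn as outliers. The function name `quartile_summary` and its `whis` parameter are illustrative choices.

```python
import numpy as np

def quartile_summary(values, whis=1.5):
    """Quartiles, IQR, and the fences beyond which points count as outliers."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": q1 - whis * iqr,
        "upper_fence": q3 + whis * iqr,
    }

# Toy data with one obvious outlier (30)
toy = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = quartile_summary(toy)
outliers = [v for v in toy
            if v < summary["lower_fence"] or v > summary["upper_fence"]]

print(summary)
print("outliers:", outliers)  # outliers: [30]
```

With the heights dataset loaded as `df`, calling `quartile_summary(df[df.sex=='Male']['height'])` would give the numerical counterparts of the values estimated from the box plot.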

In summary, confidence intervals are a useful way to quantify uncertainty in a dataset. For approximately normally distributed data, the 95% confidence interval covers the values that fall within two standard deviations of the mean. The spread can also be estimated with the interquartile range, which covers the values between the 25th percentile and the 75th percentile, with the 50th percentile being the median.
 
 
Benjamin O. Tayo is a Physicist, Data Science Educator, and Writer, as well as the Owner of DataScienceHub. Previously, Benjamin was teaching Engineering and Physics at U. of Central Oklahoma, Grand Canyon U., and Pittsburgh State U.
 
