Central Limit Theorem: Visualized in Python

5 min read · Mar 6, 2020


The Central Limit Theorem is of crucial importance in using statistical inference to reach conclusions about a population. It allows you to make inferences about the population mean without having to know the specific shape of the population distribution.

Image source: https://statisticsbyjim.com/basics/central-limit-theorem/

In many applications, you want to estimate the values of population parameters from statistics calculated on samples. Let’s take the mean (µ) as the population parameter of interest. The idea is to estimate the population mean (µ) from samples drawn from that population.

The Central Limit Theorem (CLT): As the sample size (the number of values in each sample) gets large enough, the sampling distribution of the mean becomes approximately normal. This holds regardless of the shape of the distribution of the individual values in the population, provided the population has a finite variance.

Conditions for CLT

The Central Limit Theorem assumes the following:

Randomization Condition: Each sample should be a random sample from the population.
Independence Assumption: The sample values must be independent of each other, meaning that the occurrence of one event has no influence on the next.
10% Rule: When sampling without replacement, the sample size should be no more than 10% of the entire population.
Sample Size Assumption: The sample size must be sufficiently large. How large depends on the shape of the population distribution: the more it departs from normal (for example, a skewed distribution), the larger the sample size must be. Statisticians typically say that a sample size of 30 is sufficient for most distributions, but strongly skewed distributions can require larger samples.
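
To make the Sample Size Assumption concrete, here is a minimal sketch (not part of the original code) that measures how skewed the sample means are for several sample sizes. It assumes an exponential population, chosen only because it is strongly right-skewed; the `skewness` helper and the figure of 10,000 samples per sample size are illustrative choices as well.

```python
import numpy as np

rng = np.random.default_rng(0)

def skewness(values):
    """Third standardized moment: close to 0 for a symmetric distribution."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

# Assumed population: exponential, which is strongly right-skewed.
num_samples = 10_000
skew_by_n = {}
for n in (2, 10, 30, 100):
    # Each row is one independent random sample of size n.
    samples = rng.exponential(scale=1.0, size=(num_samples, n))
    skew_by_n[n] = skewness(samples.mean(axis=1))

for n, skew in skew_by_n.items():
    print(f"n = {n:>3}: skewness of sample means = {skew:.3f}")
```

The skewness of the sample means should shrink toward 0 as n grows, which is why a strongly skewed population needs a larger sample size before the normal approximation becomes reasonable.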

Properties of CLT

Normal distributions have two parameters, the mean and standard deviation. What values do these parameters converge on?

As the sample size increases, the sampling distribution converges on a normal distribution whose mean equals the population mean and whose standard deviation equals σ/√n, where:

σ = the population standard deviation
n = the sample size

Mathematically,

µ X̄ = µ and σ X̄ = σ/√n

where:

µ X̄ is the mean of the sampling distribution
µ is the population mean
σ X̄ is the standard deviation of the sampling distribution
σ is the population standard deviation
n is the sample size
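
These two properties can be checked exactly on a toy case. The sketch below is an illustration, not the article's data: it assumes a population made of the six faces of a fair die and enumerates every possible sample of n = 2 rolls, so no random sampling is involved.

```python
import itertools
import math

# Hypothetical toy population: the six faces of a fair die.
faces = [1, 2, 3, 4, 5, 6]
mu = sum(faces) / len(faces)
sigma = math.sqrt(sum((x - mu) ** 2 for x in faces) / len(faces))

n = 2  # sample size
# Every equally likely sample of n independent rolls (6**n outcomes).
sample_means = [sum(s) / n for s in itertools.product(faces, repeat=n)]

mean_of_means = sum(sample_means) / len(sample_means)
std_of_means = math.sqrt(
    sum((m - mean_of_means) ** 2 for m in sample_means) / len(sample_means)
)

print(f"mu = {mu:.4f}, mean of sample means = {mean_of_means:.4f}")
print(f"sigma/sqrt(n) = {sigma / math.sqrt(n):.4f}, "
      f"std of sample means = {std_of_means:.4f}")
```

Both pairs of printed values should agree, because the mean and standard deviation formulas hold for any sample size; only the normal shape needs n to be large.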

Now let’s explore these properties visually using Python ☺ Code can be downloaded from here.

Step 1: Create the population data of size 5000 and plot its distribution.
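
The code that generates the population is not reproduced in this post, so the following is only a sketch under assumptions: it draws 5000 values from a gamma distribution, an arbitrary right-skewed choice, with a fixed seed. The distribution, its parameters, and the seed are all hypothetical, and the resulting mean and standard deviation will not match the figures quoted in this article.

```python
import numpy as np

# Assumption: the original population-generating code is not shown,
# so a right-skewed gamma distribution is used as a stand-in.
rng = np.random.default_rng(42)
population = rng.gamma(shape=2.0, scale=1.2, size=5000)

print('Population size:', population.size)
print('Mean: {} and Standard Deviation: {}'.format(
    round(population.mean(), 5), round(population.std(), 5)))
```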

The randomly generated population data is plotted below as a histogram, alongside its QQ plot:

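import matplotlib.pyplot as plt
from scipy import stats

# `population` is the NumPy array of 5000 values created in Step 1
plt.figure(figsize=(15,6))
# Left panel: histogram annotated with the mean and standard deviation
plt.subplot(1, 2, 1)
plt.hist(population,bins=50,color='g')
plt.text(3.8,300,'Mean: {} and Standard Deviation: {}'.format(round(population.mean(),5),round(population.std(),5)),fontsize=8,bbox=dict(facecolor='green', alpha=0.5))
# Right panel: QQ plot against the normal distribution
plt.subplot(1, 2, 2)
stats.probplot(population, dist="norm", plot=plt)
plt.show()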

population_mean = population.mean()
population_std = population.std()
population_mean = round(population_mean,5)
population_std = round(population_std,5)
print('Population mean: {} and Population Standard Deviation: {}'.format(population_mean,population_std))

The population mean (µ) and standard deviation (σ) are 2.38545 and 1.6335, respectively. In the next steps, we will estimate the population mean from the sampling distribution of the mean.

Step 2: Plot the “sampling distribution of the mean” for different sample sizes (n) to compare the results. A QQ plot is drawn alongside each one to check how close that sampling distribution is to normal. If the points on the QQ plot lie along the 45° reference line, the distribution follows a normal distribution.

Plotting sampling distribution of mean of different sample sizes
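
The sampling and plotting code for this step is not reproduced here, so the following is a rough sketch of one way it could be done. The helper names `sampling_distribution_of_mean` and `plot_sampling_distribution`, the 1000 samples per sample size, and the gamma stand-in population are all assumptions made for illustration.

```python
import numpy as np

def sampling_distribution_of_mean(population, n, num_samples=1000, rng=None):
    """Return `num_samples` sample means, each from a random sample of size n."""
    rng = np.random.default_rng() if rng is None else rng
    # Sampling with replacement keeps the draws independent of each other.
    samples = rng.choice(population, size=(num_samples, n), replace=True)
    return samples.mean(axis=1)

def plot_sampling_distribution(sample_means, n):
    """Histogram and normal QQ plot of the sample means, side by side."""
    # Imported here so the sampling helper works without plotting libraries.
    import matplotlib.pyplot as plt
    from scipy import stats

    plt.figure(figsize=(15, 6))
    plt.subplot(1, 2, 1)
    plt.hist(sample_means, bins=50, color='g')
    plt.title('Sampling distribution of the mean, n = {}'.format(n))
    plt.subplot(1, 2, 2)
    stats.probplot(sample_means, dist="norm", plot=plt)
    plt.show()

# Stand-in population (assumed); use the Step 1 `population` array if available.
rng = np.random.default_rng(1)
population = rng.gamma(shape=2.0, scale=1.2, size=5000)

# One array of sample means per sample size used in this article.
means_by_n = {
    n: sampling_distribution_of_mean(population, n, rng=rng)
    for n in (2, 5, 10, 30, 50, 100)
}
```

Calling `plot_sampling_distribution(means_by_n[30], 30)` would then draw the histogram and QQ plot for n = 30, and likewise for the other sample sizes.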

Sampling mean distribution for n = 2 and its QQ Plot

Sampling mean distribution for n = 5 and its QQ Plot

Sampling mean distribution for n = 10 and its QQ Plot

Sampling mean distribution for n = 30 and its QQ Plot

Sampling mean distribution for n = 50 and its QQ Plot

Sampling mean distribution for n = 100 and its QQ Plot

As the sample size (n) increases, the sampling distribution becomes closer to normal, as the respective QQ plots suggest. At the same time, its standard deviation (σ X̄) decreases.

Step 3: Now, let’s make an inference about the population mean (µ) from the sampling distribution for sample size (n) = 30.

The mean (µ X̄) and standard deviation (σ X̄) of the sampling distribution are 2.38887 and 0.14984 when the sample size (n) = 30.

µ X̄ (2.38887) is approximately equal to µ (2.38545), which illustrates that the sampling distribution of the mean converges to a normal distribution whose mean (µ X̄) equals the population mean (µ).

Step 4: Observe how the standard deviation (σ X̄) of the sampling distribution changes with sample size
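
The arrays used in the plotting code for this step are not built anywhere in the post. As a hedged sketch, they could be computed as follows; the names `sample_sizes` and `standard_deviations` match the plotting code, while the gamma stand-in population, the seed, and the 5000 samples per sample size are assumptions. The observed values are compared with σ/√n, the value given in the properties section.

```python
import numpy as np

rng = np.random.default_rng(7)
# Stand-in population (assumed); use the Step 1 `population` array if available.
population = rng.gamma(shape=2.0, scale=1.2, size=5000)
population_std = population.std()

sample_sizes = [2, 5, 10, 30, 50, 100]
num_samples = 5000  # assumed number of samples drawn per sample size

standard_deviations = []
for n in sample_sizes:
    # Each row is one random sample of size n, drawn with replacement.
    samples = rng.choice(population, size=(num_samples, n), replace=True)
    standard_deviations.append(samples.mean(axis=1).std())

# Value predicted by the CLT for each sample size.
expected = population_std / np.sqrt(sample_sizes)
for n, observed, predicted in zip(sample_sizes, standard_deviations, expected):
    print(f"n = {n:>3}: observed = {observed:.4f}, sigma/sqrt(n) = {predicted:.4f}")
```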

plt.plot(sample_sizes,standard_deviations,'o',label='Standard Deviation for each sample size (σ_x_bar)')
plt.plot(sample_sizes,1/np.sqrt(sample_sizes),alpha=0.5,label='1/√(sample size) curve')
plt.legend(loc='upper right')

Standard deviations for different sample sizes

The standard deviation found for each sampling distribution clearly follows the shape of the 1/√n curve: it falls in proportion to 1/√n as the sample size grows, in line with the σ/√n property of the CLT.

As the CLT is the starting point of any inferential statistics course, I hope I have been able to explain the concept visually to beginners in data science who are stepping into inferential statistics for the first time. Please don’t forget to give claps, as it motivates me a lot to write about topics in Data Science.


Written by Shubham Singh