Statistics and Data Science: A Modeling Approach
9.9 The Central Limit Theorem
We have been simulating to learn about the characteristics of sampling distributions. But this approach would not have been possible before the invention of the modern day computer. Think how long it would take to draw 1,000 random samples of n=157 from a population and calculate the mean of each? We can do that in R in 10 seconds or less. Before, it would have been days of work!
Most of what we have learned, however, has long been known by mathematicians to be true as part of the Central Limit Theorem. The Central Limit Theorem (CLT) describes the shape, center, and spread of a distribution of sample means of equal size when each sample is randomly chosen from some population. And the CLT has been proven; it’s a mathematical law which means that it is true of sampling distributions of means from all kinds of populations—not just the ones we have simulated.
According to the Central Limit Theorem, sampling distributions of means are known to have the following characteristics:
The shape of the distribution of means will typically be normal in shape, provided the sample size is large enough OR if the shape of the population is normal.
The mean of the sampling distribution will be the same as the mean of the population from which the samples are randomly chosen. That is, the sample means will center around the true population mean.
The standard error will be smaller for larger sample sizes. Even more specifically, the standard error will be equal to the population standard deviation divided by the square root of the sample size.
Let’s try and understand this last part of the CLT (the part about the standard deviation of the sampling distribution) a little more.
So even though there is always the problem of sampling variation, we are starting to appreciate that there are situations where our estimate from a sample is better, and times when the estimate is not as good. And it turns out that in general, estimates from samples—especially the mean—will occur around the true population parameter.
If the distribution of the mean looks just like the population distribution, then the standard deviations will have the same value. But as the sample size of our simulated samples grows, the means will vary less and cluster around the population mean. The CLT quantifies this pattern specifically as:
Let’s explore this mathematical pattern.
Here is the table we made from the video (we rounded the calculated standard errors to the nearest hundredth). These were standard errors that were calculated and simulated from a population with a standard deviation (\(\sigma\)) of 5.
|Sample Size of Simulated Samples (n)||Calculated Standard Error (from CLT)||Simulated Standard Error|
Can You Break the CLT?
Think you could break the CLT? Give it a try. Make population distributions with crazy shapes, then simulate sampling distributions of the mean. Are they normal? Is their standard error basically equal to the population standard deviation divided by the square root of n?
As we have simulated, we can make a sampling distribution for any summary statistic. We have looked at mean, variance, and median, but there is no reason to think that F, PRE, and other even fancier statistics couldn’t also have a sampling distribution. The sampling distribution we have focused on the most so far is the sampling distribution of the mean. Does the CLT work for sampling distributions of other statistics?