Statistics and Data Science: A Modeling Approach

6.2 Variance

Sum of squares is a good measure of total variation if we are using the mean as a model. But, it does have one important disadvantage.

Two graphs showing distributions

Although you can see that the spread of the data points does not look different between the two distributions, the one on the bottom (#2) has a much larger SS.

Sum of squares worked fine as a way to quantify error around the mean, and compare error across two distributions when both distributions had the same sample size. But SS isn’t as easily interpreted when sample sizes vary.

The reason for this is that each time you add another data point to the sample distribution, you are adding another squared deviation from the mean to the total SS. So even if two distributions appear to be equally well modeled by their respective means, they may have very different SS. SS always grows as the number of data points in the distribution gets larger, irrespective of the degree of spread.

This problem is solved by adding two new statistics to our toolbox: variance and standard deviation. To calculate variance, we start with SS, or total error, but then divide by the sample size to end up with a measure of average error around the mean—the average of the squared deviations.

Because it is an average, variance is not impacted by sample size, and thus, can be used to compare the amount of error across two samples of different sizes.

The formula for variance, usually represented as \(s^2\), is this:

\[\frac{\sum_{i=1}^n (Y_i-\bar{Y})^2}{n-1}\]

You can see that the numerator is the sum of squares. Although to get an actual average of squared deviations you would divide by n, we instead divide by n-1. We do this because simulation studies have shown that dividing by n-1 gives us a better estimate of the actual population variance.

The reason for this is that when you take a small sample, the most extreme values in a population are unlikely to show up. So, if we divided by n it would, especially in smaller samples, slightly underestimate the true population variance. Dividing by n-1 corrects this bias, making the variance estimate a bit larger. And, as the sample gets larger, the difference between n and n-1 makes less and less difference. If you want to know more, you can read about this correction here.

The main thing to know is that taking the SS and dividing by n-1 results in something that approximates an average squared deviation. (Also note: the n-1 you see in the denominator is sometimes called the degrees of freedom, or df. This will be more important later.)

So how do we calculate variance in R? We use var(). Here is how to calculate the variance of our Thumb data from TinyFingers.

[1] 16.4

Try calculating the variance of Thumb from the larger Fingers data frame.

require(tidyverse) require(mosaic) require(Lock5Data) require(supernova) Empty.model <- lm(Thumb ~ NULL, data = Fingers) # calculate the variance of Thumb from the Fingers data frame var() var(Fingers$Thumb) ex() %>% check_function("var") %>% check_result() %>% check_equal()
You can use Fingers$Thumb to use just the Thumb column
DataCamp: ch6-5

[1] 76.1552