CourseKata - 6.3 Variance

6.3 Variance

Sum of squares is a good measure of total variation when we use the mean as a model. But it has one important disadvantage.

[Figure: a faceted histogram. The top panel shows the distribution of outcome within group 1, with its mean marked; the bottom panel shows the distribution of outcome within group 2, with its mean marked.]


Although you can see that the spread of the data points does not look different between the two distributions, the one on the bottom (#2) has a much larger SS.


Sum of squares worked fine as a way to quantify error around the mean, and compare error across two distributions when both distributions had the same sample size. But SS isn’t as easily interpreted when sample sizes vary.

The reason for this is that each time you add another data point to the sample distribution, you are adding another squared deviation from the mean to the total SS. So even if two distributions appear to be equally well modeled by their respective means, they may have very different SS. SS never shrinks when a data point is added, and it grows whenever the new point differs from the mean, irrespective of the degree of spread.
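A minimal sketch of this effect, using a made-up vector of six thumb lengths (illustrative values assumed here, not read from any data frame). Repeating every value leaves the mean and the spread unchanged, yet SS doubles because there are twice as many squared deviations to add up. The helper `sum_of_squares()` is our own name, not a package function.

```r
# Hypothetical thumb lengths in mm (illustrative values only)
thumb_small <- c(56, 60, 61, 63, 64, 68)

# The same values repeated twice: same mean, same spread, double the sample size
thumb_large <- rep(thumb_small, times = 2)

# Our own helper: sum of squared deviations from the mean
sum_of_squares <- function(y) sum((y - mean(y))^2)

ss_small <- sum_of_squares(thumb_small)
ss_large <- sum_of_squares(thumb_large)

ss_small  # 82
ss_large  # 164
```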

This problem is solved by adding two new statistics to our toolbox: variance and standard deviation. To calculate variance, we start with SS, or total error, but then divide by the sample size to end up with a measure of average error around the mean—the average of the squared deviations.

Because it is an average, variance does not grow simply because a sample is larger, and thus can be used to compare the amount of error across two samples of different sizes.
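As a rough illustration, the sketch below compares SS and variance for a made-up sample and for the same sample repeated twice (the values are assumed for illustration only). It uses `var()`, base R's sample variance function. SS doubles, while the two variances stay close to each other; they are not exactly equal only because of the n-1 denominator.

```r
# Hypothetical thumb lengths in mm (illustrative values only)
thumb_small <- c(56, 60, 61, 63, 64, 68)
thumb_large <- rep(thumb_small, times = 2)  # same spread, twice the sample size

# Sum of squares for each sample
ss_small <- sum((thumb_small - mean(thumb_small))^2)
ss_large <- sum((thumb_large - mean(thumb_large))^2)

# Variance for each sample
var_small <- var(thumb_small)
var_large <- var(thumb_large)

ss_large / ss_small    # 2
var_large / var_small  # about 0.91
```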

The formula for variance, usually represented as \(s^2\), is this:

\[s^2 = \frac{\sum_{i=1}^n (Y_i-\bar{Y})^2}{n-1}\]

You can see that the numerator is the sum of squares. Although to get an actual average of squared deviations you would divide by n, we instead divide by n-1. We do this because dividing by n-1 gives us a better estimate of the true population variance, a fact easily demonstrated by simulating multiple random samples from a population of known variance and then seeing which estimates are better: those obtained by dividing by n, or those obtained by dividing by n-1.

There is, of course, a mathematical proof for this (for reference, a proof of the n-1 correction is available as a PDF download, 347KB). But we find it helpful to think about it this way: when you take a small sample, the most extreme values in a population are unlikely to show up. So if we divided by n, we would, especially in smaller samples, slightly underestimate the true population variance. Dividing by n-1 corrects this bias, making the variance estimate a bit larger. And as the sample gets larger, the choice between n and n-1 matters less and less.
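The simulation described above can be sketched roughly as follows. Everything specific here is an assumption made for illustration: a normal population with a standard deviation of 10 (so a true variance of 100), samples of size 5, 10,000 simulated samples, and an arbitrary seed.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

pop_sd <- 10     # assumed population SD, so the true variance is 100
n      <- 5      # small sample size, where the bias is most visible
n_sims <- 10000  # number of simulated samples

# For each simulated sample, compute the sum of squares around its own mean
ss <- replicate(n_sims, {
  y <- rnorm(n, mean = 0, sd = pop_sd)
  sum((y - mean(y))^2)
})

est_n       <- ss / n        # estimates obtained by dividing by n
est_n_minus <- ss / (n - 1)  # estimates obtained by dividing by n - 1

mean(est_n)        # tends to land near 80, below the true variance
mean(est_n_minus)  # tends to land near 100, the true variance
```

Averaged over many samples, the estimates that divide by n fall short of the true variance by a factor of about (n-1)/n, while those that divide by n-1 center on it. Increasing `n` in the sketch should shrink the gap between the two.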

The main thing to know is that taking the SS and dividing by n-1 results in something that approximates an average squared deviation. (Also note: the n-1 you see in the denominator is sometimes called the degrees of freedom, or df. This will be more important later.)
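To make the calculation concrete, here is a small sketch that computes variance by hand, as SS divided by n-1, and checks it against base R's `var()`. The six values are made up for illustration and are not read from any data frame.

```r
# Hypothetical thumb lengths in mm (illustrative values only)
y <- c(56, 60, 61, 63, 64, 68)

ss <- sum((y - mean(y))^2)  # sum of squared deviations from the mean
df <- length(y) - 1         # degrees of freedom, n - 1

variance_by_hand <- ss / df

variance_by_hand  # 16.4
var(y)            # 16.4, the same value
```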

So how do we calculate variance in R? We use var(). Here is how to calculate the variance of our Thumb data from TinyFingers.

var(TinyFingers$Thumb)

[1] 16.4

Try calculating the variance of Thumb from the larger Fingers data frame.

require(coursekata)

# calculate the variance of Thumb from the Fingers data frame
var(Fingers$Thumb)

[1] 76.1552
