Statistics and Data Science: A Modeling Approach
6.2 Variance
Sum of squares is a good measure of total variation if we are using the mean as a model. But it does have one important disadvantage.
Although you can see that the spread of the data points does not look different between the two distributions, the one on the bottom (#2) has a much larger SS.
Sum of squares worked fine as a way to quantify error around the mean, and compare error across two distributions when both distributions had the same sample size. But SS isn’t as easily interpreted when sample sizes vary.
The reason for this is that each time you add another data point to the sample distribution, you are adding another squared deviation from the mean to the total SS. So even if two distributions appear to be equally well modeled by their respective means, they may have very different SS. SS always grows as the number of data points in the distribution gets larger, irrespective of the degree of spread.
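To see this in a small case, here is a minimal sketch using made-up numbers. The vectors small_sample and large_sample are hypothetical and do not come from the Fingers data. The larger sample is just the smaller one repeated, so the spread looks the same, yet SS doubles.

```r
# Hypothetical thumb lengths (mm); not from the course data
small_sample <- c(58, 60, 62, 64, 66)
# The same values repeated twice: same spread, twice the sample size
large_sample <- rep(small_sample, times = 2)

# Sum of squared deviations from the mean
ss <- function(y) sum((y - mean(y))^2)

ss_small <- ss(small_sample)
ss_large <- ss(large_sample)

ss_small  # 40
ss_large  # 80
```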
This problem is solved by adding two new statistics to our toolbox: variance and standard deviation. To calculate variance, we start with SS, or total error, but then divide by the sample size to end up with a measure of average error around the mean—the average of the squared deviations.
Because it is an average, variance is not impacted by sample size, and thus, can be used to compare the amount of error across two samples of different sizes.
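As a rough illustration of that idea, the sketch below divides SS by the sample size for two made-up samples (hypothetical vectors, one a repeated copy of the other). It uses the plain average, dividing by n, as described above. The two SS values differ, but the average squared deviation does not.

```r
# Hypothetical thumb lengths (mm); not from the course data
small_sample <- c(58, 60, 62, 64, 66)
large_sample <- rep(small_sample, times = 2)

# Sum of squared deviations from the mean
ss <- function(y) sum((y - mean(y))^2)

# Average squared deviation: SS divided by the sample size
avg_sq_dev_small <- ss(small_sample) / length(small_sample)
avg_sq_dev_large <- ss(large_sample) / length(large_sample)

avg_sq_dev_small  # 8
avg_sq_dev_large  # 8
```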
The formula for variance, usually represented as \(s^2\), is this:
\[\frac{\sum_{i=1}^n (Y_i - \bar{Y})^2}{n-1}\]
You can see that the numerator is the sum of squares. Although to get an actual average of squared deviations you would divide by n, we instead divide by n - 1. We do this because simulation studies have shown that dividing by n - 1 gives us a better estimate of the actual population variance.
The reason for this is that when you take a small sample, the most extreme values in a population are unlikely to show up. So, if we divided by n, we would slightly underestimate the true population variance, especially in smaller samples. Dividing by n - 1 corrects this bias, making the variance estimate a bit larger. And, as the sample gets larger, the choice between n and n - 1 matters less and less. If you want to know more, you can read about this correction here.
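One way to check this claim yourself is a small simulation. The sketch below is only an illustration under settings we made up: a normal population with a true variance of 4, samples of size 5, and 10,000 repetitions. None of these settings come from the course data.

```r
set.seed(42)  # arbitrary seed so the results are reproducible

n <- 5              # a small sample size (assumed for illustration)
true_variance <- 4  # assumed population variance (sd = 2)
reps <- 10000       # number of simulated samples

# For each simulated sample, compute SS around the sample mean
ss_values <- replicate(reps, {
  y <- rnorm(n, mean = 60, sd = sqrt(true_variance))
  sum((y - mean(y))^2)
})

# Average the two candidate estimates across all simulated samples
mean_divide_by_n <- mean(ss_values / n)
mean_divide_by_n_minus_1 <- mean(ss_values / (n - 1))

mean_divide_by_n          # tends to fall below the true variance of 4
mean_divide_by_n_minus_1  # tends to land close to 4
```

Try making n larger and rerunning the code. The gap between the two averages should shrink.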
The main thing to know is that taking the SS and dividing by n - 1 results in something that approximates an average squared deviation. (Also note: the n - 1 you see in the denominator is sometimes called the degrees of freedom, or df. This will be more important later.)
So how do we calculate variance in R? We use var(). Here is how to calculate the variance of our Thumb data from TinyFingers.
var(TinyFingers$Thumb)
[1] 16.4
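To connect var() back to the formula, here is a minimal sketch that computes variance by hand and compares it with what var() returns. The vector thumb is made up for illustration; it is not the TinyFingers data, so the result will not match the 16.4 above.

```r
# Hypothetical thumb lengths (mm); not the TinyFingers data
thumb <- c(56, 60, 61, 63, 65)

n <- length(thumb)
ss <- sum((thumb - mean(thumb))^2)  # sum of squared deviations
deg_freedom <- n - 1                # the denominator in the formula

variance_by_hand <- ss / deg_freedom

ss                # 46
variance_by_hand  # 11.5
var(thumb)        # 11.5, the same value as the hand calculation
```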
Try calculating the variance of Thumb from the larger Fingers data frame.
require(coursekata)

# calculate the variance of Thumb from the Fingers data frame
var(Fingers$Thumb)
[1] 76.1552