We have developed the idea of the mean being the simplest (or empty) model of the distribution of a quantitative variable, represented in this word equation:

DATA = MEAN + ERROR

If this is true, then we can calculate error in our data set by just moving components of this equation around to get the formula:

ERROR = DATA - MEAN

Using this formula, if someone has a thumb length larger than the mean (e.g., 62 versus a mean of 60.1), then their error is a positive number (in this case, nearly +2). If they have a thumb length lower than the mean (e.g., 58) then we can calculate their error as a negative number (e.g. about -2).
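The arithmetic above can be checked directly in R. This is a minimal sketch that uses the two example thumb lengths from the text and the rounded mean of 60.1; the variable names mean_thumb, thumb, and error are made up for illustration.

```r
# rounded mean thumb length, as quoted in the text
mean_thumb <- 60.1

# the two example thumb lengths from the text
thumb <- c(62, 58)

# ERROR = DATA - MEAN
error <- thumb - mean_thumb

# the first error is positive (nearly +2), the second negative (about -2)
error
```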

We generally call the error calculated this way the residual. Now that you know how to generate predictions, we’ll refine our definition of the residual to be the difference between an actual observed score and our model’s prediction (observed minus predicted). The word residual should evoke the stuff that remains, because the residual is the leftover variation in our data once we take out the model.

To find these errors (or residuals), subtract the mean from each data point. Because the empty model predicts the mean for every student, Fingers$Predict holds the mean in every row, so in R we could run this code to get the residuals:

Fingers$Thumb - Fingers$Predict

If we run the code, R will calculate the 157 residuals, but it won’t save them unless we tell it to. The code below saves the residuals in a new variable in Fingers called Resid. (The setup lines order the rows, fit the empty model, and create the variable Predict.)

require(coursekata)

# setup: put a small, fixed set of students at the top of the data frame
Fingers$TinySet <- c(1,1,1,0,0,0,1,0,0,1, rep(0,147))
Fingers$TinySet[142] <- 1
Fingers <- arrange(arrange(Fingers, Height), desc(TinySet))

# setup: fit the empty model and save its predictions
empty_model <- lm(Thumb ~ NULL, data = Fingers)
Fingers <- Fingers %>%
  mutate(Predict = predict(empty_model))

# save the residuals from the empty_model
Fingers$Resid <- Fingers$Thumb - Fingers$Predict

# this prints the first six rows of selected variables from Fingers
head(select(Fingers, Thumb, Predict, Resid))
 Thumb  Predict     Resid
1    52 60.10366 -8.103662
2    56 60.10366 -4.103662
3    64 60.10366  3.896338
4    70 60.10366  9.896338
5    66 60.10366  5.896338
6    62 60.10366  1.896338

These residuals (or “leftovers”) are so important in modeling that there is an even easier way to get them in R. The function resid(), when given a model (e.g., empty_model) will return all the residuals from the predictions of the model.

resid(empty_model)

The following code saves the residuals that we get from the resid() function as a variable in the Fingers data frame, called EasyResid.

require(coursekata)

# setup: put a small, fixed set of students at the top of the data frame
Fingers$TinySet <- c(1,1,1,0,0,0,1,0,0,1, rep(0,147))
Fingers$TinySet[142] <- 1
Fingers <- arrange(arrange(Fingers, Height), desc(TinySet))

# setup: fit the empty model, then save its predictions and residuals
empty_model <- lm(Thumb ~ NULL, data = Fingers)
Fingers <- Fingers %>%
  mutate(
    Predict = predict(empty_model),
    Resid = Thumb - Predict
  )

# calculate the residuals from empty_model the easy way
# and save them in the Fingers data frame
Fingers$EasyResid <- resid(empty_model)

# this prints the first six rows of selected variables from Fingers
head(select(Fingers, Thumb, Predict, Resid, EasyResid))
 Thumb  Predict     Resid EasyResid
1    52 60.10366 -8.103662 -8.103662
2    56 60.10366 -4.103662 -4.103662
3    64 60.10366  3.896338  3.896338
4    70 60.10366  9.896338  9.896338
5    66 60.10366  5.896338  5.896338
6    62 60.10366  1.896338  1.896338

Notice that the values for Resid and EasyResid are the same for each row in the data set. We will generally use the resid() function from now on, just because it’s easier, but we want you to know what the resid() function is doing behind the scenes.
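To see what resid() is doing behind the scenes without relying on the Fingers data frame, here is a minimal sketch in base R. It treats the six thumb lengths printed above as a toy data set, so its mean differs from the full-data mean of 60.1; the names toy and toy_model are made up for illustration.

```r
# toy data frame: the six thumb lengths shown in the output above
toy <- data.frame(Thumb = c(52, 56, 64, 70, 66, 62))

# fit the empty model (intercept only) to the toy data
toy_model <- lm(Thumb ~ NULL, data = toy)

# residuals by hand: data minus the model's prediction
toy$Predict <- predict(toy_model)
toy$Resid <- toy$Thumb - toy$Predict

# residuals the easy way
toy$EasyResid <- resid(toy_model)

# TRUE if the two columns agree (up to rounding error)
all.equal(toy$Resid, toy$EasyResid, check.attributes = FALSE)
```

Both columns come from the same subtraction, so the comparison should report that they match.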

Below we have plotted a few of the residuals from the Fingers data set on the Thumb by Height scatterplot. Visually, the residuals can be thought of as the vertical distance between the data (the students’ actual thumb lengths) and the model’s predicted thumb length (60.1).

Note that sometimes the residuals are negative (extending below the empty model) and sometimes positive (above the empty model). Because the empty model is the mean, we know that these residuals are perfectly balanced across the full data set of 157 students.

Distribution of Residuals

Below we’ve plotted histograms of the three variables: Thumb, Predict, and Resid.

The distributions of the data and the residuals have the same shape, but the numbers on the x-axis differ. The distribution of Thumb is centered at the mean (60.1), whereas the distribution of Resid is centered at 0. Data that are smaller than the mean (such as a thumb length of 50) have negative residuals (about -10), whereas data that are larger than the mean (such as 70) have positive residuals (about 10).
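The reason the two shapes match is that subtracting a constant (the mean) from every value shifts the center without changing the spread. A minimal sketch in base R, using six thumb lengths as a toy sample (the names thumb and resid_thumb are made up for illustration):

```r
# toy sample of six thumb lengths
thumb <- c(52, 56, 64, 70, 66, 62)

# residuals from the empty model: data minus the mean
resid_thumb <- thumb - mean(thumb)

# the center moves from the mean of the data to (about) 0
mean(thumb)
mean(resid_thumb)

# the spread is unchanged, so the comparison below is TRUE
all.equal(sd(thumb), sd(resid_thumb))
```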

Let’s see what we would get if we summed all values for the variable Fingers$Resid.

require(coursekata)
empty_model <- lm(Thumb ~ NULL, data = Fingers)
Fingers <- Fingers %>%
  mutate(
    Predict = predict(empty_model),
    Resid = resid(empty_model)
  )

# sum the residuals saved in Fingers
sum(Fingers$Resid)
1.70530256582424e-13

R will sometimes give you outputs in scientific notation. The 1.7e-13 is equivalent to $$1.7 \times 10^{-13}$$, which indicates that this is a number very close to zero (the -13 means that the decimal point is shifted to the left 13 places)! Whenever you see this scientific notation with a large negative exponent after the “e”, you can just read it as “zero,” or pretty close to zero.
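If the scientific notation is hard to read, base R offers a few ways to look at the same number. This is a small sketch using the value printed above; the variable name x is made up for illustration.

```r
# the sum of residuals that R reported above
x <- 1.70530256582424e-13

# print the number in ordinary decimal notation instead of scientific notation
format(x, scientific = FALSE)

# rounding to 10 decimal places gives exactly 0
round(x, 10)

# all.equal() compares numbers with a small tolerance, so this is TRUE
all.equal(x, 0)
```

Comparing with all.equal() rather than == is the usual choice here, because == demands an exact match and would report that x is not 0.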

The residuals (or errors) around the mean always sum to 0. The mean of the errors will also always be 0, because 0 divided by n equals 0. (R will not always report the sum as exactly 0, because computers store decimal numbers with limited precision and tiny rounding errors accumulate, but it will be close enough to 0.)
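A short derivation shows why the sum must be 0. The notation is our own assumption, not the course's: $$Y_i$$ stands for the i-th data point, $$\bar{Y}$$ for the mean, and n for the number of data points. Because the mean is defined as the sum of the data divided by n, the sum of the data equals $$n\bar{Y}$$, so:

$$\sum_{i=1}^{n} (Y_i - \bar{Y}) = \sum_{i=1}^{n} Y_i - n\bar{Y} = n\bar{Y} - n\bar{Y} = 0$$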