list

Statistics and Data Science: A Modeling Approach

10.5 Interpreting Confidence Intervals

To summarize where we are: confidence intervals are constructed around our parameter estimate—in the case of the empty model, the sample mean. What a 95% confidence interval tells us is that there is a 95% likelihood that the interval contains the true parameter (so, the mean of the population). The interval is typically symmetrical with respect to the sample mean, extending the same distance below the sample mean as it does above it.

The size of a confidence interval tells us how much fluctuation there is in our parameter estimate. It can be expressed in the original units of measurement (e.g., mm) or in terms of number of standard errors above and below the mean. The larger the standard error, the wider the confidence interval. We also realized that the confidence interval (because it is dependent on the standard error) is determined in part by 1) our degrees of freedom, and 2) the standard deviation of the population.

Units of the Confidence Interval

The actual size of the 95% confidence interval is in the units of the estimate. In the case of the empty model of thumb length, the 95% confidence interval is shown below.

confint(Empty.model)
               2.5 %   97.5 %
(Intercept) 58.72794 61.47938

Try computing the 95% confidence interval for PoundsLost by housekeepers in the MindsetMatters data frame. Remember, the confidence interval is computed based on model estimates, so fit and print the empty model first.

packages <- c("mosaic", "Lock5withR", "Lock5Data", "supernova", "ggformula", "okcupiddata") lapply(packages, library, character.only = T) MindsetMatters$PoundsLost <- MindsetMatters$Wt2 - MindsetMatters$Wt # Fit and save the empty model for PoundsLost # Print your empty model # Compute the confidence interval around this estimate # Fit and save the empty model for PoundsLost Empty.model <- lm(PoundsLost ~ NULL, data = MindsetMatters) # Print your empty model Empty.model # Compute the confidence interval around this estimate confint(Empty.model) ex() %>% { check_object(., "Empty.model") %>% check_equal() check_output_expr(., "Empty.model") check_function(., "confint") %>% check_result() %>% check_equal() }
You'll want to use `lm()` and `confint()`
DataCamp: ch10-17

Compute the margin of error (in pounds) around the estimate of \(b_{0}\) using this confidence interval.

packages <- c("mosaic", "Lock5withR", "Lock5Data", "supernova", "ggformula", "okcupiddata") lapply(packages, library, character.only = T) MindsetMatters <- MindsetMatters %>% mutate(PoundsLost = Wt2 - Wt) # This saves the empty model Empty.model <- lm(PoundsLost ~ NULL, data = MindsetMatters) # Compute the margin of error # This saves the empty model Empty.model <- lm(PoundsLost ~ NULL, data = MindsetMatters) # There are many ways to calculate the margin of error # One way: confint(Empty.model)[[2]] - mean(MindsetMatters$PoundsLost) # Another way: (confint(Empty.model)[[2]] - confint(Empty.model)[[1]]) / 2 ex() %>% check_output_expr("confint(Empty.model)[[2]] - mean(MindsetMatters$PoundsLost)")
Remember, you can find the margin of error by dividing the distance of the confidence interval by 2 or by subtracting the mean from either the upper or lower bound of the confidence interval.
DataCamp: ch10-18

[1] 0.631077

You can think of this confidence interval (-1.7 to -.44) as telling us about the variability in our estimate. Even though the average pounds lost in the sample of housekeepers was -1.07 pounds, we are reasonably confident that the true population mean could be as low as -1.7 and as high as -.44.

Keeping Your Distributions Straight

All this started because our best estimate of the population mean was the sample mean. But we tried to quantify the possible error in our estimate of the population mean. After all, samples do not look exactly like the population they come from.

The 95% confidence interval says that we can be 95% confident that the true population mean (which we don’t know and can never measure) is within a certain range. If this range is really large, then our estimate is not as good; but if it’s smaller, our estimate is better.

The reason we invoked sampling distributions is to model variation in our estimate (in this case the mean) across samples. Sampling distributions help us deal with sampling variation—how much samples from the same population (or Data Generating Process) might vary. We used the sampling distribution to estimate the margin of error—how far off the population mean could be from our estimate.

The sample mean is the best point estimate we have of the population mean. So, we start there in trying to estimate the range of possible population parameters. But we know that our sample could have been a particularly low or high sample mean. The confidence interval helps us keep those possibilities in mind.