6.11 The Empirical Rule
The cool thing about normal distributions is that they all basically follow this pattern. In the smooth perfect version of the normal distribution (i.e., the theoretical probability distribution), Zone 1 covers about .68, Zone 2 covers .95, and Zone 3 covers .997. This .68-.95-.997 pattern is called the empirical rule.
The empirical rule tells us:
Approximately 68 percent of the scores in a normal distribution are within one standard deviation, plus or minus, of the mean.
Approximately 95 percent of the scores are within two standard deviations.
Approximately 99.7 percent of scores are within three standard deviations of the mean (in other words, almost all of them).
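If you want to check these three numbers yourself, here is a minimal sketch in R. It relies on pnorm(), which returns the proportion of a normal distribution that falls below a given value (by default, for a normal distribution with a mean of 0 and a standard deviation of 1). The helper name prop_within() is made up for this illustration.

```r
# proportion of a normal distribution within k standard deviations of the mean
# (prop_within is a hypothetical helper name, not part of any package)
prop_within <- function(k) {
  pnorm(k) - pnorm(-k)
}

empirical_rule <- sapply(1:3, prop_within)
names(empirical_rule) <- c("zone1", "zone2", "zone3")
round(empirical_rule, 3)
```

Rounded to three places, the results are .683, .954, and .997, which is why the rule is stated with "approximately."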
The smooth normal distribution is so perfect that it doesn’t really exist. It’s a mathematical object. There are straight lines in the world, but a mathematical straight line is a perfect thing that has no mass, no jitter, and goes on forever. In the same way, a mathematical normal distribution is perfect: it has no mass, no jitter, and it goes on forever.
The tails of the normal distribution never quite hit 0; they just go on forever and ever. This is why the normal distribution is sometimes called asymptotic. This feature is important because it allows us to predict the very tiny probabilities of very unlikely events, such as finding a person with a thumb length of 1,000 mm.
You have probably never even heard of a thumb that long. But if we assume the normal probability distribution, we can quantify exactly how low the probability of such a rare event would be.
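As a rough sketch of what that calculation could look like, the code below uses pnorm() to get the probability of a thumb at least 1,000 mm long. The mean (60 mm) and standard deviation (9 mm) are round numbers assumed for illustration; they are not taken from this page, so swap in your own estimates.

```r
# ASSUMPTION: thumb lengths are normal with mean 60 mm and SD 9 mm
# (illustrative values only)
thumb_mean <- 60
thumb_sd <- 9

# upper-tail probability of a thumb at least 1,000 mm long
p_plain <- pnorm(1000, mean = thumb_mean, sd = thumb_sd, lower.tail = FALSE)

# the same probability on the natural-log scale, which avoids underflow
log_p <- pnorm(1000, mean = thumb_mean, sd = thumb_sd,
               lower.tail = FALSE, log.p = TRUE)

p_plain
log_p
```

Under these assumptions the probability is so tiny that p_plain prints as 0, while log_p is still a finite (very negative) number. The tail of the distribution never actually reaches 0; the computer simply runs out of digits to represent it.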
You can try making up a standard deviation for your own game (we’ll call it Zargle) and simply run the code. It will show you the histograms and proportions for the three zones. Try some different standard deviations to see whether you can break the empirical rule.
library(coursekata)

# simulate n scores for one game and label each score with its zone
# (the number of standard deviations from the mean, rounded up)
simulate_scores <- function(game, n, mean, sd) {
  scores <- rnorm(n, mean, sd)
  z <- (scores - mean) / sd
  interval <- ifelse(z > 0, trunc(1 + z), trunc(z - 1))
  data.frame(game = game, scores = scores, z = z, interval = interval, zone = abs(interval))
}

# simulate Kargle and Bargle (fixed SDs) plus Zargle (your SD), then
# plot the three distributions and return the proportion in each zone
compare_score_distributions <- function(sd = 3500, mean = 35000, n = 1000, ..., .seed = 5) {
  set.seed(.seed)
  kargle <- simulate_scores("Kargle", 1000, 35000, 5000)
  bargle <- simulate_scores("Bargle", 1000, 35000, 1000)
  zargle <- simulate_scores("Zargle", n, mean, sd)
  games <- vctrs::vec_c(kargle, bargle, zargle)

  # combine all zones > 3 into a single "outside 3" zone
  games$zone <- ifelse(games$zone > 3, "outside 3", games$zone)

  # convert the proportions to cumulative proportions for all except "outside 3"
  props <- data.frame(tally(zone ~ game, data = games, format = "proportion"))
  props <- purrr::map_dfr(split(props, props$game), function(x) {
    x$Freq <- c(cumsum(x$Freq[1:3]), x$Freq[4])
    x
  })

  # re-format the table to be wide (one column per game)
  zone_table <- tidyr::pivot_wider(props, names_from = game, values_from = Freq)

  gf_histogram(~scores, fill = ~zone, data = games, bins = 160, alpha = .8) %>%
    gf_facet_grid(game ~ .) %>%
    print()

  data.frame(zone_table)
}

# change the standard deviation to whatever you'd like it to be
# try to break the empirical rule!
compare_score_distributions(sd = 3500, mean = 35000, n = 1000)
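This is what we would get for the Zargle distribution if the standard deviation were set to 3,500. The proportions for zones 1 through 3 are cumulative: the zone 2 row, for example, is the proportion of scores within two standard deviations of the mean.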
       zone Bargle Kargle Zargle
1         1  0.686  0.690  0.675
2         2  0.950  0.948  0.944
3         3  0.998  0.996  0.997
4 outside 3  0.002  0.004  0.003
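Why is the rule so hard to break? Here is a stripped-down sketch of the same experiment that uses only base R (no plots, and a larger sample so the proportions are steadier). The names prop_in_zones, mu, and sigma are made up for this illustration. Notice that each score is converted to a z-score, which divides the standard deviation back out, so the choice of standard deviation should not change the proportions much.

```r
set.seed(5)

# simulate n normal scores and return the proportion within 1, 2, and 3 SDs
# (prop_in_zones is a hypothetical helper, not part of coursekata)
prop_in_zones <- function(sigma, mu = 35000, n = 100000) {
  scores <- rnorm(n, mean = mu, sd = sigma)
  z <- abs(scores - mu) / sigma
  c(zone1 = mean(z <= 1), zone2 = mean(z <= 2), zone3 = mean(z <= 3))
}

# try a tiny, a medium, and a huge standard deviation
sds <- c(10, 3500, 1e6)
results <- sapply(sds, prop_in_zones)
colnames(results) <- paste0("sd_", sds)
round(results, 3)
```

Each column should land close to .68, .95, and .997, with only small differences due to random sampling.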
The empirical rule can be very useful when trying to make a quick interpretation of a specific score. If a friend has a baby and tells you it was 54 cm long, how would you interpret that measurement? As an experienced statistician, you should ask: what is the mean, and what is the standard deviation, of the distribution of baby length at birth?
As it turns out, the mean baby length is roughly 50 cm, and the standard deviation is 2 cm. Using the empirical rule, you would say, “Wow! Your baby is like two standard deviations above the mean! That’s a huge baby! Only .05 of babies are longer than 54 cm (the mean plus two standard deviations). You’ve got yourself a big one!”
Actually, you’d be slightly wrong. (Sorry, I know we set you up!) According to the empirical rule, .95 of the scores in a normal distribution are within plus or minus two standard deviations of the mean. It follows from this that .05 of the scores are more extreme than this, or outside plus or minus two standard deviations.
But note, in the figure, that if .05 of the scores are outside plus or minus two standard deviations, half of those would be expected to be more than two standard deviations above the mean, and the other half more than two standard deviations below the mean.
So, only .025 of scores would be higher than two standard deviations above the mean. That baby is even more impressive than we thought! He or she is longer than 97.5% of all babies!
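The baby example can be worked through in a few lines of R. This is only a sketch: it uses the mean (50 cm) and standard deviation (2 cm) given above, assumes baby lengths are normally distributed, and uses variable names made up for this illustration. It compares the empirical rule's quick answer to the one pnorm() gives.

```r
mean_length <- 50   # cm, from the text
sd_length <- 2      # cm, from the text
baby <- 54          # cm

# how many standard deviations above the mean is this baby?
z <- (baby - mean_length) / sd_length

# empirical rule: .05 is outside two SDs, and half of that is in the upper tail
rule_upper <- (1 - 0.95) / 2

# normal model: proportion of babies longer than 54 cm
exact_upper <- pnorm(baby, mean = mean_length, sd = sd_length, lower.tail = FALSE)

z
rule_upper
round(exact_upper, 4)
```

The empirical rule gives .025, and the normal model gives a slightly smaller value (about .023) because .95 is itself a rounded number.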
What Counts as Unlikely?
We have seen how modeling the error distribution (in the case of the empty model, the distribution of scores around the mean) can help us to calculate probabilities and make predictions. The problem with a probability, though, is that it’s just a number. It doesn’t tell us what to do. We still have to think about it even after all our fancy R code calculations.
For example, if we wanted to use a model of finger lengths to design stretchy one-size-fits-all gloves, how big should we make the gloves? After all, even though very long thumbs are unlikely, they are still possible. But if we make these gloves too big, then we’ll alienate short-fingered folks.
What would be the right glove size? To answer questions like this, we have to figure out what the most likely lengths of people’s fingers are, and that means we need to make a judgment call about what “likely” and “unlikely” mean. We might be able to agree on the best way to estimate a probability, but people will differ on what counts as “unlikely.”
For example, someone who likes taking risks might look at a .01 probability and say, “Hey! At least it is still possible.” But someone who likes being very certain might say, “Even .40 is unlikely because it’s less likely than a coin toss!” So in being part of a statistics community, it’s helpful to have an agreement about what counts as unlikely.
Statisticians, as a community, have decided to count probabilities of .05 and lower as unlikely. So in the case of a DGP that produces a fairly normal population, we would count scores that are outside of Zone 2 (more than two standard deviations above or below the mean) as unlikely, and scores within Zone 2 as likely. Note that this decision doesn’t result from a calculation. Human statisticians just sort of agree: yeah, .05 is a pretty low likelihood.
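To see how the .05 convention turns into actual cutoffs, here is a small sketch using qnorm(), which goes the other direction from pnorm(): you give it a proportion and it returns the score. The finger length mean (60 mm) and standard deviation (9 mm) are assumed for illustration only, and is_unlikely() is a made-up helper, not part of any package.

```r
# z-scores that cut off the most extreme .05 of a normal distribution
# (.025 in each tail)
cutoffs_z <- qnorm(c(0.025, 0.975))

# ASSUMPTION: finger lengths are normal with mean 60 mm and SD 9 mm
# (illustrative values only)
finger_mean <- 60
finger_sd <- 9
likely_range <- finger_mean + cutoffs_z * finger_sd

# TRUE when a score falls in the most extreme .05 of the distribution
# (hypothetical helper; alpha is the agreed-upon "unlikely" threshold)
is_unlikely <- function(score, mu, sigma, alpha = 0.05) {
  z <- (score - mu) / sigma
  2 * pnorm(abs(z), lower.tail = FALSE) <= alpha
}

round(cutoffs_z, 2)
round(likely_range, 1)
is_unlikely(c(50, 80), finger_mean, finger_sd)
```

The cutoffs come out to about 1.96 standard deviations on either side of the mean, which is why "two standard deviations" works as a quick stand-in. Changing alpha changes the answer, and that is the point: the threshold is a judgment call, not something the model calculates for us.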