CourseKata - 3.2 Histograms

High School / Advanced Statistics and Data Science I (ABC)

3.2 Histograms

Statistics provides us with a host of tools we can use for exploring distributions. Many of these tools are visual: histograms, boxplots, scatterplots, bar graphs, and so on. Skill at using these tools to look at distributions is an important part of the statistician’s toolbox, and something you can take with you from this course!

We will start with histograms. Histograms are one of the most powerful tools we have for examining distributions. To see how they work, we will create a pretend outcome variable and save it in a vector called outcome. The code we used to create outcome is below; note it’s just 5 numbers. Let’s see how this simple distribution can be visualized with a histogram.

outcome <- c(1, 2, 3, 4, 5)

Making a Histogram from a Vector

There are lots of ways to make histograms in R. We will use the package ggformula to make our visualizations. ggformula is a weird name, but that’s what the authors of this package called it. Because of that, many of the ggformula commands are going to start with gf_; the g stands for the gg part and the f stands for the formula part. We will start by making a histogram with the gf_histogram() command.

Here’s the code to make a histogram of the vector outcome.

gf_histogram(~ outcome)

Note that the ~ (tilde) is found to the left of the 1 on most keyboards. The vector (or variable) outcome is placed after the tilde. Try it yourself in the code block below.
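
If you are curious about what the tilde does: in R, ~ outcome creates a formula object, which is how gf_histogram() is told which variable to plot. The short base R sketch below is not part of the exercise, and the name f is made up for illustration; it simply inspects the object the tilde creates.

```r
# Same pretend outcome vector used in this section
outcome <- c(1, 2, 3, 4, 5)

# The tilde builds a one-sided formula; nothing is plotted yet
f <- ~ outcome

class(f)     # the kind of object the tilde creates
all.vars(f)  # the variable name(s) mentioned in the formula
```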

require(coursekata)

# This sets up our outcome vector
outcome <- c(1, 2, 3, 4, 5)

# Write code to create a histogram of outcome

A histogram of the distribution of outcome with large gaps between the bars.

The x-axis of the histogram (labeled “outcome”) represents the range of possible values of the outcome variable (in this case, 1 to 5). The variable on the x-axis of a histogram is always quantitative, i.e., measured on a numeric scale.

The y-axis (labeled “count”) represents the frequency of a particular range of scores in a sample. In this case, there is one 1, one 2, one 3, one 4, and one 5. The height of the bars in a histogram represents the number of observations that fall within a certain interval on the outcome variable. The interval is called a bin. The boundaries of the bins are set by dividing the entire range of values into intervals of equal size.

The histogram above shows gaps between the bars because, by default, gf_histogram() tries to make 30 bins (in this case it was able to make 27). But because we only have five possible numbers in our outcome variable, many of these bins are empty. We’ve added black brackets along the bottom of the histogram below to show where the empty bins are.

A histogram of the distribution of outcome with empty bins between the five bars.
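
To see why so many bins end up empty, we can count them by hand. The sketch below uses base R's hist() with plot = FALSE instead of gf_histogram(), and it assumes 30 equal-width bins stretched exactly from 1 to 5. gf_histogram() places its bins somewhat differently, so the number of empty bins here will not match the figure exactly. The names breaks_30 and bin_counts are made up for this illustration.

```r
outcome <- c(1, 2, 3, 4, 5)

# 31 boundaries make 30 equal-width bins between 1 and 5 (an assumption)
breaks_30 <- seq(1, 5, length.out = 31)

# plot = FALSE returns the bin counts without drawing anything
bin_counts <- hist(outcome, breaks = breaks_30, plot = FALSE)$counts

sum(bin_counts > 0)   # bins that hold at least one number
sum(bin_counts == 0)  # bins that are empty
```

With only five numbers, no more than five bins can contain anything, so all the remaining bins must be empty.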

If we add some code to tell gf_histogram() to make just 5 bins (after all, we only have the 5 numbers), it will get rid of the gaps between the bars, like this:

gf_histogram(~ outcome, bins = 5)

A histogram of the distribution of outcome. The bins are right next to each other without gaps.

Try editing and running the following code.

require(coursekata)

# This is the same code as before but we added in another outcome value, 3.2
outcome <- c(1, 2, 3, 4, 5, 3.2)

# Try making a histogram with 5 bins (change bins = 30 to bins = 5)
gf_histogram(~ outcome, bins = 30)

A histogram of the distribution of outcome after we add a new number, 3.2, to our variable.

The new number (3.2) went into the bin labeled 3, which represents the interval 2.5 to 3.5. The height of that bar (which is now 2) represents the frequency of observations that fall within that interval (both the 3 and 3.2 are in that bin). Below, we’ve annotated the histogram to show which values of outcome fall into each bin.

Histogram annotated with the values of outcome: 1, 2, 3, 3.2, 4, 5
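
The bar heights can also be checked by hand. The sketch below uses base R's cut() and table() rather than gf_histogram(), and it assumes the bin boundaries sit at 0.5, 1.5, and so on up to 5.5, which is consistent with the interval 2.5 to 3.5 described above. The name bin_edges is made up for this illustration.

```r
outcome <- c(1, 2, 3, 4, 5, 3.2)

# Assumed bin boundaries: each bin is 1 wide and centered on 1, 2, 3, 4, 5
bin_edges <- seq(0.5, 5.5, by = 1)

# cut() labels each number with the bin it falls in;
# table() counts how many numbers landed in each bin
table(cut(outcome, breaks = bin_edges))
```

The bin from 2.5 to 3.5 gets a count of 2 because it holds both 3 and 3.2; every other bin gets a count of 1.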

Add the number 3.7 to our outcome values. Run the code to see what the histogram would look like then.

require(coursekata)

# add 3.7 to the outcome values, then run this code
outcome <- c(1, 2, 3, 4, 5, 3.2)

# this makes a histogram with 5 bins
gf_histogram(~ outcome, bins = 5)

A histogram of the distribution of outcome after we add a new number, 3.7, to our variable.

The new number, 3.7, was added to the bin labeled 4, which seems to go from 3.5 to 4.5.

Below, we have re-labeled the x-axis to show the boundaries of the bins instead of their centers. If you look closely at the x-axis, you’ll see that the bin previously labeled 4 actually goes from 3.5 to 4.5.

The same histogram as above, re-labeled so that the bin that 3.2 went into, previously labeled 3, is now labeled from 2.5 to 3.5.
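
Bin centers and bin boundaries are two ways of describing the same bins. As a rough sketch in base R, assuming the bins are 1 wide and centered on 1 through 5 (as the re-labeled axis suggests), we can convert centers into boundaries and look up which bin 3.7 belongs to. The names centers, width, lower, and upper are made up for this illustration.

```r
# Assumed bin centers and bin width, read off the histogram's x-axis
centers <- 1:5
width <- 1

# Each bin reaches half a width below and above its center
lower <- centers - width / 2
upper <- centers + width / 2
data.frame(center = centers, lower = lower, upper = upper)

# Find the center of the bin that holds 3.7
centers[lower < 3.7 & 3.7 <= upper]
```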

You can also adjust the binwidth, or how wide each bin is. Like bins, binwidth is passed to gf_histogram() as an argument. Here’s an example:

gf_histogram(~ outcome, binwidth = 4)

A histogram of the distribution of outcome after we adjust the binwidth. Now we have two bins instead of five. The first bin ranges from -2 to 2, and the second bin ranges from 2 to 6.

There are two columns because we told gf_histogram() to make the binwidth 4, and the numbers from 1 to 5 won’t fit into a single bin of width 4. R had no choice but to create a second bin. The first bin goes from -2 to 2 and there are only two numbers that go in that bin from our tiny set of outcomes. All the other numbers go in the bin from 2 to 6.

You may have been surprised to see the x-axis go from -2 to +6. After all, none of our numbers were negative. R did this because we put it in a difficult position. It had to include numbers as high as 5, and we required it to have a binwidth of 4. Not all of the numbers could fit within a single bin of width 4, so R had to make two bins of equal intervals. R just does its best to follow your commands!

The binwidth = 4 histogram annotated with the values of outcome.
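
The two-bin result can be reproduced by hand. This sketch again swaps in base R's cut() and table() for gf_histogram(), takes the boundaries -2, 2, and 6 from the histogram above, and assumes outcome holds just the original five numbers. The name wide_edges is made up for this illustration.

```r
outcome <- c(1, 2, 3, 4, 5)

# Bin boundaries read off the binwidth = 4 histogram
wide_edges <- c(-2, 2, 6)

# A number that sits exactly on a boundary (here, 2) is counted
# in the bin to its left, because cut() closes intervals on the right
table(cut(outcome, breaks = wide_edges))
```

The first bin holds 1 and 2, and the second bin holds 3, 4, and 5, which matches the description above of two numbers landing in the bin from -2 to 2.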
