Statistics and Data Science: A Modeling Approach
Chapter 2 - Understanding Data
2.0 Starting With a Bunch of Numbers
When statisticians talk about variation, they refer to a particular kind of variation: variation in data. But variation doesn’t start out as data. Look around; you see people, buildings, trees, light, and so on. And you see lots of variation: no two people look exactly alike, just as no two trees look exactly alike. Statisticians seek to express this variation using numbers, which is where we will start. (In a bit we will discuss where the numbers come from.)
Not all groups of numbers have variation. Take, for example, these numbers: 2, 2, 2, 2, 2, 2, 2, 2, 2. No need to use statistics in this case, because there is no variation. You can just look at the numbers and describe them in a phrase: “Nine twos.” If we said, “What number best represents this distribution of numbers?” you would, almost certainly, say, “Two.”
But take this group of numbers: 2, 1, 3, 3, 2, 3, 1, 2, 1. Now it’s not as easy to describe them—certainly not in a short phrase. And imagine if there were hundreds or thousands of numbers; the challenge would be even greater.
Seeing Patterns in Numbers
Statisticians have, over the years, invented some ideas and some procedures to help us make sense of bunches of numbers. Here’s a simple example. First, see if you can create a vector to store the numbers 2, 1, 3, 3, 2, 3, 1, 2, 1.
In the DataCamp window below, we put in the code to create a vector with nine 2s. We saved it in an R object called myvector1. Now you add the code to create a vector called myvector2 with the numbers 2, 1, 3, 3, 2, 3, 1, 2, 1. (HINT: use the c() function.) Run the code, then add some code to print out the two vectors just to make sure they ended up with the numbers you intended.
require(tidyverse)
require(mosaic)
require(supernova)

# Here's how to combine nine 2s into a vector
# You could also use rep(2, times = 9)
myvector1 <- c(2, 2, 2, 2, 2, 2, 2, 2, 2)

# Create a vector called myvector2 with the numbers
# 2, 1, 3, 3, 2, 3, 1, 2, 1
myvector2 <- c(2, 1, 3, 3, 2, 3, 1, 2, 1)

# Print both vectors to check that they hold the intended numbers
myvector1
myvector2
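The comment in the code above mentions rep() as another way to build a vector of repeated values. As a small sketch (the object names nine_twos_c and nine_twos_rep are made up for illustration), the following checks that both approaches give the same vector:

```r
# Two ways to build a vector of nine 2s (object names are illustrative)
nine_twos_c   <- c(2, 2, 2, 2, 2, 2, 2, 2, 2)
nine_twos_rep <- rep(2, times = 9)

# identical() returns TRUE when the two objects are exactly the same
identical(nine_twos_c, nine_twos_rep)

# length() reports how many numbers a vector holds
length(nine_twos_rep)
```

Typing the values out with c() is fine for nine numbers; rep() becomes more convenient as the number of repeats grows.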
Now, let’s take the numbers in myvector2 and sort them in ascending order. We can use the sort() function for this.
sort(myvector2)
[1] 1 1 1 2 2 2 3 3 3
Now look at the numbers in myvector2 after we have sorted them. Suddenly it is easier to see a pattern in the variation: there are equal numbers of 1s, 2s, and 3s. Just sorting numbers makes it easier to see a pattern!
If you understand this example, you have just mastered your first statistical technique! It may not look like much, but if you had a bigger data set (instead of nine numbers) you would quickly see the advantages of simply sorting them in order.
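To get a feel for that advantage, here is a minimal sketch with a larger, made-up vector. The object names bigvector and sorted_big, the seed, and the choice of 30 simulated values are all assumptions for illustration, not data from this chapter.

```r
# Simulate 30 values drawn from 1, 2, and 3 (made-up data for illustration)
set.seed(10)  # fixes the random draw so the result is repeatable
bigvector <- sample(1:3, size = 30, replace = TRUE)

# Unsorted, the pattern is hard to see
bigvector

# Sorted in ascending order, equal values sit next to each other
sorted_big <- sort(bigvector)
sorted_big

# sort() can also go from largest to smallest
sort(bigvector, decreasing = TRUE)
```

Comparing the printed bigvector with sorted_big shows how much easier it is to judge how many 1s, 2s, and 3s there are once the values are in order.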
We could also represent the same pattern in a frequency table using the command
tally(myvector2)
X
1 2 3
3 3 3
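For comparison, base R's table() function produces a similar count of each distinct value. This is only a sketch of the same idea, not the function the text relies on; the object name counts is made up for illustration.

```r
# The same nine numbers used in the text
myvector2 <- c(2, 1, 3, 3, 2, 3, 1, 2, 1)

# table() counts how many times each distinct value appears
counts <- table(myvector2)
counts

# The counts add up to the number of values in the vector
sum(counts) == length(myvector2)
```

The top row of the printed table lists the distinct values and the bottom row lists how often each one occurs, which is the same information a sorted vector shows in a more compact form.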
# Here is code to create the vector that we named myvector1
myvector1 <- c(2, 2, 2, 2, 2, 2, 2, 2, 2)

# Now, let's run the tally() function on myvector1
tally(myvector1)
X
2
9
Believe it or not, you’ve now learned a second statistical technique: frequency tables (implemented in R as the tally() function)! As you learn more and more about statistics, you will encounter lots and lots of techniques like this. Fundamentally, they are all variations on just a few core ideas. As you go, and as you build up your statistical power, we will help you keep it all in perspective.