High School / Advanced Statistics and Data Science I (ABC)

2.9 The Structure of Data

Data can come to us in many forms. If you collect data yourself, you may start out with numbers written on scraps of paper. Or you may get a computer file filled with numbers and words of various sorts, each representing the value of some sampled object on some variable of interest.

Regardless of how the data start out, it is necessary to organize and format data so that they are easy to analyze using statistical software. There is no one way to organize data, but there is a way that is most common (called “tidy data”), and that is what we recommend you use. Understanding tidy data will help you figure out how a lot of data sets out there in the world are organized, allowing you to collaborate with others and analyze data from a wide variety of sources!

Statistician Hadley Wickham came up with the concept of tidy data, a way of organizing data into rectangular tables, with rows and columns, according to the following principles:

  1. Each column is a variable
  2. Each row is an observation (what we have been calling a case, or an object to which a measure is attached)
  3. Each type of observation (or case) is kept in a different table (more on this below)

Rectangular tables of this sort are represented in R using a data frame. The columns are the variables; this is where the results of measures are kept. The rows are the cases sampled. Data frames provide a way to save information such as column headings (i.e., variable names) in the same table as the actual data values.
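As a minimal sketch of this idea, the code below builds a tiny data frame by hand using base R. The name fingers_small is made up for this illustration, and the three rows are only a few Sex and Thumb values from the Fingers data, not the full data set.

```r
# A tiny data frame built by hand: 3 observations (rows), 2 variables (columns).
# fingers_small is a made-up name; it is NOT the full Fingers data frame.
fingers_small <- data.frame(
  Sex   = c("male", "female", "female"),
  Thumb = c(66, 64, 56)
)

names(fingers_small)  # the variable names are stored along with the values
nrow(fingers_small)   # number of observations (rows)
ncol(fingers_small)   # number of variables (columns)
str(fingers_small)    # compact summary of the whole table
```

Notice that the column headings travel with the data: names() can report them because they are saved inside the data frame itself.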

Principle 3 above simply states that the types of observations that form the rows cannot be mixed within a single table. So, for example, you wouldn’t have rows of college students intermixed with rows of cars or countries or couples. If you have a mix of observation types (e.g., students, families, countries), they each go in a different table.

Sometimes you want to focus on a subset of your variables in a data frame. For example, you might want to look at just the variables Sex and Thumb in the Fingers data frame. The output would be easier to read if it only included a small number of variables.

We can use the select() function to look at just a subset of variables. When using select(), we first need to tell R which data frame, then which variables to select from that data frame.

select(Fingers, Sex, Thumb)

Run the code below to see what it will do.

require(coursekata)

# Run this code
select(Fingers, Sex, Thumb)

You may need to scroll the output up and down to see it all. It’s quite a lot because the function select() will print out all the values of the selected variables. What the select() function actually does is return a new data frame with the selected subset of columns.
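One way to see that select() returns a new data frame is to save its result under a name and inspect it. The sketch below assumes the dplyr package is installed (select() comes from dplyr) and uses a small hand-built data frame so that it runs on its own. The names fingers_small and thumbs are made up for this illustration, and the two rows are copied from Fingers rather than being the whole data set.

```r
library(dplyr)  # provides select()

# Two rows copied from Fingers; fingers_small is a made-up stand-in,
# not the full data frame.
fingers_small <- data.frame(
  Sex    = c("male", "female"),
  Thumb  = c(66, 66),
  Height = c(70.5, 63.5),
  Weight = c(188, 115)
)

# First argument: the data frame. Remaining arguments: the variables to keep.
thumbs <- select(fingers_small, Sex, Thumb)

class(thumbs)         # the result is itself a data frame
names(thumbs)         # only the selected variables remain
names(fingers_small)  # the original data frame is unchanged
```

Because the original data frame is left untouched, you can select different subsets of variables from it as many times as you like.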

If you want to look at just a few rows of a few variables, you can combine head() and select(), like this:

[Animation: select(Fingers, Sex, Thumb) being placed inside the parentheses of head()]

require(coursekata)

# Write the code select(Fingers, Sex, Thumb) inside of head()
# as shown in the .gif above
head(select(Fingers, Sex, Thumb))
     Sex Thumb
1   male 66.00
2 female 64.00
3 female 56.00
4   male 58.42
5 female 74.00
6 female 60.00
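Nesting one function inside another is only one way to combine them. The sketch below, which assumes the dplyr package is installed, shows three equivalent ways to write the same combination: nested, step by step, and with the %>% pipe. The data frame fingers_small is a made-up stand-in built from the six rows printed above, and the n argument of head() sets how many rows to show.

```r
library(dplyr)  # provides select() and the %>% pipe

# The six rows printed above; fingers_small is a made-up stand-in
# for the full Fingers data frame.
fingers_small <- data.frame(
  Sex   = c("male", "female", "female", "male", "female", "female"),
  Thumb = c(66, 64, 56, 58.42, 74, 60)
)

# 1. Nested: select() runs first, then head() works on its result.
nested <- head(select(fingers_small, Thumb), n = 3)

# 2. Step by step: save the intermediate data frame, then take the first rows.
thumb_only <- select(fingers_small, Thumb)
stepwise   <- head(thumb_only, n = 3)

# 3. Piped: read left to right as "take the data, then select, then head".
piped <- fingers_small %>% select(Thumb) %>% head(n = 3)

identical(nested, stepwise)  # do the first two approaches agree?
identical(nested, piped)     # does the piped version agree too?
```

All three versions work because select() returns a data frame, which is exactly what head() expects as input.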

The select() function lets us look at a subset of variables. But sometimes you might want to look at a subset of observations. Notice the first person in the Fingers data frame has a thumb that is 66 mm long. Is he the only person with a 66 mm thumb? Let’s take a look at all the students who have a thumb length of 66 mm.

select() gives you a subset of variables (or columns of the data frame). To get a subset of observations (or rows of the data frame) we use a different function: filter(). This function filters the data frame to show only those observations that match some criteria. For example, here is the code that will return only the observations where the thumb length is 66 mm:

filter(Fingers, Thumb == 66)

require(coursekata)

# Run this code
filter(Fingers, Thumb == 66)
     Sex RaceEthnic FamilyMembers SSLast Year           Job
1   male      Asian             7     NA    3   Not Working
2 female      White             4      6    2 Part-time Job
                 MathAnxious            Interest GradePredict Thumb Index
1                      Agree         No Interest          3.3    66    79
2 Neither Agree nor Disagree Somewhat Interested          3.7    66    69
  Middle Ring Pinkie Height Weight
1     84   74     57   70.5    188
2     77   72     58   63.5    115

The function filter(), like select(), returns a data frame. In this case, the data frame only has two rows because only two observations in Fingers had thumbs that were 66 mm long.
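The criteria given to filter() can be any comparison, not just an exact match. The sketch below assumes the dplyr package is installed and uses a made-up stand-in, fingers_small, holding only the first six rows of Sex and Thumb from Fingers. Because it is not the full data set, the number of matching rows differs from the output above.

```r
library(dplyr)  # provides filter()

# First six rows of Sex and Thumb from Fingers; fingers_small is a made-up
# stand-in, so row counts here differ from the full data frame.
fingers_small <- data.frame(
  Sex   = c("male", "female", "female", "male", "female", "female"),
  Thumb = c(66, 64, 56, 58.42, 74, 60)
)

# Two equals signs (==) ask "is it equal to?"; a single = would not test anything.
exact_66 <- filter(fingers_small, Thumb == 66)

# Other comparisons work the same way.
long_thumbs <- filter(fingers_small, Thumb > 60)        # strictly longer than 60 mm
females     <- filter(fingers_small, Sex == "female")   # text values go in quotes

nrow(exact_66)     # how many observations matched each criterion
nrow(long_thumbs)
nrow(females)
```

Each result is a data frame with the same columns as the original and only the matching rows.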

One challenge for students is to keep track of the difference between an observation (e.g., a student, represented by a row), a variable (e.g., Thumb or Sex, represented by a column), and the values a variable can take (e.g., 66 or male, represented in cells). It is helpful to imagine the rows and columns of a data frame when you read about observations and variables, respectively. If the data are tidy, the rows will always be observations and the columns, variables.
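To make the three ideas concrete, the sketch below pulls out one observation, one variable, and one value using base R indexing. The data frame fingers_small is a made-up stand-in holding only the first six rows of Sex and Thumb from Fingers.

```r
# First six rows of Sex and Thumb from Fingers; fingers_small is a made-up
# stand-in, not the full data frame.
fingers_small <- data.frame(
  Sex   = c("male", "female", "female", "male", "female", "female"),
  Thumb = c(66, 64, 56, 58.42, 74, 60)
)

first_student <- fingers_small[1, ]       # an observation: one whole row
thumb_lengths <- fingers_small$Thumb      # a variable: one whole column
first_thumb   <- fingers_small$Thumb[1]   # a value: a single cell

first_student
thumb_lengths
first_thumb
```

The observation still has one entry for every variable, the variable has one entry for every observation, and the value is the single cell where the two meet.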

In this course we will be providing most of the data you analyze in a tidy format. You’ve already been using this format for a bit as we explore data. But now we are making it explicit. However, the world is not always tidy. One day, in the wild world outside of this textbook, you may have to transform a non-tidy data set into a tidy one.

Loading Your Own Data Into a Jupyter Notebook: In these pages, we have pre-loaded most of the data sets we use into the code windows. But you may want to import your own data into a Jupyter notebook. One simple way to do this is to import the data from a Google Sheet. Instructions for how to do this are included in the Resources folder at the end of the book.
