1.5 Working With Data Frames in R

Now for the moment you all have been waiting for: It’s time to work with real data in R! For handling datasets, R has a special object type called a data frame. Data frames look like this:

  TableID Tip   Condition
1       1  39     Control
2       2  36     Control
3       3  34     Control
4      42  21 Smiley Face
5      43  21 Smiley Face
6      44  17 Smiley Face

These are six rows of a data frame called TipExperiment. The full data frame is from an experiment that randomly assigned tables at a restaurant to receive checks that either included smiley faces (Smiley Face) or didn’t include smiley faces (Control). Each row represents a different table from the experiment. The researchers recorded how much each table tipped as a percentage of their total check (e.g., a table may have tipped 17% of their total).

The rows in a data frame represent the cases sampled, with each row being a single case. In the TipExperiment data frame, the cases (which are sometimes called observations) are tables. Depending on the study, the rows could be people, states, couples, mice – any cases you take a sample of in order to collect data.

The columns of the data frame (labeled TableID, Tip, and Condition) represent variables, or the attributes of each case that could vary from row to row.

This dataset is organized in a “tidy” format (a term coined by statistician Hadley Wickham). It’s generally good practice to format our datasets in a tidy way (“keep things tidy”). The key aspects of a tidy dataset are:

  • Each row is an observation (or case)
  • Each column is a variable
  • Each cell contains a value for the particular observation and variable
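To make the row and column ideas concrete, here is a minimal sketch that rebuilds the six rows shown at the start of this section as a small data frame using base R's data.frame(). The name tips is made up for this illustration so it is not confused with the full TipExperiment data frame.

```r
# Rebuild the six example rows as a small data frame.
# `tips` is a hypothetical name used only for this illustration.
tips <- data.frame(
  TableID   = c(1, 2, 3, 42, 43, 44),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(c("Control", "Control", "Control",
                       "Smiley Face", "Smiley Face", "Smiley Face"))
)

nrow(tips)    # number of cases (rows): 6
ncol(tips)    # number of variables (columns): 3
names(tips)   # the variable names: TableID, Tip, Condition
```

Each row holds one table's values, and each column holds one variable, which is exactly the tidy layout described above.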

Later in the course, we’ll analyze this data to determine if there’s convincing evidence that smiley faces result in higher tips. You can read more about the data here: TipExperiment R documentation.

Peeking at a Data Frame

As with any object in R, you can just type the name of the data frame to see the whole thing.

In the code block below, type the name of the data frame TipExperiment and then Run.

require(coursekata)

# Try typing TipExperiment to see what is in the data frame.
TipExperiment

Be sure to scroll up to see the whole output. Once you do, you might think to yourself, “Wow, that’s a lot to take in!” This is usually the case when working with real data frames, which often include many rows and many columns. Here, we don’t just have data from one table—we have a bunch of tables, each with their own values for different variables.

head() and tail(). It’s often useful to take a quick peek at your data frame without printing out the whole thing. One way to do this is with the head() command.

Press the button to see what happens when you run the command head(TipExperiment).

require(coursekata)

# Run this code to get the first 6 rows of TipExperiment
head(TipExperiment)

  TableID Tip Condition
1       1  39   Control
2       2  36   Control
3       3  34   Control
4       4  34   Control
5       5  33   Control
6       6  31   Control

The head() function (or command) prints out just the first six rows of the data frame. (You can also try the tail() function, which prints the last six rows.)
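If six rows is more or fewer than you want, both head() and tail() accept an optional n argument. The sketch below uses a small stand-in data frame (the made-up name tips, holding the six example rows from the start of this section) so that it runs on its own.

```r
# `tips` is a hypothetical stand-in holding six example rows.
tips <- data.frame(
  TableID   = c(1, 2, 3, 42, 43, 44),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(c("Control", "Control", "Control",
                       "Smiley Face", "Smiley Face", "Smiley Face"))
)

first_two <- head(tips, n = 2)   # only the first two rows
last_two  <- tail(tips, n = 2)   # only the last two rows

first_two
last_two
```

The same n argument should work with the full data frame, as in head(TipExperiment, n = 10).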

str() and glimpse(). These functions show the overall structure of the data frame, including the number of observations, number of variables, names of variables and so on. We often use str() or glimpse() when first exploring a new data frame, just to see what’s in it.

Run glimpse(TipExperiment) and look at the results.

require(coursekata)

# Use glimpse() to see what's in TipExperiment
glimpse(TipExperiment)

Rows: 44
Columns: 3
$ TableID   <int> 22, 44, 21, 20, 18, 19, 42, 43, 17, 41, 16, 40, 38, 39, 15, …
$ Tip       <dbl> 8, 17, 18, 20, 21, 21, 21, 21, 22, 22, 23, 23, 24, 24, 25, 2…
$ Condition <fct> Control, Smiley Face, Control, Control, Control, Control, Sm…
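For comparison, here is a small sketch of str(), which is part of base R and reports similar information in a slightly different layout. It uses a made-up stand-in data frame called tips (the six example rows from the start of this section), so the counts and type labels it prints will differ from the full TipExperiment output above.

```r
# `tips` is a hypothetical stand-in holding six example rows.
tips <- data.frame(
  TableID   = c(1, 2, 3, 42, 43, 44),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(c("Control", "Control", "Control",
                       "Smiley Face", "Smiley Face", "Smiley Face"))
)

# Print the structure: number of observations, number of variables,
# and each variable's name, type, and first few values.
str(tips)

# Look up just the type (class) of each variable.
var_types <- sapply(tips, class)
var_types
```

Note that str() labels decimal-capable numbers as num, while glimpse() labels them as dbl; both refer to the same kind of numeric variable.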

dataframe$variable. Notice in the output above there is a $ in front of each variable name (in front of TableID, Tip, and Condition). In R, $ is often used to indicate that what follows is a variable name. If you want to specify the Tip variable in the TipExperiment data frame, for example, you would write TipExperiment$Tip. (R has its own way of categorizing variables, shown in the output above as <int>, <dbl>, and <fct>. You will learn more about these later.)

Try using the $ to tell R to look in the TipExperiment data frame to get the contents of the variable Condition.

require(coursekata)

# Use the $ sign to print out the contents of the Condition variable in the TipExperiment data frame
TipExperiment$Condition

Control     Control     Control     Control     Control     Control    
Control     Control     Control     Control     Control     Control    
Control     Control     Control     Control     Control     Control    
Control     Control     Control     Control     Smiley Face Smiley Face
Smiley Face Smiley Face Smiley Face Smiley Face Smiley Face Smiley Face
Smiley Face Smiley Face Smiley Face Smiley Face Smiley Face Smiley Face
Smiley Face Smiley Face Smiley Face Smiley Face Smiley Face Smiley Face
Smiley Face Smiley Face
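Because $ hands back a single variable as a vector, anything you already know how to do with vectors also works on the result. The sketch below assumes a small stand-in data frame named tips (a made-up name, holding the six example rows from the start of this section).

```r
# `tips` is a hypothetical stand-in holding six example rows.
tips <- data.frame(
  TableID   = c(1, 2, 3, 42, 43, 44),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(c("Control", "Control", "Control",
                       "Smiley Face", "Smiley Face", "Smiley Face"))
)

tip_values <- tips$Tip     # pull out one variable as a vector
tip_values                 # all six tip percentages
tip_values[1]              # the first tip percentage: 39
length(tip_values)         # one value per row: 6

levels(tips$Condition)     # the distinct categories of Condition
```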

Using brackets to refer to specific rows. To refer to a specific row of a data frame, you can use brackets after the name of the data frame, similar to what we did before with vectors. For example, TipExperiment[1, ] will print out the first row of the TipExperiment data frame. (Inside the brackets, the order is row, then column. Leaving the column position blank tells R to print all the columns.)
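Here is a short sketch of the row, column order inside the brackets. It uses a made-up stand-in data frame named tips (the six example rows from the start of this section) so that it can run by itself.

```r
# `tips` is a hypothetical stand-in holding six example rows.
tips <- data.frame(
  TableID   = c(1, 2, 3, 42, 43, 44),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(c("Control", "Control", "Control",
                       "Smiley Face", "Smiley Face", "Smiley Face"))
)

first_row  <- tips[1, ]        # row 1, all columns
tip_column <- tips[, 2]        # all rows, column 2 (Tip), returned as a vector
one_cell   <- tips[4, "Tip"]   # row 4 of the Tip column: 21

first_row
tip_column
one_cell
```

The comma matters: what comes before it selects rows, and what comes after it selects columns.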

Using the brackets, you can also find the rows that meet certain conditions. What do you think this code will do?

TipExperiment[TipExperiment$Condition == "Control", ]

It will print out all the rows in which the variable Condition is equal to Control. Try it out in the window below.

require(coursekata)

# Print out all the rows in which the variable Condition is equal to "Control"
TipExperiment[TipExperiment$Condition == "Control", ]

You can also combine conditions inside the brackets using and (&) or or (|). For example, if we wanted to find all the tables that tipped more than 40 percent or less than 5 percent, we could write:

TipExperiment[TipExperiment$Tip > 40 | TipExperiment$Tip < 5, ]

Note: To find the | symbol on your keyboard, look above the return key or near the bracket ([ ], { }) keys.

See if you can figure out in the window below how to print out all the rows in which the tables are both in the “Smiley Face” condition and also tipped less than 20%.

require(coursekata)

# Print out all the rows for tables that were in the "Smiley Face" condition and that also tipped less than 20 percent
TipExperiment[TipExperiment$Condition == "Smiley Face" & TipExperiment$Tip < 20, ]

We see there is only one table that fits this description.

  TableID Tip   Condition
2      44  17 Smiley Face
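It can help to see what R does behind the scenes here. A condition such as Tip < 20 produces one TRUE or FALSE per row, and the brackets keep only the rows marked TRUE. The sketch below walks through this with a made-up stand-in data frame named tips (the six example rows from the start of this section); the cutoff of 35 in the last line is chosen only to suit this small example.

```r
# `tips` is a hypothetical stand-in holding six example rows.
tips <- data.frame(
  TableID   = c(1, 2, 3, 42, 43, 44),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(c("Control", "Control", "Control",
                       "Smiley Face", "Smiley Face", "Smiley Face"))
)

is_smiley <- tips$Condition == "Smiley Face"   # one TRUE/FALSE per row
low_tip   <- tips$Tip < 20                     # one TRUE/FALSE per row

is_smiley             # FALSE FALSE FALSE  TRUE  TRUE  TRUE
low_tip               # FALSE FALSE FALSE FALSE FALSE  TRUE
is_smiley & low_tip   # TRUE only where both conditions hold

both   <- tips[is_smiley & low_tip, ]             # rows meeting both conditions
either <- tips[tips$Tip > 35 | tips$Tip < 20, ]   # rows meeting at least one

both
either
```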

tally(). It might be useful to be able to count up how many tables were in each condition (e.g., how many tables were in the “Smiley Face” condition versus the “Control” condition). We can use the tally() function to create a frequency table for a particular variable.

This line of R code will produce a frequency table for the Condition variable (in the TipExperiment data frame).

tally(TipExperiment$Condition)

Alternatively, we could also specify the variable and data frame separately like this:

tally(~ Condition, data = TipExperiment)

Both ways of writing tally() will result in a frequency table that tallies up how many tables were in each condition. (Notice this time we had to put a tilde, ~, in front of the variable name. This is required when we include data= as an argument.)

Condition
    Control Smiley Face 
         22          22

We can see from the output that the two experimental groups (smiley face and control) are balanced in size: 22 restaurant tables were assigned to each condition.
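The tally() function is not part of base R; it becomes available once coursekata is loaded. As a rough point of comparison, and not a replacement for tally(), base R's table() and xtabs() produce the same kind of frequency counts. The sketch below assumes a made-up stand-in data frame named tips (the six example rows from the start of this section), so each condition has only 3 tables rather than 22.

```r
# `tips` is a hypothetical stand-in holding six example rows.
tips <- data.frame(
  TableID   = c(1, 2, 3, 42, 43, 44),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(c("Control", "Control", "Control",
                       "Smiley Face", "Smiley Face", "Smiley Face"))
)

# Count how many rows fall in each condition using the $ form.
counts <- table(tips$Condition)
counts

# The same counts using a formula (with a tilde) and a data argument.
counts_formula <- xtabs(~ Condition, data = tips)
counts_formula
```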

arrange(). Let’s turn our attention to the outcome the researchers were interested in: Tip. What was the lowest percentage tipped by any of the tables? One way we could answer this question would be to sort the dataset by Tip, from low to high, using the arrange() function.

arrange(TipExperiment, Tip)

Importantly, when you arrange a data frame based on the values of one variable (e.g., Tip), it sorts whole rows, not just the column for that one variable. This ensures that the data for each table (TableID, Tip, and Condition) stays together as the tables are re-arranged from lowest to highest tip percentages.

If you want to save the data frame after you sort the rows into a new order, you can use the assignment operator (<-). See if you can edit the code below to save the version of TipExperiment that is arranged by Tip back into TipExperiment. Then print out the first six rows of TipExperiment using head().

require(coursekata)

# save TipExperiment, arranged by Tip, back to TipExperiment
TipExperiment <- arrange(TipExperiment, Tip)

# write code to print out the first 6 rows of TipExperiment
head(TipExperiment)

Notice that the tables are now arranged from the lowest-tipping to the highest-tipping.

  TableID Tip   Condition
1      22   8     Control
2      44  17 Smiley Face
3      21  18     Control
4      20  20     Control
5      18  21     Control
6      19  21     Control

The arrange() function can also sort in descending order if you wrap the variable name in desc(). With desc() added (as in the code below), the highest-tipping tables would be at the top.

arrange(TipExperiment, desc(Tip))
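To see why whole rows move together, it can help to look at a base R version of the same idea. This is a different technique from arrange(), shown only for illustration: order() returns the row positions in sorted order, and the brackets then rearrange the rows to match. The sketch assumes a made-up stand-in data frame named tips (the six example rows from the start of this section).

```r
# `tips` is a hypothetical stand-in holding six example rows.
tips <- data.frame(
  TableID   = c(1, 2, 3, 42, 43, 44),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(c("Control", "Control", "Control",
                       "Smiley Face", "Smiley Face", "Smiley Face"))
)

# order() gives the row positions from lowest Tip to highest Tip.
row_order <- order(tips$Tip)
row_order

# Use those positions in the row slot to sort whole rows, low to high.
sorted_low <- tips[row_order, ]
sorted_low

# Sort from high to low instead.
sorted_high <- tips[order(tips$Tip, decreasing = TRUE), ]
sorted_high
```

One visible difference in this sketch is that the sorted rows keep their original row labels, so you can see where each row started out.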
