1.5 Working With Data Frames in R
Now for the moment you all have been waiting for: It’s time to work with real data in R! For handling datasets, R has a special object type called a data frame. Data frames look like this:
TableID Tip Condition
1 1 39 Control
2 2 36 Control
3 3 34 Control
4 42 21 Smiley Face
5 43 21 Smiley Face
6 44 17 Smiley Face
These are six rows from a data frame called TipExperiment. The full data frame is from an experiment that randomly assigned tables at a restaurant to receive checks that either included smiley faces (Smiley Face) or didn’t include smiley faces (Control). Each row represents a different table from the experiment. The researchers recorded how much each table tipped as a percentage of their total check (e.g., a table may have tipped 17% of their total).
The rows in a data frame represent the cases sampled, with each row being a single case. In the TipExperiment data frame, the cases (which are sometimes called observations) are tables. Depending on the study, the rows could be people, states, couples, mice – any cases you take a sample of in order to collect data.
The columns of the data frame (labeled TableID, Tip, and Condition) represent variables, or the attributes of each case that could vary from row to row.
This dataset is organized in a “tidy” format (a term coined by statistician Hadley Wickham). It’s generally good practice to format our datasets in a tidy way (“keep things tidy”). The key aspects of a tidy dataset are:
- Each row is an observation (or case)
- Each column is a variable
- Each cell contains a value for the particular observation and variable
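As a minimal sketch of the tidy layout, the six example rows shown above can be typed in by hand with base R's data.frame() function. The name tips_small is made up for this illustration; the real TipExperiment data frame comes from the coursekata package.

```r
# Hypothetical mini data frame holding the six example rows shown above.
# Each argument to data.frame() becomes one column (one variable).
tips_small <- data.frame(
  TableID   = c(1, 2, 3, 42, 43, 44),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(rep(c("Control", "Smiley Face"), each = 3))
)

# Each printed row is one case (a restaurant table);
# each cell holds that table's value for one variable.
tips_small
```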
Later in the course, we’ll analyze this data to determine if there’s convincing evidence that smiley faces result in higher tips. You can read more about the data here: TipExperiment R documentation.
Peeking at a Data Frame
As with any object in R, you can just type the name of the data frame to see the whole thing.
In the code block below, type the name of the data frame TipExperiment and then press Run.
require(coursekata)
# Try typing TipExperiment to see what is in the data frame.
TipExperiment
Be sure to scroll up to see the whole output. Once you do, you might think to yourself, “Wow, that’s a lot to take in!” This is usually the case when working with real data frames, which often include many rows and many columns. Here, we don’t just have data from one table; we have a bunch of tables, each with its own values for different variables.
head() and tail(). It’s often useful to take a quick peek at your data frame without printing out the whole thing. One way to do this is with the head() command.
Press Run to run head(TipExperiment).
require(coursekata)
# Run this code to get the first 6 rows of TipExperiment
head(TipExperiment)
TableID Tip Condition
1 1 39 Control
2 2 36 Control
3 3 34 Control
4 4 34 Control
5 5 33 Control
6 6 31 Control
The head() function (or command) prints out just the first six rows of the data frame. (You can also try the tail() function, which prints the last six rows.)
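Six rows is only the default. As a small sketch, both head() and tail() accept a second argument giving the number of rows to show. The data frame tips_small below is a made-up stand-in built from six example rows, so that the code runs without coursekata.

```r
# Hypothetical mini data frame (a stand-in for TipExperiment)
tips_small <- data.frame(
  TableID   = c(1, 2, 3, 42, 43, 44),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(rep(c("Control", "Smiley Face"), each = 3))
)

# The second argument sets how many rows to print
first_two <- head(tips_small, 2)   # rows 1 and 2
last_two  <- tail(tips_small, 2)   # rows 5 and 6

first_two
last_two
```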
str() and glimpse(). These functions show the overall structure of the data frame, including the number of observations, number of variables, names of variables and so on. We often use str() or glimpse() when first exploring a new data frame, just to see what’s in it.
Run glimpse(TipExperiment) and look at the results.
require(coursekata)
# Use glimpse() to see what’s in TipExperiment
glimpse(TipExperiment)
Rows: 44
Columns: 3
$ TableID <int> 22, 44, 21, 20, 18, 19, 42, 43, 17, 41, 16, 40, 38, 39, 15, …
$ Tip <dbl> 8, 17, 18, 20, 21, 21, 21, 21, 22, 22, 23, 23, 24, 24, 25, 2…
$ Condition <fct> Control, Smiley Face, Control, Control, Control, Control, Sm…
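For comparison, here is a sketch of the same kind of overview using str(), which is part of base R. The functions nrow(), ncol(), and names() are not introduced in this section; they are included only to show that each piece of the summary can also be requested on its own. The data frame tips_small is a made-up stand-in for TipExperiment.

```r
# Hypothetical mini data frame (a stand-in for TipExperiment)
tips_small <- data.frame(
  TableID   = c(1, 2, 3, 42, 43, 44),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(rep(c("Control", "Smiley Face"), each = 3))
)

str(tips_small)                 # compact overview: one line per variable

n_rows    <- nrow(tips_small)   # number of rows (cases)
n_cols    <- ncol(tips_small)   # number of columns (variables)
var_names <- names(tips_small)  # the variable names

n_rows
n_cols
var_names
```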
dataframe$variable. Notice in the output above there is a $ in front of each variable name (in front of TableID, Tip, and Condition). In R, $ is often used to indicate that what follows is a variable name. If you want to specify the Tip variable in the TipExperiment data frame, for example, you would write TipExperiment$Tip. (R has its own way of categorizing variables, shown in the output above as int, dbl, and fct, which are short for integer, double, and factor. You will learn more about these later.)
Try using the $ to tell R to look in the TipExperiment data frame to get the contents of the variable Condition.
require(coursekata)
# Use the $ sign to print out the contents of the Condition variable in the TipExperiment data frame
TipExperiment$Condition
Control Control Control Control Control Control
Control Control Control Control Control Control
Control Control Control Control Control Control
Control Control Control Control Smiley Face Smiley Face
Smiley Face Smiley Face Smiley Face Smiley Face Smiley Face Smiley Face
Smiley Face Smiley Face Smiley Face Smiley Face Smiley Face Smiley Face
Smiley Face Smiley Face Smiley Face Smiley Face Smiley Face Smiley Face
Smiley Face Smiley Face
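What $ hands back is an ordinary vector, so anything we did earlier with vectors also works on a single column. Here is a short sketch using tips_small, a made-up stand-in for TipExperiment:

```r
# Hypothetical mini data frame (a stand-in for TipExperiment)
tips_small <- data.frame(
  TableID   = c(1, 2, 3, 42, 43, 44),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(rep(c("Control", "Smiley Face"), each = 3))
)

tips <- tips_small$Tip   # pull out one column as a vector
tips                     # the six tip percentages
tips[1]                  # the first value in that vector
length(tips)             # one value per row of the data frame
```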
Using brackets to refer to specific rows. To refer to a specific row of a data frame, you can use brackets after the name of the data frame, similar to what we did before with vectors. For example, TipExperiment[1, ] will print out the first row of the TipExperiment data frame. (Inside the brackets, the order is [row, column]. Leaving the column value blank prints all the columns.)
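To make the [row, column] order concrete, here is a small sketch on tips_small, a made-up stand-in for TipExperiment. A blank on either side of the comma means "all of them".

```r
# Hypothetical mini data frame (a stand-in for TipExperiment)
tips_small <- data.frame(
  TableID   = c(1, 2, 3, 42, 43, 44),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(rep(c("Control", "Smiley Face"), each = 3))
)

row_one  <- tips_small[1, ]       # row 1, all columns
one_cell <- tips_small[1, 2]      # row 1, column 2 (the Tip column)
tip_col  <- tips_small[, "Tip"]   # all rows of the column named Tip

row_one
one_cell
tip_col
```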
Using the brackets, you can also find the rows that meet certain conditions. What do you think this code will do?
TipExperiment[TipExperiment$Condition == "Control", ]
It will print out all the rows in which the variable Condition is equal to "Control". Try it out in the window below.
require(coursekata)
# Print out all the rows in which the variable Condition is equal to "Control"
TipExperiment[TipExperiment$Condition == "Control", ]
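It can help to see what the condition inside the brackets produces by itself. The sketch below uses tips_small, a made-up stand-in for TipExperiment. The comparison gives one TRUE or FALSE per row, and the brackets keep only the rows marked TRUE.

```r
# Hypothetical mini data frame (a stand-in for TipExperiment)
tips_small <- data.frame(
  TableID   = c(1, 2, 3, 42, 43, 44),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(rep(c("Control", "Smiley Face"), each = 3))
)

# The comparison alone: one TRUE/FALSE answer per row
is_control <- tips_small$Condition == "Control"
is_control   # TRUE TRUE TRUE FALSE FALSE FALSE

# Put that vector in the row position to keep only the TRUE rows
control_rows <- tips_small[is_control, ]
control_rows
```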
You can also combine conditions inside the brackets using and (&) or or (|). For example, if we wanted to find all the tables that tipped more than 40 percent or less than 5 percent, we could write:
TipExperiment[TipExperiment$Tip > 40 | TipExperiment$Tip < 5, ]
Note: To find the | symbol on your keyboard, look above the return key or near the bracket ([ ], { }) keys.
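The difference between & and | is easiest to see side by side. In this sketch, tips_small is a made-up stand-in for TipExperiment, and the cutoff of 35 is chosen only for illustration.

```r
# Hypothetical mini data frame (a stand-in for TipExperiment)
tips_small <- data.frame(
  TableID   = c(1, 2, 3, 42, 43, 44),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(rep(c("Control", "Smiley Face"), each = 3))
)

high_tip   <- tips_small$Tip > 35                 # TRUE for tables 1 and 2
is_control <- tips_small$Condition == "Control"   # TRUE for tables 1, 2, and 3

# & keeps a row only when BOTH conditions are TRUE
both   <- tips_small[high_tip & is_control, ]

# | keeps a row when AT LEAST ONE condition is TRUE
either <- tips_small[high_tip | is_control, ]

both
either
```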
See if you can figure out in the window below how to print out all the rows in which the tables are both in the “Smiley Face” condition and also tipped less than 20%.
require(coursekata)
# Print out all the rows for tables that were in the "Smiley Face" condition and that also tipped less than 20 percent
TipExperiment[TipExperiment$Condition == "Smiley Face" & TipExperiment$Tip < 20, ]
We see there is only one table that fits this description.
TableID Tip Condition
2 44 17 Smiley Face
tally(). It might be useful to be able to count up how many tables were in each condition (e.g., how many tables were in the “Smiley Face” condition versus the “Control” condition). We can use the tally() function to create a frequency table for a particular variable.
This line of R code will produce a frequency table for the Condition variable (in the TipExperiment data frame).
tally(TipExperiment$Condition)
Alternatively, we could also specify the variable and data frame separately like this:
tally(~ Condition, data = TipExperiment)
Both ways of writing tally() will result in a frequency table that tallies up how many tables were in each condition. (Notice this time we had to put a tilde, ~, in front of the variable name. This is required when we include data= as an argument.)
Condition
Control Smiley Face
22 22
We can see from the output that the two experimental groups (smiley face and control) are balanced in size: 22 restaurant tables were assigned to each condition.
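The tally() function used here is provided through the packages that coursekata loads. As a rough sketch of the same idea without any packages, base R's table() function also counts how many times each value appears. It is not used elsewhere in this section and appears here only for comparison. The data frame tips_small is a made-up stand-in for TipExperiment.

```r
# Hypothetical mini data frame (a stand-in for TipExperiment)
tips_small <- data.frame(
  TableID   = c(1, 2, 3, 42, 43, 44),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(rep(c("Control", "Smiley Face"), each = 3))
)

# Count how many rows fall in each condition
counts <- table(tips_small$Condition)
counts
```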
arrange(). Let’s turn our attention to the outcome the researchers were interested in: Tip. What was the lowest percentage tipped by any of the tables? One way we could answer this question would be to sort the dataset by Tip, from low to high, using the arrange() function.
arrange(TipExperiment, Tip)
Importantly, when you arrange a data frame based on the values of one variable (e.g., Tip), it sorts whole rows, not just the column for that one variable. This ensures that the data for each table (TableID, Tip, and Condition) stays together as the tables are re-arranged from lowest to highest tip percentages.
If you want to save the data frame after you sort the rows into a new order, you can use the assignment operator (<-). See if you can edit the code below to save the version of TipExperiment that is arranged by Tip back into TipExperiment. Then print out the first six lines of TipExperiment using head().
require(coursekata)
# Starter code: this sorts TipExperiment by Tip but does not save the result
arrange(TipExperiment, Tip)
# Solution: save TipExperiment, arranged by Tip, back to TipExperiment
TipExperiment <- arrange(TipExperiment, Tip)
# Print out the first 6 rows of TipExperiment
head(TipExperiment)
Notice that the tables are now arranged from the lowest-tipping to the highest-tipping.
TableID Tip Condition
1 22 8 Control
2 44 17 Smiley Face
3 21 18 Control
4 20 20 Control
5 18 21 Control
6 19 21 Control
The function arrange() can also be used to arrange values in descending order by adding desc() around the variable name. If we added the function desc() (as in the code below), the highest tipping tables would be at the top.
arrange(TipExperiment, desc(Tip))
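Here is a small sketch that pulls these ideas together on tips_small, a made-up stand-in for TipExperiment. It assumes the dplyr package is installed, since that is where arrange() and desc() are defined. The second call adds another sorting variable to break ties, which goes a bit beyond what this section covers.

```r
library(dplyr)   # assumed to be installed; provides arrange() and desc()

# Hypothetical mini data frame (a stand-in for TipExperiment)
tips_small <- data.frame(
  TableID   = c(1, 2, 3, 42, 43, 44),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(rep(c("Control", "Smiley Face"), each = 3))
)

# Sort by Tip, low to high; each table's whole row moves together
by_tip <- arrange(tips_small, Tip)
by_tip

# Tables 42 and 43 both tipped 21; a second variable decides their order
tie_break <- arrange(tips_small, Tip, desc(TableID))
tie_break

# The original is unchanged because the sorted result was not saved back to it
tips_small
```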