Course Outline
-
segmentGetting Started (Don't Skip This Part)
-
segmentStatistics and Data Science: A Modeling Approach
-
segmentPART I: EXPLORING VARIATION
-
segmentChapter 1 - Welcome to Statistics: A Modeling Approach
-
segmentChapter 2 - Understanding Data
-
2.3 A Data Frame Example: MindsetMatters
-
segmentChapter 3 - Examining Distributions
-
segmentChapter 4 - Explaining Variation
-
segmentPART II: MODELING VARIATION
-
segmentChapter 5 - A Simple Model
-
segmentChapter 6 - Quantifying Error
-
segmentChapter 7 - Adding an Explanatory Variable to the Model
-
segmentChapter 8 - Digging Deeper into Group Models
-
segmentChapter 9 - Models with a Quantitative Explanatory Variable
-
segmentPART III: EVALUATING MODELS
-
segmentChapter 10 - The Logic of Inference
-
segmentChapter 11 - Model Comparison with F
-
segmentChapter 12 - Parameter Estimation and Confidence Intervals
-
segmentChapter 13 - What You Have Learned
-
segmentFinishing Up (Don't Skip This Part!)
-
segmentResources
list High School / Advanced Statistics and Data Science I (ABC)
2.3 A Data Frame Example: MindsetMatters
The data we looked at on the previous page were selected from a data frame called MindsetMatters
. The full data frame is from a study that investigated the health of 75 female housekeepers from different hotels. You can read more about how these data were collected and organized here: [MindsetMatters R documentation].
A data frame is a kind of object in R, and as with any object, you can just type the name of it to see the whole thing.
Type the name of the data frame MindsetMatters
and then Run.
require(coursekata)
MindsetMatters <- Lock5withR::MindsetMatters %>%
mutate(Condition = factor(Cond, levels = c(1, 0), labels = c("Informed", "Uninformed")))
# Try typing MindsetMatters to see what is in the data frame.
MindsetMatters
ex() %>%
check_output_expr("MindsetMatters")
Be sure to scroll up to see the whole output. Once you do, you might think to yourself, “Wow, that’s a lot to take in!” This is usually the case when working with real data—there are a whole lot of things in a data set, including a lot of variables and values. And usually we don’t just sample one case (e.g., one housekeeper)—we have a bunch of housekeepers, each with their own values for a bunch of variables. So things get pretty complicated, pretty fast.
It’s always useful to take a quick peek at your data frame. But looking at the whole thing might be a little complicated. So a helpful command is head()
which shows you just the first few rows of a data frame.
Press the <Run> button to see what happens when you run the command head(MindsetMatters)
.
require(coursekata)
MindsetMatters <- Lock5withR::MindsetMatters %>%
mutate(Condition = factor(Cond, levels = c(1, 0), labels = c("Informed", "Uninformed")))
# Run this code to get the first 6 rows of MindsetMatters
head(MindsetMatters)
# Run this code to get the first 6 rows of MindsetMatters
head(MindsetMatters)
ex() %>%
check_function("head") %>%
check_result() %>%
check_equal()
Cond Age Wt Wt2 BMI BMI2 Fat Fat2 WHR WHR2 Syst Syst2 Diast Diast2 Condition
1 0 43 137 137.4 25.1 25.1 31.9 32.8 0.79 0.79 124 118 70 731 Uninformed
2 0 42 150 147.0 29.3 28.7 35.5 NA 0.81 0.81 119 112 80 682 Uninformed
3 0 41 124 124.8 26.9 27.0 35.1 NA 0.84 0.84 108 107 59 653 Uninformed
4 0 40 173 171.4 32.8 32.4 41.9 42.4 1.00 1.00 116 126 71 794 Uninformed
5 0 33 163 160.2 37.9 37.2 41.7 NA 0.86 0.84 113 114 73 784 Uninformed
6 0 24 90 91.8 16.5 16.8 NA NA 0.73 0.73 NA NA 78 764 Uninformed
The head()
function just prints out the first six rows of the data frame as rows and columns.
Sometimes, it’s useful just to get an overview of what’s in the data frame. The function str()
shows us the overall structure of the data frame, including number of observations, number of variables, names of variables and so on. (We often use str()
when first exploring a new data frame, just to see what’s in it.)
Run str()
on MindsetMatters
and look at the results.
require(coursekata)
MindsetMatters <- Lock5withR::MindsetMatters %>%
mutate(Condition = factor(Cond, levels = c(1, 0), labels = c("Informed", "Uninformed")))
# Run this code to see the structure of MindsetMatters
str(MindsetMatters)
str(MindsetMatters)
ex() %>%
check_function("str") %>%
check_result() %>%
check_equal()
'data.frame': 75 obs. of 15 variables:
$ Cond : int 0 0 0 0 0 0 0 0 0 0 ...
$ Age : int 43 42 41 40 33 24 46 21 29 19 ...
$ Wt : int 137 150 124 173 163 90 150 156 141 123 ...
$ Wt2 : num 137 147 125 171 160 ...
$ BMI : num 25.1 29.3 26.9 32.8 37.9 16.5 27.5 25.9 27.5 19.6 ...
$ BMI2 : num 25.1 28.7 27 32.4 37.2 16.8 27.4 25.7 27.4 19.7 ...
$ Fat : num 31.9 35.5 35.1 41.9 41.7 NA 36.1 36.4 NA 26.6 ...
$ Fat2 : num 32.8 NA NA 42.4 NA NA 37.3 NA NA NA ...
$ WHR : num 0.79 0.81 0.84 1 0.86 0.73 0.9 0.78 0.87 0.69 ...
$ WHR2 : num 0.79 0.81 0.84 1 0.84 0.73 0.9 0.78 0.85 0.69 ...
$ Syst : int 124 119 108 116 113 NA 119 116 110 113 ...
$ Syst2 : int 118 112 107 126 114 NA 115 135 115 117 ...
$ Diast : int 70 80 59 71 73 78 75 67 73 75 ...
$ Diast2 : int 73 68 65 79 78 76 77 65 74 72 ...
$ Condition: Factor w/ 2 levels "Informed","Uninformed": 2 2 2 2 2 2 2 2 2 2 ...
Note that there is a $
in front of each variable name. In R, $
is often used to indicate that what follows is a variable name. If you want to specify the Age
variable in the MindsetMatters
data frame, for example, you would write MindsetMatters$Age
. (R has its own way of categorizing variables, such as int, num, and Factor. You will learn more about these later.)
Try using the $
to print out just the variable Age
from MindsetMatters
.
require(coursekata)
MindsetMatters <- Lock5withR::MindsetMatters %>%
mutate(Condition = factor(Cond, levels = c(1, 0), labels = c("Informed", "Uninformed")))
# Use the $ sign to print out the contents of the Age variable in the MindsetMatters data frame
# Use the $ sign to print out the contents of the Age variable in the MindsetMatters data frame
MindsetMatters$Age
ex() %>%
check_output_expr("MindsetMatters$Age", missing_msg = "Have you used $ to select the Age variable in MindsetMatters?")
That’s a lot of numbers! If you want a more organized list, you can sometimes get that by using the print()
function, like this:
print(MindsetMatters$Age)
You can try adding the print()
function in the window above. When you do you get something like this:
[1] 43 42 41 40 33 24 46 21 29 19 41 33 44 48 38 42 38 46 45 35 30 38 41 54 65
[26] 58 29 45 57 61 38 53 45 62 48 50 40 32 54 24 24 52 34 28 31 29 31 34 26 37
[51] 28 44 26 29 47 27 42 39 27 NA 27 48 39 55 26 29 27 33 29 33 31 24 22 23 38
When R is asked to print out a single variable (such as Age
), R prints out each person’s value on the variable all in a row. When it gets to the end of one row it begins again on the next row. In contrast, when R is asked to print out multiple variables, it uses the rows and columns format, where rows are cases and columns are variables.
If you counted the ages printed on the first row, there are 25 of them. The [26]
indicates that the next row starts with the 26th observation.
Let’s use the tally()
function to create a frequency table of the Age
variable (in the MindsetMatters
data frame). This will tell us how many housekeepers there were of each age.
tally(MindsetMatters$Age)
We don’t have to use the $
notation. We could also specify the variable and data frame separately, like this:
tally(~ Age, data = MindsetMatters)
Age
19 21 22 23 24 26 27 28 29 30 31 32 33 34 35 37
1 1 1 1 4 3 4 2 6 1 3 1 4 2 1 1
38 39 40 41 42 43 44 45 46 47 48 50 52 53 54 55
5 2 2 3 3 1 2 3 2 1 3 1 1 1 2 1
57 58 61 62 65 <NA>
1 1 1 1 1 1
The rows that start with 19, 38, and 57 represent the ages of the housekeepers and the numbers underneath them represent how many of each age are in the data frame. For example, there is one housekeeper who is 19 years old. There are two housekeepers who are 54 years old. There are three housekeepers who are 45 years old.
Try using the tally function to make a frequency table of housekeepers by Condition
.
require(coursekata)
MindsetMatters <- Lock5withR::MindsetMatters %>%
mutate(Condition = factor(Cond, levels = c(1, 0), labels = c("Informed", "Uninformed")))
# Use tally() with the MindsetMatters data frame to create a frequency table of housekeepers by Condition
# Use tally() with the MindsetMatters data frame to create a frequency table of housekeepers by Condition
tally(~Condition, data = MindsetMatters)
# Another solution
# tally(MindsetMatters$Condition)
ex() %>%
check_function("tally") %>%
check_result() %>%
check_equal()
The output of tally()
shows us that there are 41 housekeepers who were in the Informed condition and 34 in the Uninformed condition. Taking a look at this frequency table, we might wonder why there were slightly more housekeepers who were informed that their daily work of cleaning was equivalent to getting adequate exercise.
Let’s turn our attention to two variables in the MindsetMatters
data frame: Age
(the age of the housekeepers, in years, at the start of the study) and Wt
(their weight, in pounds, at the start of the study).
We might want to sort the whole data frame MindsetMatters
by Age
. But now we can’t use the sort()
function—that only works with vectors, not with data frames. So, if you want to sort a whole data frame, we will use a different function, arrange()
.
The arrange()
function works similarly to sort()
, except now you have to specify both the name of the data frame, and the name of the variable on which you want to sort.
arrange(MindsetMatters, Age)
And, importantly, when you sort on one variable (e.g., Age
), the order of the rows (which in this case is housekeepers) will change for every variable.
The printout of MindsetMatters
won’t stay arranged by age because we didn’t save our work. In order to save the new ordering, we need to assign the arranged version to an R object. We could assign it to a new object (e.g., Mindset2
, MM2
, or any other name you want to make up), or we could just assign it to the existing object (MindsetMatters
). If we assign it to the existing object it will revise what’s in MindsetMatters
to be in the new order.
Let’s use the assignment operator (<-
) to assign it back to MindsetMatters
. See if you can edit the code below to save the version of MindsetMatters
which is arranged by Age
back into MindsetMatters
. Then print out the first six lines of MindsetMatters
using head()
.
require(coursekata)
MindsetMatters <- Lock5withR::MindsetMatters %>%
mutate(Condition = factor(Cond, levels = c(1, 0), labels = c("Informed", "Uninformed")))
# save MindsetMatters, arranged by Age, back to MindsetMatters
arrange(MindsetMatters, Age)
# write code to print out the first 6 rows of MindsetMatters
# save MindsetMatters, arranged by Age, back to MindsetMatters
MindsetMatters <- arrange(MindsetMatters, Age)
# write code to print out the first 6 rows of MindsetMatters
head(MindsetMatters)
no_save <- "Make sure to both `arrange()` `MindsetMatters` by `Age` *and* save the arranged data frame back to `MindsetMatters`."
ex() %>% {
check_object(., "MindsetMatters") %>% check_equal(incorrect_msg = no_save)
check_function(., "arrange") %>% check_arg("...") %>% check_equal()
check_function(., "head") %>% check_result() %>% check_equal()
}
Cond Age Wt Wt2 BMI BMI2 Fat Fat2 WHR WHR2 Syst Syst2 Diast Diast2 Condition
1 0 19 123 124.2 19.6 19.7 26.6 NA 0.69 0.69 113 117 75 72 Uninformed
2 0 21 156 154.4 25.9 25.7 36.4 NA 0.78 0.78 116 135 67 65 Uninformed
3 1 22 127 124.6 25.6 25.2 34.6 31.6 0.74 0.73 110 103 65 69 Informed
4 1 23 161 161.4 26.8 26.9 38.1 37.1 0.90 0.86 126 101 74 64 Informed
5 0 24 90 91.8 16.5 16.8 NA NA 0.73 0.73 NA NA 78 76 Uninformed
6 1 24 166 169.0 28.5 29.0 41.3 41.1 0.88 0.90 114 123 56 55 Informed
The function arrange()
can also be used to arrange values in descending order by adding desc()
around our variable name.
arrange(MindsetMatters, desc(Age))
Try arranging MindsetMatters
by Wt
in descending order. Save this to MindsetMatters
. Print a few rows of MindsetMatters
to check out what happened.
require(coursekata)
MindsetMatters <- Lock5withR::MindsetMatters %>%
mutate(Condition = factor(Cond, levels = c(1, 0), labels = c("Informed", "Uninformed")))
# arrange MindsetMatters by Wt in descending order
MindsetMatters <-
# write code to print out a few rows of MindsetMatters
# arrange MindsetMatters by Wt in descending order
MindsetMatters <- arrange(MindsetMatters, desc(Wt))
# write code to print out a few rows of MindsetMatters
head(MindsetMatters)
no_save <- "Did you save the arranged data set back to `MindsetMatters`?"
ex() %>% {
check_function(., "desc") %>%
check_arg("x") %>%
check_equal(eval = FALSE)
check_function(., "arrange") %>%
check_arg(".data") %>%
check_equal(eval = FALSE)
check_object(., "MindsetMatters") %>%
check_equal(incorrect_msg = no_save)
check_function(., "head") %>%
check_arg("x") %>%
check_equal()
}
Cond Age Wt Wt2 BMI BMI2 Fat Fat2 WHR WHR2 Syst Syst2 Diast Diast2 Condition
1 1 34 196 198.2 33.7 33.5 45.7 44.7 0.83 0.81 164 83 73 57 Informed
2 1 39 189 183.2 34.6 34.4 47.0 46.7 0.80 0.77 185 154 99 102 Informed
3 0 65 187 186.2 34.2 34.1 47.3 NA 0.89 NA 176 188 106 83 Uninformed
4 1 29 184 182.8 35.9 35.7 44.4 45.0 0.89 0.89 120 124 75 70 Informed
5 0 38 183 186.4 34.6 35.2 44.2 42.8 NA NA 115 125 70 72 Uninformed
6 0 45 182 180.0 33.8 33.5 NA 45.6 0.85 0.88 145 141 96 84 Uninformed