list

Statistics and Data Science: A Modeling Approach

2.2 A Data Frame Example: MindsetMatters

The data we looked at on the previous page were selected from a data frame called MindsetMatters. The full data frame is from a study that investigated the health of 75 female housekeepers from different hotels. You can read more about how these data were collected and organized here: [MindsetMatters R documentation].

A data frame is a kind of object in R, and as with any object, you can just type the name of it to see the whole thing.

Type the name of the data frame MindsetMatters and then Run.

require(tidyverse) require(mosaic) require(supernova) MindsetMatters <- Lock5Data::MindsetMatters %>% mutate(Condition = factor(Cond, levels = c(1, 0), labels = c("Informed", "Uninformed"))) # Try typing MindsetMatters to see what is in the data frame. MindsetMatters ex() %>% check_output_expr("MindsetMatters")
Just type MindsetMatters then click Run
DataCamp: ch2-3

Be sure to scroll up on the R console to see the whole output. Once you do, you might think to yourself, “Wow, that’s a lot to take in!” This is usually the case when working with real data—there are a whole lot of things in a data set, including a lot of variables and values. And usually we don’t just sample one case (e.g., one housekeeper)—we have a bunch of housekeepers, each with their own values for a bunch of variables. So things get pretty complicated, pretty fast.

It’s always useful to take a quick peek at your data frame. But looking at the whole thing might be a little complicated. So a helpful command is head() which shows you just the first few rows of a data frame.

Press the Run button to see what happens when you run the command head(MindsetMatters).

require(tidyverse) require(mosaic) require(supernova) MindsetMatters <- Lock5Data::MindsetMatters %>% mutate(Condition = factor(Cond, levels = c(1, 0), labels = c("Informed", "Uninformed"))) # Run this code to get the first 6 rows of MindsetMatters head(MindsetMatters) head(MindsetMatters) ex() %>% check_function("head") %>% check_result() %>% check_equal()
Just click Submit
DataCamp: ch2-4

  Cond Age  Wt   Wt2  BMI BMI2  Fat Fat2  WHR WHR2 Syst Syst2 Diast  Diast2  Condition
1    0  43 137 137.4 25.1 25.1 31.9 32.8 0.79 0.79  124   118    70     731 Uninformed
2    0  42 150 147.0 29.3 28.7 35.5   NA 0.81 0.81  119   112    80     682 Uninformed
3    0  41 124 124.8 26.9 27.0 35.1   NA 0.84 0.84  108   107    59     653 Uninformed
4    0  40 173 171.4 32.8 32.4 41.9 42.4 1.00 1.00  116   126    71     794 Uninformed
5    0  33 163 160.2 37.9 37.2 41.7   NA 0.86 0.84  113   114    73     784 Uninformed
6    0  24  90  91.8 16.5 16.8   NA   NA 0.73 0.73   NA    NA    78     764 Uninformed

The head() function just prints out the first six rows of the data frame as rows and columns.

Depending on your browser, your output might not look exactly like the one pictured above. The output might be easier to read if you adjust the vertical divider to make the console wider. However, with so many variables, they still won’t all fit in one row. When the row numbers start again, the output continues with the columns that didn’t fit. Your output might look more like this:

  Cond Age  Wt   Wt2  BMI BMI2  Fat Fat2  WHR WHR2 Syst Syst2 Diast Diast2
1    0  43 137 137.4 25.1 25.1 31.9 32.8 0.79 0.79  124   118    70     73
2    0  42 150 147.0 29.3 28.7 35.5   NA 0.81 0.81  119   112    80     68
3    0  41 124 124.8 26.9 27.0 35.1   NA 0.84 0.84  108   107    59     65
4    0  40 173 171.4 32.8 32.4 41.9 42.4 1.00 1.00  116   126    71     79
5    0  33 163 160.2 37.9 37.2 41.7   NA 0.86 0.84  113   114    73     78
6    0  24  90  91.8 16.5 16.8   NA   NA 0.73 0.73   NA    NA    78     76
   Condition
1 Uninformed
2 Uninformed
3 Uninformed
4 Uninformed
5 Uninformed
6 Uninformed

In the output (pictured above), Condition didn’t fit in on the same row. Let’s try to read the data for the first housekeeper (in row 1). Her age (Age) was 43, starting weight (Wt) was 137 pounds, starting BMI (BMI) was 25.1, and BMI at the end of the study (BMI2) was 25.1. She was also in the uninformed condition.

Sometimes, it’s useful just to get an overview of what’s in the data frame. The function str() shows us the overall structure of the data frame, including number of observations, number of variables, names of variables and so on. (We often use str() when first exploring a new data frame, just to see what’s in it.)

Run str() on MindsetMatters and look at the results.

require(tidyverse) require(mosaic) require(supernova) MindsetMatters <- Lock5Data::MindsetMatters %>% mutate(Condition = factor(Cond, levels = c(1, 0), labels = c("Informed", "Uninformed"))) # Run this code to see the structure of MindsetMatters str(MindsetMatters) str(MindsetMatters) ex() %>% check_function("str") %>% check_result() %>% check_equal()
Just click Run
DataCamp: ch2-5

'data.frame':   75 obs. of  15 variables:
 $ Cond     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Age      : int  43 42 41 40 33 24 46 21 29 19 ...
 $ Wt       : int  137 150 124 173 163 90 150 156 141 123 ...
 $ Wt2      : num  137 147 125 171 160 ...
 $ BMI      : num  25.1 29.3 26.9 32.8 37.9 16.5 27.5 25.9 27.5 19.6 ...
 $ BMI2     : num  25.1 28.7 27 32.4 37.2 16.8 27.4 25.7 27.4 19.7 ...
 $ Fat      : num  31.9 35.5 35.1 41.9 41.7 NA 36.1 36.4 NA 26.6 ...
 $ Fat2     : num  32.8 NA NA 42.4 NA NA 37.3 NA NA NA ...
 $ WHR      : num  0.79 0.81 0.84 1 0.86 0.73 0.9 0.78 0.87 0.69 ...
 $ WHR2     : num  0.79 0.81 0.84 1 0.84 0.73 0.9 0.78 0.85 0.69 ...
 $ Syst     : int  124 119 108 116 113 NA 119 116 110 113 ...
 $ Syst2    : int  118 112 107 126 114 NA 115 135 115 117 ...
 $ Diast    : int  70 80 59 71 73 78 75 67 73 75 ...
 $ Diast2   : int  73 68 65 79 78 76 77 65 74 72 ...
 $ Condition: Factor w/ 2 levels "Informed","Uninformed": 2 2 2 2 2 2 2 2 2 2 ...

Note that there is a $ in front of each variable name. In R, $ is often used to indicate that what follows is a variable name. If you want to specify the Age variable in the MindsetMatters data frame, for example, you would write MindsetMatters$Age. (R has its own way of categorizing variables, such as int, num, and Factor. You will learn more about these later.)

Try using the $ to print out just the variable Age from MindsetMatters.

require(tidyverse) require(mosaic) require(supernova) MindsetMatters <- Lock5Data::MindsetMatters # Use the $ sign to print out the contents of the Age variable in the MindsetMatters data frame MindsetMatters$Age ex() %>% check_output_expr("MindsetMatters$Age", missing_msg = "Have you used $ to select the Age variable in MindsetMatters?")
Use $, Age, and MindsetMatters
DataCamp: ch2-27

 [1] 43 42 41 40 33 24 46 21 29 19 41 33 44 48 38 42 38 46 45 35 30 38 41 54 65
[26] 58 29 45 57 61 38 53 45 62 48 50 40 32 54 24 24 52 34 28 31 29 31 34 26 37
[51] 28 44 26 29 47 27 42 39 27 NA 27 48 39 55 26 29 27 33 29 33 31 24 22 23 38

When R is asked to print out a single variable (such as Age), R prints out each person’s value on the variable all in a row. When it gets to the end of one row it begins again on the next row. In contrast, when R is asked to print out multiple variables, it uses the rows and columns format, where rows are cases and columns are variables.

If you counted the ages printed on the first row, there are 25 of them. The [26] indicates that the next row starts with the 26th observation.

Let’s use the tally() function to create a frequency table of the Age variable (in the MindsetMatters data frame). This will tell us how many housekeepers there were of each age.

tally(MindsetMatters$Age)

We don’t have to use the $ notation. We could also specify the variable and data frame separately, like this:

tally(~ Age, data = MindsetMatters)
Age
  19   21   22   23   24   26   27   28   29   30   31   32   33   34   35   37 
   1    1    1    1    4    3    4    2    6    1    3    1    4    2    1    1 
  38   39   40   41   42   43   44   45   46   47   48   50   52   53   54   55 
   5    2    2    3    3    1    2    3    2    1    3    1    1    1    2    1 
  57   58   61   62   65 <NA> 
   1    1    1    1    1    1

The rows that start with 19, 38, and 57 represent the ages of the housekeepers and the numbers underneath them represent how many of each age are in the data frame. For example, there is one housekeeper who is 19 years old. There are two housekeepers who are 54 years old. There are three housekeepers who are 45 years old.

Try using the tally function to make a frequency table of housekeepers by Condition.

require(tidyverse) require(mosaic) require(supernova) MindsetMatters <- Lock5Data::MindsetMatters %>% mutate(Condition = factor(Cond, levels = c(1, 0), labels = c("Informed", "Uninformed"))) # Use tally() with the MindsetMatters data frame to create a frequency table of housekeepers by Condition tally(~Condition, data=MindsetMatters) # Another solution # tally(MindsetMatters$Condition) ex() %>% check_function("tally") %>% check_result() %>% check_equal()
Use tally(), ~Condition, and data = MindsetMatters
DataCamp: ch2-6

The output of tally() shows us that there are 41 housekeepers who were in the Informed condition and 34 in the Uninformed condition. Taking a look at this frequency table, we might wonder why there were slightly more housekeepers who were informed that their daily work of cleaning was equivalent to getting adequate exercise.

Let’s turn our attention to two variables in the MindsetMatters data frame: Age (the age of the housekeepers, in years, at the start of the study) and Wt (their weight, in pounds, at the start of the study).

We might want to sort the whole data frame MindsetMatters by Age. But now we can’t use the sort() function—that only works with vectors, not with data frames. So, if you want to sort a whole data frame, we will use a different function, arrange().

The arrange() function works similarly to sort(), except now you have to specify both the name of the data frame, and the name of the variable on which you want to sort.

arrange(MindsetMatters, Age)

And, importantly, when you sort on one variable (e.g., Age), the order of the rows (which in this case is housekeepers) will change for every variable.

The printout of MindsetMatters won’t stay arranged by age because we didn’t save our work. In order to save the new ordering, we need to assign the arranged version to an R object. We could assign it to a new object (e.g., Mindset2, MM2, or any other name you want to make up), or we could just assign it to the existing object (MindsetMatters). If we assign it to the existing object it will revise what’s in MindsetMatters to be in the new order.

Let’s use the assignment operator (<-) to assign it back to MindsetMatters. See if you can edit the code below to save the version of MindsetMatters which is arranged by Age back into MindsetMatters. Then print out the first six lines of MindsetMatters using head().

require(tidyverse) require(mosaic) require(supernova) MindsetMatters <- Lock5Data::MindsetMatters %>% mutate(Condition = factor(Cond, levels = c(1, 0), labels = c("Informed", "Uninformed"))) # save MindsetMatters, arranged by Age, back to MindsetMatters arrange(MindsetMatters, Age) # write code to print out the first 6 rows of MindsetMatters MindsetMatters <- arrange(MindsetMatters, Age) head(MindsetMatters) no_save <- "Make sure to both `arrange()` `MindsetMatters` by `Age` *and* save the arranged data frame back to `MindsetMatters`." ex() %>% { check_object(., "MindsetMatters") %>% check_equal(incorrect_msg = no_save) check_function(., "arrange") %>% check_arg("...") %>% check_equal() check_function(., "head") %>% check_result() %>% check_equal() }
Save the arranged data set back to MindsetMatters. Then print the first six rows of MindsetMatters.
DataCamp: ch2-7

  Cond Age  Wt   Wt2  BMI BMI2  Fat Fat2  WHR WHR2 Syst Syst2 Diast Diast2  Condition
1    0  19 123 124.2 19.6 19.7 26.6   NA 0.69 0.69  113   117    75     72 Uninformed
2    0  21 156 154.4 25.9 25.7 36.4   NA 0.78 0.78  116   135    67     65 Uninformed
3    1  22 127 124.6 25.6 25.2 34.6 31.6 0.74 0.73  110   103    65     69   Informed
4    1  23 161 161.4 26.8 26.9 38.1 37.1 0.90 0.86  126   101    74     64   Informed
5    0  24  90  91.8 16.5 16.8   NA   NA 0.73 0.73   NA    NA    78     76 Uninformed
6    1  24 166 169.0 28.5 29.0 41.3 41.1 0.88 0.90  114   123    56     55   Informed

The function arrange() can also be used to arrange values in descending order by adding desc() around our variable name.

arrange(MindsetMatters, desc(Age))

Try arranging MindsetMatters by Wt in descending order. Save this to MindsetMatters. Print a few rows of MindsetMatters to check out what happened.

require(tidyverse) require(mosaic) require(supernova) MindsetMatters <- Lock5Data::MindsetMatters %>% mutate(Condition = factor(Cond, levels = c(1, 0), labels = c("Informed", "Uninformed"))) # arrange MindsetMatters by Wt in descending order MindsetMatters <- # write code to print out a few rows of MindsetMatters MindsetMatters <- arrange(MindsetMatters, desc(Wt)) head(MindsetMatters) no_save <- "Did you save the arranged data set back to `MindsetMatters`?" no_print <- "Did you print the first few rows of the data set?" ex() %>% { check_function(., "desc") %>% check_arg("x") %>% check_equal(eval = FALSE) check_function(., "arrange") %>% check_arg(".data") %>% check_equal(eval = FALSE) check_object(., "MindsetMatters") %>% check_equal(incorrect_msg = no_save) check_output_expr(., "head(MindsetMatters)", missing_msg = no_print) }
Did you arrange by desc(Wt)?
DataCamp: ch2-8

  Cond Age  Wt   Wt2  BMI BMI2  Fat Fat2  WHR WHR2 Syst Syst2 Diast Diast2  Condition
1    1  34 196 198.2 33.7 33.5 45.7 44.7 0.83 0.81  164    83    73     57   Informed
2    1  39 189 183.2 34.6 34.4 47.0 46.7 0.80 0.77  185   154    99    102   Informed
3    0  65 187 186.2 34.2 34.1 47.3   NA 0.89   NA  176   188   106     83 Uninformed
4    1  29 184 182.8 35.9 35.7 44.4 45.0 0.89 0.89  120   124    75     70   Informed
5    0  38 183 186.4 34.6 35.2 44.2 42.8   NA   NA  115   125    70     72 Uninformed
6    0  45 182 180.0 33.8 33.5   NA 45.6 0.85 0.88  145   141    96     84 Uninformed