2.3 A Data Frame Example: MindsetMatters

The data we looked at on the previous page were selected from a data frame called MindsetMatters. The full data frame is from a study that investigated the health of 75 female housekeepers from different hotels. You can read more about how these data were collected and organized here: [MindsetMatters R documentation].

A data frame is a kind of object in R, and as with any object, you can just type the name of it to see the whole thing.

Type the name of the data frame MindsetMatters and then Run.

require(coursekata)

# Setup: load the data and add a labeled Condition factor
MindsetMatters <- Lock5withR::MindsetMatters %>%
  mutate(Condition = factor(Cond, levels = c(1, 0), labels = c("Informed", "Uninformed")))

# Try typing MindsetMatters to see what is in the data frame.
MindsetMatters

You may need to scroll up to see the whole output. Once you do, you might think to yourself, “Wow, that’s a lot to take in!” This is usually the case when working with real data—there are a whole lot of things in a data set, including a lot of variables and values. And usually we don’t just sample one case (e.g., one housekeeper)—we have a bunch of housekeepers, each with their own values for a bunch of variables. So things get pretty complicated, pretty fast.

It’s always useful to take a quick peek at your data frame, but looking at the whole thing can be overwhelming. A helpful command is head(), which shows you just the first few rows of a data frame.

Press the <Run> button to see what happens when you run the command head(MindsetMatters).

# Run this code to get the first 6 rows of MindsetMatters
head(MindsetMatters)
  Cond Age  Wt   Wt2  BMI BMI2  Fat Fat2  WHR WHR2 Syst Syst2 Diast Diast2  Condition
1    0  43 137 137.4 25.1 25.1 31.9 32.8 0.79 0.79  124   118    70     73 Uninformed
2    0  42 150 147.0 29.3 28.7 35.5   NA 0.81 0.81  119   112    80     68 Uninformed
3    0  41 124 124.8 26.9 27.0 35.1   NA 0.84 0.84  108   107    59     65 Uninformed
4    0  40 173 171.4 32.8 32.4 41.9 42.4 1.00 1.00  116   126    71     79 Uninformed
5    0  33 163 160.2 37.9 37.2 41.7   NA 0.86 0.84  113   114    73     78 Uninformed
6    0  24  90  91.8 16.5 16.8   NA   NA 0.73 0.73   NA    NA    78     76 Uninformed

The head() function just prints out the first six rows of the data frame as rows and columns.
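
As a small illustration of how head() behaves, the sketch below builds a stand-in data frame from the Age, Wt, and BMI values of the first ten housekeepers in MindsetMatters. The name mm_first10 is made up for this example and is not part of the course materials. The code shows that head() returns six rows by default and that its optional n argument changes how many rows come back.

```r
# Hypothetical stand-in: three MindsetMatters variables for the first ten cases
mm_first10 <- data.frame(
  Age = c(43, 42, 41, 40, 33, 24, 46, 21, 29, 19),
  Wt  = c(137, 150, 124, 173, 163, 90, 150, 156, 141, 123),
  BMI = c(25.1, 29.3, 26.9, 32.8, 37.9, 16.5, 27.5, 25.9, 27.5, 19.6)
)

# By default, head() returns the first six rows
first_six <- head(mm_first10)
print(first_six)

# The n argument asks for a different number of rows
first_three <- head(mm_first10, n = 3)
print(first_three)
```

If you ever want a slightly longer or shorter peek at a data frame, changing n is usually enough.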

Sometimes, it’s useful just to get an overview of what’s in the data frame. The function str() shows us the overall structure of the data frame, including number of observations, number of variables, names of variables and so on. (We often use str() when first exploring a new data frame, just to see what’s in it.)

Run str() on MindsetMatters and look at the results.

# Run this code to see the structure of MindsetMatters
str(MindsetMatters)
'data.frame':  75 obs. of  15 variables:
 $ Cond     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Age      : int  43 42 41 40 33 24 46 21 29 19 ...
 $ Wt       : int  137 150 124 173 163 90 150 156 141 123 ...
 $ Wt2      : num  137 147 125 171 160 ...
 $ BMI      : num  25.1 29.3 26.9 32.8 37.9 16.5 27.5 25.9 27.5 19.6 ...
 $ BMI2     : num  25.1 28.7 27 32.4 37.2 16.8 27.4 25.7 27.4 19.7 ...
 $ Fat      : num  31.9 35.5 35.1 41.9 41.7 NA 36.1 36.4 NA 26.6 ...
 $ Fat2     : num  32.8 NA NA 42.4 NA NA 37.3 NA NA NA ...
 $ WHR      : num  0.79 0.81 0.84 1 0.86 0.73 0.9 0.78 0.87 0.69 ...
 $ WHR2     : num  0.79 0.81 0.84 1 0.84 0.73 0.9 0.78 0.85 0.69 ...
 $ Syst     : int  124 119 108 116 113 NA 119 116 110 113 ...
 $ Syst2    : int  118 112 107 126 114 NA 115 135 115 117 ...
 $ Diast    : int  70 80 59 71 73 78 75 67 73 75 ...
 $ Diast2   : int  73 68 65 79 78 76 77 65 74 72 ...
 $ Condition: Factor w/ 2 levels "Informed","Uninformed": 2 2 2 2 2 2 2 2 2 2 ...

Note that there is a $ in front of each variable name. In R, $ is often used to indicate that what follows is a variable name. If you want to specify the Age variable in the MindsetMatters data frame, for example, you would write MindsetMatters$Age. (R has its own way of categorizing variables, such as int, num, and Factor. You will learn more about these later.)
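
To make the $ notation concrete, here is a minimal sketch that uses a made-up stand-in data frame (mm_small is a hypothetical name) holding the Wt and BMI values of the first six housekeepers. Writing the data frame name, then $, then the variable name pulls out that one variable as a simple sequence of values, one per row.

```r
# Hypothetical stand-in: two MindsetMatters variables for the first six cases
mm_small <- data.frame(
  Wt  = c(137, 150, 124, 173, 163, 90),
  BMI = c(25.1, 29.3, 26.9, 32.8, 37.9, 16.5)
)

# Select a single variable with $
weights <- mm_small$Wt
print(weights)

# The selected variable has one value for each row of the data frame
length(weights) == nrow(mm_small)
```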

Try using the $ to print out just the variable Age from MindsetMatters.

# Use the $ sign to print out the contents of the Age variable in the MindsetMatters data frame
MindsetMatters$Age

That’s a lot of numbers! If you want a more organized list, you can sometimes get that by using the print() function, like this:

print(MindsetMatters$Age)

You can try adding the print() function in the window above. When you do, you get something like this:

 [1] 43 42 41 40 33 24 46 21 29 19 41 33 44 48 38 42 38 46 45 35 30 38 41 54 65
[26] 58 29 45 57 61 38 53 45 62 48 50 40 32 54 24 24 52 34 28 31 29 31 34 26 37
[51] 28 44 26 29 47 27 42 39 27 NA 27 48 39 55 26 29 27 33 29 33 31 24 22 23 38

When R is asked to print out a single variable (such as Age), R prints out each person’s value on the variable all in a row. When it gets to the end of one row it begins again on the next row. In contrast, when R is asked to print out multiple variables, it uses the rows and columns format, where rows are cases and columns are variables.

If you count the ages printed on the first row, you will find 25 of them. The [26] indicates that the next row starts with the 26th observation, and the [51] indicates that the last row starts with the 51st.
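
One way to check what the bracketed numbers mean is to store the 75 ages printed above in a vector and ask R for the value at a specific position. The sketch below does that. The name ages and the use of square brackets to pick out one position are additions for illustration only; they are not introduced in this section.

```r
# The 75 Age values, copied from the printed output above
ages <- c(
  43, 42, 41, 40, 33, 24, 46, 21, 29, 19, 41, 33, 44, 48, 38, 42, 38, 46, 45, 35, 30, 38, 41, 54, 65,
  58, 29, 45, 57, 61, 38, 53, 45, 62, 48, 50, 40, 32, 54, 24, 24, 52, 34, 28, 31, 29, 31, 34, 26, 37,
  28, 44, 26, 29, 47, 27, 42, 39, 27, NA, 27, 48, 39, 55, 26, 29, 27, 33, 29, 33, 31, 24, 22, 23, 38
)

# How many observations are there in total?
length(ages)

# The value at position 26, which is where the row labeled [26] begins
ages[26]

# The value at position 51, which is where the row labeled [51] begins
ages[51]
```

Where the rows break depends on how wide your console is, so on your own screen the labels may be something other than [26] and [51]. The labels always give the position of the first value on that row.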
