2.4 Measurement

Measurement is the process of turning variation in the world into data. When we measure, we assign numbers or category labels to some sample of cases in order to represent some attribute or dimension along which the cases vary.

Let’s make this more concrete by looking at some more measurements, in a data set called Fingers. A sample of college students filled in an online survey in which they were asked a variety of basic demographic questions. They also were asked to measure the length of each finger on their right hand.

require(coursekata)

# Setup used for the examples on this page: convert factor columns to their
# numeric codes, sort by Sex (descending), and set two values in the first row
Fingers <- Fingers %>%
  mutate_if(is.factor, as.numeric) %>%
  arrange(desc(Sex)) %>%
  {.[1, "FamilyMembers"] <- 2; . } %>%
  {.[1, "Height"] <- 62; . }

# A way to look at a data frame is to type its name
# Look at the data frame called Fingers
Fingers

You’ll notice that trying to look at the whole data frame can be very cumbersome, especially for larger data sets.

# Remember the head() command?
# Use it to look at the first six rows of Fingers
head(Fingers)
  Sex RaceEthnic FamilyMembers SSLast Year Job MathAnxious Interest GradePredict Thumb Index Middle Ring Pinkie Height Weight
1   2          3             2     NA    3   1           4        1          3.3 66.00  79.0   84.0 74.0   57.0     62    188
2   2          3             4      9    2   2           5        3          4.0 58.42  76.2   91.4 76.2   63.5     70    145
3   2          3             2      3    2   2           2        3          4.0 70.00  80.0   90.0 70.0   65.0     69    175
4   2          1             5      7    2   1           1        3          3.7 59.00  83.0   87.0 79.0   64.0     72    155
5   2          5             2      9    3   1           5        3          4.0 64.00  76.0   89.0 76.0   69.0     70    180
6   2          3             7   7037    3   1           5        2          3.3 67.00  83.0   95.0 86.0   75.0     71    145

The command head() shows you the first six rows of a data frame. If you want to look at a different number of rows, you can add that number as a second argument, like this.

# Try it and see what happens
head(Fingers, 3)
  Sex RaceEthnic FamilyMembers SSLast Year Job MathAnxious Interest GradePredict Thumb Index Middle Ring Pinkie Height Weight
1   2          3             2     NA    3   1           4        1          3.3 66.00  79.0   84.0 74.0   57.0     62    188
2   2          3             4      9    2   2           5        3          4.0 58.42  76.2   91.4 76.2   63.5     70    145
3   2          3             2      3    2   2           2        3          4.0 70.00  80.0   90.0 70.0   65.0     69    175

Notice that to interpret these numbers, you need to know something about how they were measured. You need to know: Was Height measured in inches? What number represents which Sex? Does FamilyMembers include the person answering the question? (Sex can be a controversial variable,¹ but in the case of the Fingers data set, students answered this question by selecting one of two categories.)

We will be talking a lot about what measurements mean throughout the class. But before we go on, let’s learn one more way to take a quick look at a data frame.

# Try using tail() to look at the last 6 rows of the Fingers data frame.
tail(Fingers)
    Sex RaceEthnic FamilyMembers SSLast Year Job MathAnxious Interest GradePredict Thumb Index Middle Ring Pinkie Height Weight
152   1          4             7      6    3   1           5        2          3.0    59    69     79   72     56   67.5    193
153   1          4             7      3    3   1           5        2          3.0    50    71     78   75     57   65.5    145
154   1          4             8   2354    2   2           3        2          2.7    64    70     76   70     51   59.0    114
155   1          4             3    789    1   1           4        2          2.7    50    70     85   74     55   64.0    165
156   1          3             8      0    3   2           4        2          3.7    57    67     73   65     55   63.0    125
157   1          1             6     NA    2   1           5        3          3.3    56    69     76   72     60   72.0    133
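
Like head(), tail() accepts a second argument giving the number of rows to show. The following is a minimal sketch on a small made-up data frame (toy and its values are hypothetical, not part of the Fingers data), so it runs without any packages:

```r
# A small made-up data frame (hypothetical values, not from Fingers)
toy <- data.frame(
  id    = 1:10,
  score = c(5, 8, 6, 9, 7, 4, 8, 6, 7, 5)
)

first_three <- head(toy, 3)  # first 3 rows
last_two    <- tail(toy, 2)  # last 2 rows

print(first_three)
print(last_two)

# nrow() reports how many rows each result has
nrow(first_three)
nrow(last_two)
```

Note that tail() keeps the original row numbers, which is why the tail(Fingers) output above is labeled 152 through 157.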

Levels of Measurement: Quantitative and Categorical Variables

Measures can be divided into two types, often referred to as “levels of measurement”: quantitative and categorical.

FamilyMembers and Height (which in this case was measured in inches) are examples of quantitative variables. The values assigned to quantitative variables represent some quantity (e.g., inches for height). We know that someone with a higher number (say, 62) is taller than someone with a lower number (say, 60). Moreover, the difference between the numbers tells us exactly how much taller one person is than another (here, 2 inches).

Categorical variables are quite different. Sex in this data set is a categorical variable. Students categorized themselves as male, female, or other. For purposes of analysis we might code each person in the following way: 1 if they are female; 2 if male; or 3 if other. The specific numbers we assign are arbitrary; we could have said other is 1, female is 2, and male is 3. The numbers don’t tell us anything about quantity; the numbers simply tell us which category the object belongs to.

While we use the terms quantitative and categorical, other writers will use other terms. They all mean roughly the same thing so you may not want to get hung up on these particular terms. Here are a few synonyms for quantitative variable and categorical variable that you may run across:

Quantitative Variable     Categorical Variable
Numeric (num) variable    Nominal variable
Continuous variable       Qualitative variable
Scale variable            Factor


Quantitative and Categorical Variables in R

Quantitative variables are always represented as numeric (or num) variables in R. Categorical variables can be either numeric or character (chr) variables in R, depending on what values they hold. If we were to code the variable Sex, for example, as 1 or 2 (for female and male), we could put the values in a numeric variable in R. If, on the other hand, we wanted to enter the values “female” or “male” into the variable Sex, R would represent it as a character variable. No matter what kind of variable we use in R, from the researcher’s point of view, the variable itself is still categorical.

R won’t necessarily know whether a variable is quantitative or categorical. A number could be used by a researcher to code a categorical variable (e.g., 1 for females and 2 for males), or it could represent units of some real quantitative measurement (1 sibling or 2 siblings). R will usually try to guess what kind of variable it is, but it may guess wrong!

For that reason, R has a way to let you specify whether a variable is categorical, using the factor() command. A factor variable, in R, is always categorical. In the Fingers data frame, Sex is coded as 1 or 2. In order for R to know that it is categorical, we can tell it by using the command factor(Fingers$Sex). Remember, we also have to save the result of the command back into the Fingers data frame if we want R to remember it. We use the following code to turn Sex into a factor, and then replace the old version of the variable, which was numeric, with the new version, a factor:

Fingers$Sex <- factor(Fingers$Sex)
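
factor() can also attach readable labels to numeric codes through its levels and labels arguments. Here is a minimal sketch on a made-up vector; sex_codes is hypothetical, and the coding 1 = female, 2 = male is an assumption borrowed from the exercise later in this section:

```r
# Hypothetical numeric codes: 1 = female, 2 = male (assumed coding)
sex_codes <- c(2, 1, 1, 2, 1)

# Tell R which codes exist and what each one means
sex_factor <- factor(sex_codes, levels = c(1, 2), labels = c("female", "male"))

class(sex_codes)    # "numeric"
class(sex_factor)   # "factor"
levels(sex_factor)  # "female" "male"
table(sex_factor)   # counts per category
```

With labels attached, output such as table() shows category names instead of bare numbers, which makes it harder to mistake the codes for quantities.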

We can also turn a factor back into a numeric variable by using the as.numeric() function.
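
One caution, sketched here on made-up data: as.numeric() applied to a factor returns the factor's internal level codes (1, 2, 3, and so on), not its labels. For a variable coded 1 and 2 the two coincide, but when the labels are other numbers, converting through as.character() first recovers the original values. The vector ratings below is hypothetical.

```r
# Hypothetical factor whose labels are the numbers 10, 20, 30
ratings <- factor(c(10, 30, 20, 30))

levels(ratings)                    # "10" "20" "30"
as.numeric(ratings)                # level codes: 1 3 2 3
as.numeric(as.character(ratings))  # original values: 10 30 20 30
```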

If the 1s and 2s in the Sex column were numbers, we could add them up using the code sum(Fingers$Sex). But if we tell R that Sex is a factor, it will assume the 1s and 2s refer to categories, and so it won’t be willing to add them up.
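
A minimal sketch of this behavior, using a made-up vector rather than Fingers: sum() works on the numeric version but stops with an error on the factor version. tryCatch() is used here only so that the error message is captured and the script keeps running.

```r
# Hypothetical codes, stored once as numbers and once as a factor
codes_numeric <- c(1, 2, 2, 1, 1)
codes_factor  <- factor(codes_numeric)

sum(codes_numeric)  # 7

# sum() on a factor raises an error; capture its message instead of stopping
result <- tryCatch(
  sum(codes_factor),
  error = function(e) conditionMessage(e)
)
print(result)
```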

Add the sum() function to find the sum of Sex when females are coded as 1s and males are coded as 2s:

# this turns Sex into a numeric variable:
Fingers$Sex <- as.numeric(Fingers$Sex)

# write code to sum up the values of Sex
sum(Fingers$Sex)

Even though R summed these values, we shouldn’t be totaling them, because the 1s and 2s represent categories. The total, 202, is uninterpretable.

Depending on your goals, you may decide to treat a variable with numbers as both a quantitative and a categorical variable. If this is the case, it’s a good idea to make two copies of the variable, one numeric and one factor.

For example, Likert scales (those questions that ask you to rate something on a 5- or 7-point scale) could be treated as quantitative variables in some situations, and categorical in other situations. In the Fingers data frame we have a variable called Interest, a rating by students of how interested they are in statistics. It is coded on a 3-point scale from 1 (no interest) to 3 (very interested).

If you want to ask what the average rating is, you would need the variable to be numeric in R. But if you want to compare the group of people who gave a 1 rating with those who gave a 3, you want R to know that you consider Interest to be a factor.
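
The two uses can be sketched side by side. This is an illustration on a made-up vector of ratings (interest and interest_factor are hypothetical names, not columns of Fingers):

```r
# Hypothetical ratings on a 3-point scale (not the real Interest column)
interest <- c(1, 3, 3, 2, 3, 1)
interest_factor <- factor(interest)

# Numeric copy: arithmetic such as the average works
mean(interest)

# Factor copy: counts per category
table(interest_factor)

# Factor copy: pick out the people who gave a 1 or a 3
ones_and_threes <- interest_factor %in% c("1", "3")
sum(ones_and_threes)  # number of people in those two groups
```

Keeping both copies means neither use requires converting back and forth.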

# Interest has been coded numerically in the Fingers data.frame
# Modify the following to convert it to factor and store it as InterestFactor in Fingers
Fingers$InterestFactor <- factor(Fingers$Interest)

If you made this new variable correctly, you won’t see anything appear in the R console. That’s because simply creating a new variable doesn’t cause R to print out anything. Sometimes while you are coding, you’ll feel like you did something wrong because nothing gets printed. It might just be that you didn’t tell R to print anything.

The str() command tells you the type of each variable in a data frame. In the code you just wrote, you told R to make a new factor variable, Fingers$InterestFactor, based on the numeric variable, Fingers$Interest. If you wanted to check whether you were successful, you could type str(Fingers) in the code window you were just working in.

The output shows that the Fingers data frame now includes a new variable, Fingers$InterestFactor, and also confirms that this new variable is a factor variable.

str(Fingers)

'data.frame':  157 obs. of  17 variables:
 $ Sex            : num  2 2 2 2 2 2 2 2 2 2 ...
 $ RaceEthnic     : num  3 3 3 1 5 3 1 4 3 3 ...
 $ FamilyMembers  : num  2 4 2 5 2 7 4 3 7 5 ...
 $ SSLast         : num  NA 9 3 7 9 ...
 $ Year           : num  3 2 2 2 3 3 3 3 1 3 ...
 $ Job            : num  1 2 2 1 1 1 2 2 1 2 ...
 $ MathAnxious    : num  4 5 2 1 5 5 2 1 4 2 ...
 $ Interest       : num  1 3 3 3 3 2 2 3 2 1 ...
 $ GradePredict   : num  3.3 4 4 3.7 4 3.3 4 4 3 3.7 ...
 $ Thumb          : num  66 58.4 70 59 64 ...
 $ Index          : num  79 76.2 80 83 76 83 70 75 74 63 ...
 $ Middle         : num  84 91.4 90 87 89 95 76 83 83 70 ...
 $ Ring           : num  74 76.2 70 79 76 86 72 78 79 65 ...
 $ Pinkie         : num  57 63.5 65 64 69 75 55 60 64 56 ...
 $ Height         : num  62 70 69 72 70 71 67.5 69 68.5 65 ...
 $ Weight         : num  188 145 175 155 180 145 130 180 193 138 ...
 $ InterestFactor : Factor w/ 3 levels "1","2","3": 1 3 3 3 3 2 2 3 2 1 ...

Notice how the two variables have a different structure in the data frame. Interest is marked as a variable made of numbers (num). But InterestFactor is now marked as Factor w/ 3 levels "1","2","3". The levels represent the different values (or categories) of this categorical variable.
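
To check a single column rather than the whole data frame, class() and levels() report the same information that str() does. A minimal sketch, assuming a small made-up data frame (df and its values are hypothetical):

```r
# Hypothetical data frame with a numeric rating and a factor copy of it
df <- data.frame(Interest = c(1, 3, 3, 2))
df$InterestFactor <- factor(df$Interest)

class(df$Interest)          # "numeric"
class(df$InterestFactor)    # "factor"
levels(df$InterestFactor)   # "1" "2" "3"
nlevels(df$InterestFactor)  # 3

# str() summarizes every column at once
str(df)
```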


  1. Many people use sex and gender interchangeably, but they are distinct concepts. Sex is a classification based on biological characteristics, including DNA and anatomy. Gender refers to the socially constructed roles, behaviors, expressions, and identities of girls, women, boys, men, and gender diverse people. There is some evidence to suggest that neither sex nor gender is made up of binary categories; both may instead be expressed on a spectrum. Many people’s bodies possess a combination of physical characteristics typically thought of as biologically “male” or “female.” It has been estimated that babies with intersex traits may account for as many as 2% of live births (Blackless et al., 2000). However, sex is often measured as a binary, categorical variable in the publicly available data frames included in this textbook. This may change as researchers develop new methods of measuring sex.
