2.6 Quantitative and Categorical Variables
Measures can be divided into two types, often referred to as “levels of measurement”: quantitative and categorical.
FamilyMembers and Height (which in this case was measured in inches) are examples of quantitative variables. The values assigned to quantitative variables represent some quantity (e.g., inches for height). We know that someone with a higher number (say, 62) is taller than someone with a lower number (say, 60). Moreover, the difference between the numbers tells us exactly how much taller one person is than the other.
Categorical variables are quite different. Gender in this dataset is a categorical variable. Students categorized themselves as male, female, or other. For purposes of analysis we might code each person in the following way: 1 if they are female; 2 if male; or 3 if other. The specific numbers we assign are arbitrary; we could have said other is 1, female is 2, and male is 3. The numbers don’t tell us anything about quantity; the numbers simply tell us which category the object belongs to.
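To see why the codes are arbitrary, here is a minimal sketch in R. The responses and the names coding_a and coding_b are made up for illustration and are not part of the dataset. Two different numbering schemes give different numbers but the same category counts.

```r
# Hypothetical responses, invented for illustration
responses <- c("female", "male", "other", "female", "male", "female")

# Two arbitrary coding schemes for the same three categories
coding_a <- c(female = 1, male = 2, other = 3)
coding_b <- c(other = 1, female = 2, male = 3)

# Look up each response's code under each scheme
codes_a <- unname(coding_a[responses])
codes_b <- unname(coding_b[responses])

# The numbers differ between the two schemes...
codes_a
codes_b

# ...but the count of people in each category is the same either way
counts_a <- table(factor(codes_a, levels = unname(coding_a), labels = names(coding_a)))
counts_b <- table(factor(codes_b, levels = unname(coding_b), labels = names(coding_b)))
counts_a
counts_b
```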
While we use the terms quantitative and categorical, other writers will use other terms. They all mean roughly the same thing so you may not want to get hung up on these particular terms. Here are a few synonyms for quantitative variable and categorical variable that you may run across:
Quantitative Variable | Categorical Variable |
---|---|
Numeric (num) variable | Nominal variable |
Continuous variable | Qualitative variable |
Scale variable | Factor |
Quantitative and Categorical Variables in R
Quantitative variables are always represented as numeric (or num) variables in R. Categorical variables could be either numeric or character (chr) variables in R, depending on what values they hold. If we were to code the variable Gender, for example, as 1 or 2 (for female and male), we could put the values in a numeric variable in R. If, on the other hand, we wanted to enter the values “male” or “female” into the variable Gender, R would represent it as a character variable. No matter what kind of variable we use in R, from the researcher’s point of view, the variable itself is still categorical.
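As a small illustration, the sketch below stores the same categorical information two ways and asks R what type each one is. The vectors gender_codes and gender_labels are hypothetical and invented for this example.

```r
# Hypothetical values, invented for illustration (1 = female, 2 = male)
gender_codes  <- c(1, 2, 2, 1)                          # categories coded as numbers
gender_labels <- c("female", "male", "male", "female")  # categories stored as text

# R reports the storage type, not whether the variable is categorical
class(gender_codes)
class(gender_labels)
```

Both vectors hold the same categorical variable; only the way R stores them differs.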
R won’t necessarily know whether a variable is quantitative or categorical. A number could be used by a researcher to code a categorical variable (e.g., 1 for females and 2 for males), or it could represent units of some real quantitative measurement (1 sibling or 2 siblings). R will usually try to guess what kind of variable it is, but it may guess wrong!
For that reason, R has a way to let you specify whether a variable is categorical, using the factor() command. A factor variable, in R, is always categorical. In the Fingers data frame, Gender is coded as 1 or 2. In order for R to know that it is categorical, we can tell it by using the command factor(Fingers$Gender). Remember, we also have to save the result of the command back into the Fingers data frame if we want R to remember it. We use the following code to turn Gender into a factor, and then replace the old version of the variable, which was numeric, with the new version, a factor:
Fingers$Gender <- factor(Fingers$Gender)
We can also turn a factor back into a numeric variable by using the as.numeric() function.
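One thing worth knowing about this conversion: as.numeric() applied to a factor returns the position of each level, which matches the original numbers only when the codes happen to be 1, 2, 3, and so on. The sketch below uses a hypothetical vector called ratings (not from the Fingers data) to show the difference.

```r
# Hypothetical values, invented for illustration
ratings <- c(10, 20, 10, 30)
ratings_factor <- factor(ratings)

# as.numeric() returns the position of each level, not the original numbers
as.numeric(ratings_factor)

# Converting to character first recovers the original values
as.numeric(as.character(ratings_factor))
```

For a variable coded as 1s and 2s, such as Gender, the two approaches give the same result.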
If the 1s and 2s in the Gender column were numbers, we could add them up using the code sum(Fingers$Gender). But if we tell R that Gender is a factor, it will assume the 1s and 2s refer to categories, and so it won’t be willing to add them up.
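Here is a minimal sketch of that behavior, using a short hypothetical vector rather than the Fingers data. It wraps the second sum() call in tryCatch() so that the error is captured as a message instead of stopping the script.

```r
# Hypothetical codes, invented for illustration (1 = female, 2 = male)
gender_codes <- c(1, 2, 2, 1, 2)

# As a numeric variable, R will happily add the codes up
sum(gender_codes)

# As a factor, sum() stops with an error; tryCatch() captures
# the error message so the rest of the script keeps running
gender_factor <- factor(gender_codes)
result <- tryCatch(
  sum(gender_factor),
  error = function(e) conditionMessage(e)
)
result
```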
Add the sum() function to find the sum of Gender when females are coded as 1s and males are coded as 2s:
require(coursekata)
# this turns Gender into a numeric variable:
Fingers$Gender <- as.numeric(Fingers$Gender)
# write code to sum up the values of Gender
sum(Fingers$Gender)
Even though R summed up these values, we shouldn’t be totaling them, because the 1s and 2s represent categories. The total, 202, is not easy to interpret.
Depending on your goals, you may decide to treat a variable with numbers as both a quantitative and a categorical variable. If this is the case, it’s a good idea to make two copies of the variable, one numeric and one factor.
For example, Likert scales (those questions that ask you to rate something on a 5- or 7-point scale) could be treated as quantitative variables in some situations, and categorical in other situations. In the Fingers data frame we have a variable called Interest, a rating by students of how interested they are in statistics. It is coded on a 3-point scale from 1 (no interest) to 3 (very interested).
If you want to ask what the average rating is, you would need the variable to be numeric in R. But if you want to compare the group of people who gave a 1 rating with those who gave a 3, you want R to know that you consider Interest to be a factor.
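The sketch below illustrates both uses with a small made-up data frame. The data frame survey and its columns rating, rating_group, and thumb are hypothetical stand-ins, not the actual Fingers variables.

```r
# Hypothetical data, invented for illustration
survey <- data.frame(
  rating = c(1, 3, 3, 2, 1, 3),
  thumb  = c(66, 58, 70, 59, 64, 61)
)

# Numeric version: asking for the average rating makes sense
mean(survey$rating)

# Factor version: the ratings act as group labels
survey$rating_group <- factor(survey$rating)

# How many people gave each rating
table(survey$rating_group)

# Mean thumb length within each rating group
group_means <- tapply(survey$thumb, survey$rating_group, mean)
group_means
```

Keeping both columns in the data frame lets you pick whichever version suits the question you are asking.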
require(coursekata)
# Interest has been coded numerically in the Fingers data.frame
# Modify the following to convert it to factor and store it as InterestFactor in Fingers
Fingers$InterestFactor <- factor(Fingers$Interest)
If you made this new variable correctly, you won’t see anything appear in the R console. That’s because simply creating a new variable doesn’t cause R to print out anything. Sometimes while you are coding, you’ll feel like you did something wrong because nothing gets printed. It might just be that you didn’t tell R to print anything.
The str() command tells you the type of each variable in a data frame. In the code you just wrote, you told R to make a new factor variable, Fingers$InterestFactor, based on the numeric variable, Fingers$Interest. If you wanted to check whether you were successful, you could type str(Fingers) in the code window you were just working in.
The output shows that the Fingers data frame now includes a new variable, Fingers$InterestFactor, and also confirms that this new variable is a factor variable.
str(Fingers)
'data.frame': 157 obs. of 17 variables:
$ Gender : num 2 2 2 2 2 2 2 2 2 2 ...
$ RaceEthnic : num 3 3 3 1 5 3 1 4 3 3 ...
$ FamilyMembers : num 2 4 2 5 2 7 4 3 7 5 ...
$ SSLast : num NA 9 3 7 9 ...
$ Year : num 3 2 2 2 3 3 3 3 1 3 ...
$ Job : num 1 2 2 1 1 1 2 2 1 2 ...
$ MathAnxious : num 4 5 2 1 5 5 2 1 4 2 ...
$ Interest : num 1 3 3 3 3 2 2 3 2 1 ...
$ GradePredict : num 3.3 4 4 3.7 4 3.3 4 4 3 3.7 ...
$ Thumb : num 66 58.4 70 59 64 ...
$ Index : num 79 76.2 80 83 76 83 70 75 74 63 ...
$ Middle : num 84 91.4 90 87 89 95 76 83 83 70 ...
$ Ring : num 74 76.2 70 79 76 86 72 78 79 65 ...
$ Pinkie : num 57 63.5 65 64 69 75 55 60 64 56 ...
$ Height : num 62 70 69 72 70 71 67.5 69 68.5 65 ...
$ Weight : num 188 145 175 155 180 145 130 180 193 138 ...
$ InterestFactor : Factor w/ 3 levels "1","2","3": 1 3 3 3 3 2 2 3 2 1 ...
$ Sex : num 2 2 2 2 2 2 2 2 2 2 ...
Notice how the two variables have a different structure in the data frame. Interest is marked as a variable made of numbers (num). But InterestFactor is now marked as Factor w/ 3 levels "1","2","3". The levels represent the different values (or categories) of this categorical variable.
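If you want to look at the levels directly, base R provides the levels() and nlevels() functions. The sketch below uses a short hypothetical vector, interest, in place of the real Fingers$Interest column.

```r
# Hypothetical ratings, invented for illustration
interest <- c(1, 3, 3, 2, 1)
interest_factor <- factor(interest)

# The levels are the distinct categories, stored as text labels
levels(interest_factor)

# The number of distinct categories
nlevels(interest_factor)

# str() summarizes a single factor the same way it does inside a data frame
str(interest_factor)
```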