1.6 Variable Types in R

Let’s look at a new dataset, called Ames. The data describe a sample of 185 homes sold in Ames, Iowa during a particular time period. Ames is located about 30 miles north of Des Moines (the state capital) and is home to Iowa State University (the largest university in the state).

Write some code below to look at the first six rows of the Ames data frame:

require(coursekata)

# Use the head() function to look at the first six rows of Ames
head(Ames)

There are a lot of variables in this data frame – you can scroll right and left to see them all.

Each row in the Ames data frame represents a particular home. Each variable describes a different feature of the homes in the data frame, including the year each home was built (YearBuilt), how big the house is (HomeSize), and what neighborhood it is in (Neighborhood).
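Besides head(), base R's names() and str() give a quick overview of the variables in a data frame. The sketch below uses a tiny made-up data frame (toy_homes, a hypothetical stand-in for Ames) so that it runs without the coursekata package. The column names mirror the ones mentioned above, but the values are invented for illustration.

```r
# Hypothetical stand-in for Ames: the values are invented for illustration
toy_homes <- data.frame(
  YearBuilt    = c(1961, 1998, 2005),
  HomeSize     = c(1656, 1629, 1338),
  Neighborhood = c("NorthAmes", "CollegeCreek", "OldTown")
)

# First rows of the data frame (all three here, since there are fewer than six)
head(toy_homes)

# Names of the variables (columns)
variable_names <- names(toy_homes)
print(variable_names)

# Compact summary: one line per variable, showing its type and first values
str(toy_homes)
```

With the real data, the same calls would be names(Ames) and str(Ames).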

Quantitative and Categorical Variables in R

Broadly speaking, variables can be divided into two types: quantitative and categorical. Quantitative variables take numerical values (e.g., 3 or 1.25). For quantitative variables, the values they are assigned represent quantities such that observations with higher numbers are assumed to have more of the quantity than those with lower numbers. In Ames, for example, we can assume that a home with a BuildQuality of 7 is actually of higher quality than one with a value of 5.

The values assigned to categorical variables do not represent quantities. Instead, they represent categories. For example, in Ames, the variable Foundation is coded with values such as PouredConcrete or CinderBlock. The difference is not quantitative; these are just two different types of foundations.

Most quantitative variables are categorized by R as numeric (or num). (They may on occasion be categorized as int for integer or dbl for double, which basically means that the numbers can have decimals.) The nice thing about all these types of variables (num, int, and dbl) is that R knows it can add, subtract, multiply, divide, etc., their values. That’s good!
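As a small illustration of these types, the sketch below builds two vectors with invented values (home_size and n_bedrooms are hypothetical names, not columns taken from Ames) and asks R how it stores them. The abbreviations num, int, and dbl are the short labels that functions such as str() or tibble printing display for these types.

```r
# Invented example values, not taken from Ames
home_size  <- c(1656, 1629.5, 1338)  # plain numbers are stored as doubles
n_bedrooms <- c(3L, 3L, 2L)          # the L suffix asks R for integers

class(home_size)        # "numeric"
typeof(home_size)       # "double"
class(n_bedrooms)       # "integer"
is.numeric(n_bedrooms)  # TRUE: integers also count as numeric

# Arithmetic works on both kinds of quantitative variable
total_size    <- sum(home_size)
mean_bedrooms <- mean(n_bedrooms)
print(total_size)
print(mean_bedrooms)
```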

Categorical variables are a slightly different story. Take, for example, the variable HasCentralAir in the Ames dataset (the first six values are printed below).

  HasCentralAir
1             1
2             1
3             1
4             1
5             0
6             1

Even though the variable is coded with numbers (1 represents “has central air” and 0 represents “does not have central air”), it really is a categorical variable. We know that. But R does not know that unless we tell it. R will usually try to guess what kind of variable it is, but it may guess wrong!

For that reason, R has a way to let you specify whether a variable is categorical, using the factor() function. If you tell R that a variable is a factor, it will treat it as a categorical variable. To tell R that HasCentralAir is categorical, we can write factor(Ames$HasCentralAir).

But in order for this change to stick, we have to save this new version of the variable back into the Ames data frame.

Ames$HasCentralAir <- factor(Ames$HasCentralAir)
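To see what factor() changes, here is a minimal sketch on a standalone vector holding the six values printed above, so it runs without the Ames data frame. The names has_central_air, air_factor, and air_labeled are made up for this example, and the "no"/"yes" labels are an illustrative choice rather than part of the dataset.

```r
# The six 0/1 values shown in the printout above
has_central_air <- c(1, 1, 1, 1, 0, 1)
class(has_central_air)  # "numeric": R treats the codes as quantities

# Tell R the values are categories
air_factor <- factor(has_central_air)
class(air_factor)       # "factor"
levels(air_factor)      # "0" "1": the categories, stored as text labels
table(air_factor)       # how many homes fall into each category

# Optionally attach readable labels to the codes (labels chosen for illustration)
air_labeled <- factor(has_central_air, levels = c(0, 1), labels = c("no", "yes"))
table(air_labeled)
```

Readable labels can make output such as tables easier to interpret, but they are optional; factor() alone is enough for R to treat the variable as categorical.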

If the 0s and 1s in the HasCentralAir column represented true quantities, we could add them up using the code sum(Ames$HasCentralAir). But if we tell R that HasCentralAir is a factor, it will assume the 0s and 1s refer to categories, and so it won’t be willing to add them up.

First try running this code to find the sum of HasCentralAir when it is coded as a factor. We should get an error!

require(coursekata)

# this turns HasCentralAir into a factor
Ames$HasCentralAir <- factor(Ames$HasCentralAir)

# this code sums up Ames$HasCentralAir
sum(Ames$HasCentralAir)

# how would we fix this error? (read text below!)

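If HasCentralAir is a factor, sum() won’t work. If we (for some weird reason) wanted to get the sum of the HasCentralAir variable, we could turn it into a numeric variable by using the as.numeric() function instead of factor(). Make the fix in the code window above (then submit). One caution: if the variable has already been saved as a factor, as.numeric() returns R’s internal category codes (1, 2, and so on) rather than the original 0s and 1s; converting with as.numeric(as.character(Ames$HasCentralAir)) recovers the original values.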
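The behavior of sum() and as.numeric() on a factor can be checked on a small standalone vector. This is a sketch using the six values printed earlier, with made-up object names; tryCatch() is used only so the script keeps running after the expected error.

```r
has_central_air <- c(1, 1, 1, 1, 0, 1)
air_factor <- factor(has_central_air)

# sum() refuses to add up a factor; capture the error instead of stopping
sum_result <- tryCatch(sum(air_factor), error = function(e) "error")
print(sum_result)  # "error"

# as.numeric() on a factor returns level codes: level "0" -> 1, level "1" -> 2
codes <- as.numeric(air_factor)

# Going through as.character() recovers the original values
original <- as.numeric(as.character(air_factor))

sum(codes)     # 11, the sum of the codes 2 2 2 2 1 2
sum(original)  # 5, the number of homes with central air
```

The two sums differ because a factor stores each category as a position in its list of levels, and as.numeric() reports that position rather than the label.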
