1.6 Variable Types in R

Let’s look at a new dataset, called Ames. The data describe a sample of 185 homes sold in Ames, Iowa during a particular time period. Ames is located about 30 miles north of Des Moines (the state capital) and is home to Iowa State University (the largest university in the state).

Write some code below to look at the first six rows of the Ames data frame:

require(coursekata)

# Use the head() function to look at the first six rows of Ames
head(Ames)

There are a lot of variables in this data frame – you can scroll right and left to see them all.

Each row in the Ames data frame represents a particular home. Each variable describes a different feature of the homes in the data frame, including the year each home was built (YearBuilt), how big the house is (HomeSize), and what neighborhood it is in (Neighborhood).

Quantitative and Categorical Variables in R

Broadly speaking, variables can be divided into two types: quantitative and categorical. Quantitative variables take numerical values (e.g., 3 or 1.25). For quantitative variables, the values they are assigned represent quantities such that observations with higher numbers are assumed to have more of the quantity than those with lower numbers. In Ames, for example, we can assume that a home with a BuildQuality of 7 is actually of higher quality than one with a value of 5.

The values assigned to categorical variables do not represent quantities. Instead, they represent categories. For example, in Ames, the variable Foundation is coded with values such as PouredConcrete or CinderBlock. The difference is not quantitative; these are just two different types of foundations.

Most quantitative variables are categorized by R as numeric (or num). (They may on occasion be categorized as int for integer or dbl for double, which basically means that the numbers can have decimals.) The nice thing about all these types of variables (num, int, and dbl) is that R knows it can add, subtract, multiply, and divide their values, among other operations. That’s good!
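To make these type labels concrete, here is a minimal sketch. The data frame toy_homes and its values are made up for illustration (they are not the real Ames data); only base R functions are used.

```r
# toy_homes is a hypothetical stand-in for Ames; the values are invented
toy_homes <- data.frame(
  YearBuilt = c(1995L, 2003L, 1978L),    # the L suffix stores whole numbers as integer
  HomeSize  = c(1500.5, 2100.0, 980.25)  # numbers with decimals are stored as double
)

# Ask R how it has categorized each column
class(toy_homes$YearBuilt)   # "integer"
class(toy_homes$HomeSize)    # "numeric"
typeof(toy_homes$HomeSize)   # "double"

# Because both columns are numeric, arithmetic works on them
total_size <- sum(toy_homes$HomeSize)
year_span  <- max(toy_homes$YearBuilt) - min(toy_homes$YearBuilt)
total_size   # 4580.75
year_span    # 25
```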

Categorical variables are a slightly different story. Take, for example, the variable HasCentralAir in the Ames dataset (the first six values are printed below).

  HasCentralAir
1             1
2             1
3             1
4             1
5             0
6             1

Even though the variable is coded with numbers (1 means “has central air” and 0 means “does not have central air”), it really is a categorical variable. We know that. But R does not know that unless we tell it. R will usually try to guess what kind of variable it is, but it may guess wrong!
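One way to see R's guess is to ask it directly. The sketch below assumes a small made-up data frame, toy_homes, that holds only the six values printed above; it is not the real Ames data frame.

```r
# toy_homes is a hypothetical stand-in holding the six values shown above
toy_homes <- data.frame(HasCentralAir = c(1, 1, 1, 1, 0, 1))

# R sees numbers, so it guesses that the variable is quantitative
class(toy_homes$HasCentralAir)       # "numeric"
is.factor(toy_homes$HasCentralAir)   # FALSE

# str() gives a compact summary of every column and its type
str(toy_homes)
```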

For that reason, R has a way to let you specify whether a variable is categorical, using the factor() function. If you tell R that a variable is a factor, it will treat it as a categorical variable. To tell R that HasCentralAir is categorical, we can write factor(Ames$HasCentralAir).

But in order for this change to stick, we have to save this new version of the variable back into the Ames data frame.

Ames$HasCentralAir <- factor(Ames$HasCentralAir)
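To check that a change like this took effect, you can inspect the variable afterward. This is a minimal sketch on a made-up data frame (toy_homes is an assumed name and holds only the six values printed earlier, not the full Ames data).

```r
# toy_homes is a hypothetical stand-in for Ames
toy_homes <- data.frame(HasCentralAir = c(1, 1, 1, 1, 0, 1))

# Save the factor version back into the data frame
toy_homes$HasCentralAir <- factor(toy_homes$HasCentralAir)

# R now treats the variable as categorical
class(toy_homes$HasCentralAir)    # "factor"
levels(toy_homes$HasCentralAir)   # "0" "1"

# For a factor, counting the rows in each category is the natural summary
category_counts <- table(toy_homes$HasCentralAir)
category_counts                   # one home coded 0, five homes coded 1
```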

If the 0s and 1s in the HasCentralAir column represented true quantities, we could add them up using the code sum(Ames$HasCentralAir). But if we tell R that HasCentralAir is a factor, it will assume the 0s and 1s refer to categories, and sum() will return an error instead of a total.
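The contrast can be sketched on a small made-up vector (the six values printed earlier, not the full Ames data). The tryCatch() wrapper is only there so the example keeps running and stores the error message rather than stopping.

```r
# Hypothetical stand-in: the six values shown earlier
central_air_num <- c(1, 1, 1, 1, 0, 1)

# As a numeric variable, R will add the values
numeric_total <- sum(central_air_num)
numeric_total   # 5

# As a factor, R refuses: sum() signals an error
central_air_fct <- factor(central_air_num)
factor_result <- tryCatch(
  sum(central_air_fct),
  error = function(e) conditionMessage(e)  # capture the error text instead of halting
)
factor_result   # an error message saying sum is not meaningful for factors
```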

In the code block below, add the sum() function to find the sum of HasCentralAir when it is coded as a numeric variable (R thinks of the 0s and 1s as numbers).

require(coursekata)

# this turns HasCentralAir into a numeric variable
Ames$HasCentralAir <- as.numeric(Ames$HasCentralAir)

# add code to sum up the values of HasCentralAir
sum(Ames$HasCentralAir)

Even though R summed up these values, we shouldn’t total them, because the 0s and 1s represent categories. The total is uninterpretable.
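One caveat about as.numeric() is worth knowing. If a variable has already been turned into a factor, as.numeric() returns R's internal category codes (1, 2, ...) rather than the original 0s and 1s. The sketch below uses a made-up vector of the six values printed earlier; converting through as.character() first is a common way to get the original values back.

```r
# Hypothetical stand-in: the six values shown earlier, stored as a factor
central_air <- factor(c(1, 1, 1, 1, 0, 1))

# Direct conversion gives the internal level codes, not the original values
codes <- as.numeric(central_air)
codes       # 2 2 2 2 1 2

# Converting to character first recovers the original 0s and 1s
original <- as.numeric(as.character(central_air))
original    # 1 1 1 1 0 1
```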
