1.7 Selecting & Filtering Data in R

Sometimes you want to focus on a subset of the variables in a data frame. For example, you might want to look at just the variables PriceK and PriceR in the Ames data frame. PriceK represents the sale price of the home in thousands of dollars. PriceR represents the sale price in dollars.

We can use the select() function to look at just a subset of variables. When using select(), we first tell R which data frame to use, then which variables to select from that data frame.

select(Ames, PriceK, PriceR)

Modify the select() code below to take a look at just the following variables in Ames: PriceK, PriceR, and Neighborhood.

require(coursekata)

# Starting code to modify:
# select(Ames, ...)

# Completed code:
select(Ames, PriceK, PriceR, Neighborhood)

Running the select() function will print out the values of the selected variables for every case. If you want to just look at the first six rows you can combine the head() and select() functions like this: head(select(Ames, PriceK, PriceR, Neighborhood)).

  PriceK PriceR Neighborhood
1    260 260000 CollegeCreek
2    210 210000 CollegeCreek
3    155 155000      OldTown
4    125 125000      OldTown
5    110 110000 CollegeCreek
6    100 100000      OldTown
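The same pattern can be tried on a small stand-in data frame. The sketch below assumes that select() in this section is the dplyr function, and it rebuilds a toy data frame (the name ames_small is hypothetical) from the six rows printed above rather than using the full Ames data. It also shows that head() accepts an optional n argument for the number of rows to display.

```r
library(dplyr)  # assumption: select() in this section is dplyr's select()

# Toy data frame rebuilt from the six rows printed above (not the full Ames data)
ames_small <- data.frame(
  PriceK = c(260, 210, 155, 125, 110, 100),
  PriceR = c(260000, 210000, 155000, 125000, 110000, 100000),
  Neighborhood = c("CollegeCreek", "CollegeCreek", "OldTown",
                   "OldTown", "CollegeCreek", "OldTown")
)

# Keep two of the three columns; every row is retained
price_by_area <- select(ames_small, PriceK, Neighborhood)

# Show only the first three rows of the selected columns
head(price_by_area, n = 3)
```

Selecting columns never changes the number of rows, so price_by_area still has all six cases.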

Whereas select() gives you a subset of variables (or columns of the data frame), the filter() function will give you a subset of observations (or rows) of the data frame based on some criteria. For example, here is some code that will return only the observations where the sale price is greater than $300,000:

filter(Ames, PriceK > 300)

Edit the code below to filter for homes that cost more than 300K.

require(coursekata)

# Starting code to modify:
# filter()

# Completed code:
filter(Ames, PriceK > 300)
 
  YearBuilt YearSold Neighborhood HomeSizeR HomeSizeK LotSizeR LotSizeK Floors
1      2007     2007 CollegeCreek      2696     2.696     9965    9.965      2
2      2004     2007 CollegeCreek      2000     2.000    10386   10.386      1
3      2000     2009 CollegeCreek      2153     2.153    11050   11.050      2
4      2006     2007 CollegeCreek      2828     2.828     9965    9.965      2
  BuildQuality     Foundation HasCentralAir Bathrooms Bedrooms TotalRooms
1            7 PouredConcrete             1         2        4         10
2            8 PouredConcrete             1         2        3          8
3            9 PouredConcrete             1         2        3          8
4            8 PouredConcrete             1         3        4         11
  KitchenQuality HasFireplace GarageType GarageCars PriceR PriceK
1      Excellent            1   Attached          3 383970 383.97
2           Good            0   Attached          3 305900 305.90
3      Excellent            1   Attached          3 313000 313.00
4           Good            1   Attached          3 424870 424.87

The function filter(), like select(), returns a data frame. In this case, the data frame only has four rows because only four observations in Ames had sale prices greater than $300K.
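One way to confirm how many rows a filter keeps is to pass the result to nrow(). The sketch below assumes filter() is the dplyr function (base R also has an unrelated stats::filter(), which is what runs if dplyr is not loaded). It uses a hypothetical toy data frame, ames_small, rebuilt from the six rows printed earlier in this section, with thresholds chosen only to suit that toy data.

```r
library(dplyr)  # assumption: filter() here is dplyr's filter(), not stats::filter()

# Toy data frame rebuilt from the six rows printed earlier (not the full Ames data)
ames_small <- data.frame(
  PriceK = c(260, 210, 155, 125, 110, 100),
  PriceR = c(260000, 210000, 155000, 125000, 110000, 100000),
  Neighborhood = c("CollegeCreek", "CollegeCreek", "OldTown",
                   "OldTown", "CollegeCreek", "OldTown")
)

# Keep only the rows where the sale price is above $200K
over_200 <- filter(ames_small, PriceK > 200)
nrow(over_200)  # number of observations that met the criterion

# Conditions separated by commas must all be true for a row to be kept
old_town_over_120 <- filter(ames_small, PriceK > 120, Neighborhood == "OldTown")
old_town_over_120
```

Filtering keeps every column and drops only the rows that fail the criteria, which is the reverse of what select() does.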

Remember: even though select() and filter() both return data frames, those new data frames are just temporary unless you save them. If you want to save a data frame that includes only the variables PriceK and PriceR, you would need to do something like this: new_data_frame <- select(Ames, PriceK, PriceR).
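The difference between printing a result and saving it can be seen in a short sketch. It assumes the dplyr versions of select() and filter(), and the names ames_small and prices_only are hypothetical stand-ins built from the six rows printed earlier in this section.

```r
library(dplyr)  # assumption: select() and filter() are the dplyr functions

# Toy data frame rebuilt from the six rows printed earlier (not the full Ames data)
ames_small <- data.frame(
  PriceK = c(260, 210, 155, 125, 110, 100),
  PriceR = c(260000, 210000, 155000, 125000, 110000, 100000),
  Neighborhood = c("CollegeCreek", "CollegeCreek", "OldTown",
                   "OldTown", "CollegeCreek", "OldTown")
)

# Without assignment, the result is printed and then discarded
select(ames_small, PriceK, PriceR)

# With assignment, the result is stored under a new name
prices_only <- select(ames_small, PriceK, PriceR)

# The original data frame still has all of its columns
names(ames_small)
names(prices_only)

# A saved data frame can be passed to another function, such as filter()
filter(prices_only, PriceK > 200)
```

Saving under a new name, rather than overwriting the original, keeps the full data frame available for later steps.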
