8.1 Filter/subset data

It is often necessary to limit your analysis to a subset of observations. Use the filter() function to specify the criteria for selecting observations.

  film_fem <- film %>% # create new object 'film_fem' starting with object 'film'
    filter(SubjectSex == 'Female') # Choose observations where the variable
      # 'SubjectSex' has a value of 'Female'

  film_fem # Print 'film_fem' to the console
## # A tibble: 177 × 10
##    Title   Release NumSubjects SubjectName SubjectType SubjectRace PersonOfColor
##    <chr>     <dbl>       <dbl> <chr>       <chr>       <chr>               <dbl>
##  1 Big Ey…    2014           1 Margaret K… Artist      White                   0
##  2 Testam…    2014           1 Vera Britt… Other       NA                      0
##  3 The Br…    2014           1 Brittany M… Actress     White                   0
##  4 Wild       2014           1 Cheryl Str… Other       NA                      0
##  5 Diana      2013           1 Princess D… Other       White                   0
##  6 Lovela…    2013           1 Linda Love… Actress     White                   0
##  7 Philom…    2013           1 Philomena … Other       White                   0
##  8 Saving…    2013           2 P.L. Trave… Author      White                   0
##  9 Hitchc…    2012           2 Alma Revil… Other       White                   0
## 10 Hyde P…    2012           2 Margaret S… Other       NA                      0
## # ℹ 167 more rows
## # ℹ 3 more variables: SubjectSex <chr>, LeadActor <chr>, Period <chr>

The conditions inside filter() identify the observations, or rows, to keep; that is, you are selecting only those rows that satisfy the given conditions. You can supply any number of conditions. Note that a double equals sign == tests for equality, while a single equals sign = assigns a value or names a function argument. Because filter() evaluates criteria for each observation rather than assigning values, always use a double equals sign ==, never a single =, when testing for equality inside filter().
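As a minimal sketch of filtering on more than one condition, the example below builds a small made-up tibble, film_toy, that borrows three column names from film; its rows are invented for illustration and are not taken from the real data. Conditions separated by a comma must all hold (the same as joining them with &), while | keeps rows that satisfy at least one of them.

```r
library(dplyr)

# Hypothetical toy data standing in for 'film'; the values are illustrative only
film_toy <- tibble(
  Title = c("A", "B", "C", "D"),
  Release = c(2014, 2014, 2013, 2012),
  SubjectSex = c("Female", "Male", "Female", "Male")
)

film_fem_14 <- film_toy %>% # start with the toy object 'film_toy'
  filter(SubjectSex == "Female", Release == 2014) # keep rows where BOTH conditions hold

film_fem_or_14 <- film_toy %>%
  filter(SubjectSex == "Female" | Release == 2014) # keep rows where EITHER condition holds

film_fem_14    # Print the rows matching both conditions
film_fem_or_14 # Print the rows matching at least one condition
```

With this toy data, the first filter keeps only film "A", while the second keeps films "A", "B", and "C".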

Filtering or subsetting by a text-based variable like SubjectSex requires that you enclose the desired value in single or double quotation marks, e.g. SubjectSex == 'Female' or SubjectSex == "Female", not SubjectSex == Female. Without quotation marks, R will look for an object named Female and, finding none, return an error. However, if you want to filter on a numeric variable, you will generally omit the quotation marks. For instance, say we are interested in films released in 2014:

  film_14 <- film %>% # create new object 'film_14' starting with object 'film'
    filter(Release == 2014) # Choose observations where the variable
      # 'Release' has a value of 2014

  film_14 # Print 'film_14' to the console
## # A tibble: 33 × 10
##    Title   Release NumSubjects SubjectName SubjectType SubjectRace PersonOfColor
##    <chr>     <dbl>       <dbl> <chr>       <chr>       <chr>               <dbl>
##  1 1987       2014           1 Ricardo Tr… Other       White                   0
##  2 Americ…    2014           1 Chris Kyle  Military    NA                      0
##  3 Big Ey…    2014           1 Margaret K… Artist      White                   0
##  4 Cesar …    2014           1 Cesar Chav… Activist    Hispanic (…             1
##  5 Desert…    2014           1 Afshin Gha… Artist      Middle Eas…             1
##  6 Diggin…    2014           1 Adam Green  Other       NA                      0
##  7 Exodus…    2014           1 Moses       Historical  Middle Eas…             0
##  8 Foxcat…    2014           1 Mark Schul… Athlete     White                   0
##  9 Get on…    2014           1 James Brown Singer      African Am…             1
## 10 Grace …    2014           1 Grace Kelly Actress     White                   0
## # ℹ 23 more rows
## # ℹ 3 more variables: SubjectSex <chr>, LeadActor <chr>, Period <chr>
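The same quoting rule applies when a condition uses a comparison operator other than ==. The sketch below again relies on a small invented tibble, film_toy (a hypothetical stand-in for film, with made-up rows), and contrasts an unquoted numeric comparison with a quoted text comparison.

```r
library(dplyr)

# Hypothetical toy data standing in for 'film'; the values are illustrative only
film_toy <- tibble(
  Title = c("A", "B", "C", "D"),
  Release = c(2014, 2014, 2013, 2012),
  SubjectSex = c("Female", "Male", "Female", "Male")
)

film_recent <- film_toy %>%
  filter(Release >= 2013) # numeric variable: the value 2013 is not quoted

film_not_male <- film_toy %>%
  filter(SubjectSex != "Male") # text variable: the value "Male" is quoted

film_recent   # Print films released in 2013 or later
film_not_male # Print films whose subject is not recorded as 'Male'
```

With this toy data, film_recent contains films "A", "B", and "C", and film_not_male contains films "A" and "C".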