8.1 Filter/subset data
It is often necessary to limit your analysis to some subset of observations. Use the filter() command to specify the criteria by which to select observations.
film_fem <- film %>%                 # create new object 'film_fem' starting with object 'film'
  filter(SubjectSex == 'Female')     # choose observations where the variable
                                     # 'SubjectSex' has a value of 'Female'
film_fem                             # print 'film_fem' to the console
## # A tibble: 177 × 10
## Title Release NumSubjects SubjectName SubjectType SubjectRace PersonOfColor
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 Big Ey… 2014 1 Margaret K… Artist White 0
## 2 Testam… 2014 1 Vera Britt… Other NA 0
## 3 The Br… 2014 1 Brittany M… Actress White 0
## 4 Wild 2014 1 Cheryl Str… Other NA 0
## 5 Diana 2013 1 Princess D… Other White 0
## 6 Lovela… 2013 1 Linda Love… Actress White 0
## 7 Philom… 2013 1 Philomena … Other White 0
## 8 Saving… 2013 2 P.L. Trave… Author White 0
## 9 Hitchc… 2012 2 Alma Revil… Other White 0
## 10 Hyde P… 2012 2 Margaret S… Other NA 0
## # ℹ 167 more rows
## # ℹ 3 more variables: SubjectSex <chr>, LeadActor <chr>, Period <chr>
The conditions inside filter() identify the observations, or rows, to keep (i.e. you're selecting only those rows that satisfy the given conditions). You can supply any number of conditions. Note that a double equals sign == tests for equality, while a single equals sign = assigns values or names function arguments. Because filter() evaluates criteria for each observation rather than assigning values, always use a double equals sign == when testing for equality inside filter().
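To illustrate filtering on more than one condition, here is a minimal sketch. It assumes the dplyr package is installed, and it uses a small made-up data frame, film_demo, as a stand-in for film; the titles and values are invented for illustration and are not taken from the real data.

```r
library(dplyr)

# Hypothetical stand-in for 'film'; all values are invented for illustration
film_demo <- tibble(
  Title      = c("A", "B", "C", "D"),
  Release    = c(2014, 2014, 2013, 2012),
  SubjectSex = c("Female", "Male", "Female", "Male")
)

# Conditions separated by commas must all be TRUE for a row to be kept
fem_14 <- film_demo %>%
  filter(SubjectSex == "Female", Release == 2014)

# The | operator keeps rows that satisfy at least one of the conditions
fem_or_14 <- film_demo %>%
  filter(SubjectSex == "Female" | Release == 2014)

fem_14
fem_or_14
```

In this toy data, the first filter keeps only film "A", while the second keeps "A", "B", and "C".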
Filtering or subsetting by a text-based variable like SubjectSex requires that you enclose the desired value in single or double quotation marks, i.e. SubjectSex == 'Female' or SubjectSex == "Female", not SubjectSex == Female. Without quotation marks, R will look for an object named Female. To filter on a numeric variable, however, omit the quotation marks. For instance, say we are interested in films released in 2014:
film_14 <- film %>%                  # create new object 'film_14' starting with object 'film'
  filter(Release == 2014)            # choose observations where the variable
                                     # 'Release' has a value of 2014
film_14                              # print 'film_14' to the console
## # A tibble: 33 × 10
## Title Release NumSubjects SubjectName SubjectType SubjectRace PersonOfColor
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 1987 2014 1 Ricardo Tr… Other White 0
## 2 Americ… 2014 1 Chris Kyle Military NA 0
## 3 Big Ey… 2014 1 Margaret K… Artist White 0
## 4 Cesar … 2014 1 Cesar Chav… Activist Hispanic (… 1
## 5 Desert… 2014 1 Afshin Gha… Artist Middle Eas… 1
## 6 Diggin… 2014 1 Adam Green Other NA 0
## 7 Exodus… 2014 1 Moses Historical Middle Eas… 0
## 8 Foxcat… 2014 1 Mark Schul… Athlete White 0
## 9 Get on… 2014 1 James Brown Singer African Am… 1
## 10 Grace … 2014 1 Grace Kelly Actress White 0
## # ℹ 23 more rows
## # ℹ 3 more variables: SubjectSex <chr>, LeadActor <chr>, Period <chr>
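The quoting rule can be checked directly. The sketch below again assumes dplyr is installed and uses a made-up data frame, film_demo, with invented values. The column type (shown as <chr> or <dbl> under each column name in the printed tibble) indicates whether the value needs quotation marks, and tryCatch() is used here only to show that the unquoted version fails rather than halting the script.

```r
library(dplyr)

# Hypothetical stand-in for 'film'; all values are invented for illustration
film_demo <- tibble(
  Title      = c("A", "B", "C"),
  Release    = c(2014, 2013, 2014),
  SubjectSex = c("Female", "Female", "Male")
)

# The column type tells you whether to quote the comparison value
class(film_demo$SubjectSex)   # "character": quote the value
class(film_demo$Release)      # "numeric": no quotation marks

quoted_ok  <- film_demo %>% filter(SubjectSex == "Female")
numeric_ok <- film_demo %>% filter(Release == 2014)

# Without quotation marks, R searches for an object named Female.
# No such object exists here, so filter() stops with an error.
unquoted_failed <- tryCatch(
  {
    film_demo %>% filter(SubjectSex == Female)
    FALSE
  },
  error = function(e) TRUE
)
unquoted_failed
```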