6.3 Means of binary variables
In most cases, calculating summary statistics for categorical variables does not yield interpretable results. As an example, imagine we asked 5 Dutch people how they commuted this morning. Their answers include biking, walking, and taking the bus. The result is a categorical variable that we could store in several different ways.
- As text, with the strings Bike, Walk, and Bus indicating the different categories
- As numbers, with 0 corresponding to biking, 1 to walking, and 2 to the bus
- As numbers, with 1 corresponding to biking, 2 to walking, and 3 to the bus
We could implement each of these coding schemes in R, creating a new object with 5 observations and 3 variables:
commute <- tibble( # create a new object 'commute' defined as a tidyverse tibble
  text = c("Bike","Bike","Walk","Walk","Bus"), # variable 'text' stores commute as text
  code1 = c(0,0,1,1,2), # variable 'code1' stores commute as numerical codes 0,1,2
  code2 = c(1,1,2,2,3) # variable 'code2' stores commute as numerical codes 1,2,3
)
commute # print object 'commute' to console
## # A tibble: 5 × 3
##   text  code1 code2
##   <chr> <dbl> <dbl>
## 1 Bike      0     1
## 2 Bike      0     1
## 3 Walk      1     2
## 4 Walk      1     2
## 5 Bus       2     3
All 3 of these variables contain exactly the same information, just stored differently. What if we tried to take the mean of each?
commute %>%
  summarize(mean_text = mean(text), # create a new variable 'mean_text' equal to the mean of the variable 'text'
            mean_code1 = mean(code1), # create a new variable 'mean_code1' equal to the mean of the variable 'code1'
            mean_code2 = mean(code2)) # create a new variable 'mean_code2' equal to the mean of the variable 'code2'
## # A tibble: 1 × 3
##   mean_text mean_code1 mean_code2
##       <dbl>      <dbl>      <dbl>
## 1        NA        0.8        1.8
The results are not terribly useful. Arithmetic operations are not defined for text, so R returns a missing value (NA), along with a warning, for the mean of text. Since the other variables use a numerical coding scheme, it is possible to calculate a mean, but the results of 0.8 and 1.8 have two obvious deficiencies. First, they differ from each other even though the underlying information is exactly the same. Second, neither is a comprehensible answer to the question “what modes of commuting are typical for our Dutch survey respondents?”
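To make the first deficiency concrete, here is a minimal base R sketch comparing the two coding schemes above with a third. The vector code3 is a hypothetical relabeling introduced only for illustration; it is not part of the original example.

```r
# The same five commutes under three arbitrary numeric labelings
code1 <- c(0, 0, 1, 1, 2)  # Bike = 0, Walk = 1, Bus = 2
code2 <- code1 + 1         # Bike = 1, Walk = 2, Bus = 3
code3 <- c(2, 2, 1, 1, 0)  # hypothetical relabeling: Bike = 2, Walk = 1, Bus = 0

# Collect the mean under each labeling in a named vector
means <- c(code1 = mean(code1), code2 = mean(code2), code3 = mean(code3))
means # print the three means to console
```

Adding 1 to every code shifts the mean by exactly 1, and reordering the labels changes it again, so the mean reflects the arbitrary labeling rather than anything about how people commute.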
But let’s consider one more way of representing the data: a series of binary categorical variables, one for each mode of commuting. These variables will share a coding scheme, assuming a value of 0 if the commuter did not use that mode and a value of 1 if the commuter did use that mode. Here’s what the data would look like under that coding scheme.
commute_binary <- tibble( # create a new object 'commute_binary' defined as a tidyverse tibble
  bike = c(1,1,0,0,0), # variable 'bike' assumes a value of 1 for bike commutes, 0 otherwise
  walk = c(0,0,1,1,0), # variable 'walk' assumes a value of 1 for walking commutes, 0 otherwise
  bus = c(0,0,0,0,1) # variable 'bus' assumes a value of 1 for bus commutes, 0 otherwise
)
commute_binary # print object 'commute_binary' to console
## # A tibble: 5 × 3
##    bike  walk   bus
##   <dbl> <dbl> <dbl>
## 1     1     0     0
## 2     1     0     0
## 3     0     1     0
## 4     0     1     0
## 5     0     0     1
This table again stores exactly the same information, but now we can learn something from the means.
commute_binary %>%
  summarize(mean_bike = mean(bike), # create a new variable 'mean_bike' equal to the mean of the variable 'bike'
            mean_walk = mean(walk), # create a new variable 'mean_walk' equal to the mean of the variable 'walk'
            mean_bus = mean(bus)) # create a new variable 'mean_bus' equal to the mean of the variable 'bus'
## # A tibble: 1 × 3
##   mean_bike mean_walk mean_bus
##       <dbl>     <dbl>    <dbl>
## 1       0.4       0.4      0.2
These means correspond to the fraction of survey respondents who used each mode of transport: 40 percent biked, 40 percent walked, and 20 percent rode the bus. In general, the mean of a binary variable coded as (0,1) gives you the fraction of observations with a value of 1, because summing the 0s and 1s counts the 1s, and dividing by the number of observations turns that count into a fraction.
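The same fractions can also be computed from the text variable without typing the binary variables by hand. The following is a minimal sketch, assuming the dplyr and tibble packages are installed. It relies on the fact that a comparison such as text == "Bike" produces TRUE/FALSE values, which R treats as 1/0 when taking a mean. The names commute_text, shares, and share_* are illustrative and not part of the original example.

```r
library(dplyr)  # provides %>% and summarize()
library(tibble) # provides tibble()

# Rebuild the text version of the commute data (illustrative object name)
commute_text <- tibble(text = c("Bike", "Bike", "Walk", "Walk", "Bus"))

shares <- commute_text %>%
  summarize(share_bike = mean(text == "Bike"), # TRUE counts as 1, FALSE as 0
            share_walk = mean(text == "Walk"),
            share_bus = mean(text == "Bus"))
shares # print the fraction of respondents using each mode
```

Each comparison acts as a temporary binary variable, so the resulting means should match those computed from commute_binary.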