6.3 Means of binary variables

In most cases, calculating summary statistics for categorical variables does not yield interpretable results. As an example, imagine we asked 5 Dutch people how they commuted this morning. Their answers include biking, walking, and the bus. The result is a categorical variable that we could store in several different ways.

  • As text, with the strings Bike, Walk, and Bus indicating the different categories
  • As numbers, with 0 corresponding to biking, 1 to walking, and 2 to the bus
  • As numbers, with 1 corresponding to biking, 2 to walking, and 3 to the bus

We could implement each of these coding schemes in R, creating a new object with 5 observations and 3 variables:

library(tidyverse) # load the tidyverse, which provides tibble(), summarize(), and %>%

commute <- tibble( # create a new object 'commute' defined as a tidyverse tibble
                  text = c("Bike","Bike","Walk","Walk","Bus"), # variable 'text' stores commute as text
                  code1 = c(0,0,1,1,2), # variable 'code1' stores commute as numerical codes 0,1,2
                  code2 = c(1,1,2,2,3)  # variable 'code2' stores commute as numerical codes 1,2,3
                  )

commute # print object 'commute' to console
## # A tibble: 5 × 3
##   text  code1 code2
##   <chr> <dbl> <dbl>
## 1 Bike      0     1
## 2 Bike      0     1
## 3 Walk      1     2
## 4 Walk      1     2
## 5 Bus       2     3

All 3 of these variables contain exactly the same information, just stored differently. What if we tried to take the mean of each?

commute %>%
  summarize(mean_text = mean(text), # create a new variable 'mean_text' equal to the mean of the variable 'text'
            mean_code1 = mean(code1), # create a new variable 'mean_code1' equal to the mean of the variable 'code1'
            mean_code2 = mean(code2)) # create a new variable 'mean_code2' equal to the mean of the variable 'code2'
## # A tibble: 1 × 3
##   mean_text mean_code1 mean_code2
##       <dbl>      <dbl>      <dbl>
## 1        NA        0.8        1.8

The results are not terribly useful. The mean is not defined for text, so R returns a missing value (NA), along with a warning, for the mean of text. Since the other variables use a numerical coding scheme, it is possible to calculate a mean, but the results of 0.8 and 1.8 have two obvious deficiencies. First, they are different from each other even though the underlying information is exactly the same. Second, they are not comprehensible answers to the question “what modes of commuting are typical for our Dutch survey respondents?”
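To see how much the mean depends on the arbitrary choice of codes, here is a small sketch using only base R. The lookup vectors and the third scheme, code3, are hypothetical additions for illustration and do not appear in the commute object above.

```r
# The five survey responses, stored as text
text <- c("Bike", "Bike", "Walk", "Walk", "Bus")

# Three arbitrary numeric labelings of the same categories.
# 'code3' is a hypothetical scheme invented for this illustration.
code1 <- c(Bike = 0, Walk = 1, Bus = 2)
code2 <- c(Bike = 1, Walk = 2, Bus = 3)
code3 <- c(Bus = 0, Bike = 1, Walk = 2)

# Look up each response's code by name, then take the mean
mean_code1 <- mean(code1[text]) # 0.8
mean_code2 <- mean(code2[text]) # 1.8
mean_code3 <- mean(code3[text]) # 1.2
```

The responses never change, yet each labeling produces a different mean, so none of these numbers describes the commuters themselves.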

But let’s consider one more way of representing the data: a series of binary categorical variables, one for each mode of commuting. These variables will share a coding scheme, assuming a value of 0 if the commuter did not use that mode and a value of 1 if the commuter did use that mode. Here’s what the data would look like under that coding scheme.

commute_binary <- tibble( # create a new object 'commute_binary' defined as a tidyverse tibble
                  bike = c(1,1,0,0,0), # variable 'bike' assumes a value of 1 for bike commutes, 0 otherwise
                  walk = c(0,0,1,1,0), # variable 'walk' assumes a value of 1 for walking commutes, 0 otherwise
                  bus = c(0,0,0,0,1)  # variable 'bus' assumes a value of 1 for bus commutes, 0 otherwise
                  )

commute_binary # print object 'commute_binary' to console
## # A tibble: 5 × 3
##    bike  walk   bus
##   <dbl> <dbl> <dbl>
## 1     1     0     0
## 2     1     0     0
## 3     0     1     0
## 4     0     1     0
## 5     0     0     1
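Binary variables like these do not have to be typed by hand. As a rough sketch, assuming the responses are available as a text column, they could be derived with a logical comparison converted to 0s and 1s. The object names commute_text and commute_derived are made up for this example.

```r
library(tidyverse) # provides tibble(), mutate(), select(), and %>%

# hypothetical starting point: the responses stored as text
commute_text <- tibble(text = c("Bike", "Bike", "Walk", "Walk", "Bus"))

commute_derived <- commute_text %>%
  mutate(bike = as.numeric(text == "Bike"), # 1 if the response is "Bike", 0 otherwise
         walk = as.numeric(text == "Walk"), # 1 if the response is "Walk", 0 otherwise
         bus  = as.numeric(text == "Bus")) %>% # 1 if the response is "Bus", 0 otherwise
  select(bike, walk, bus) # keep only the three binary variables

commute_derived # print object 'commute_derived' to console
```

The result should contain the same three columns as commute_binary.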

This table again stores exactly the same information, but now we can learn something from the means.

commute_binary %>%
  summarize(mean_bike = mean(bike), # create a new variable 'mean_bike' equal to the mean of the variable 'bike'
            mean_walk = mean(walk), # create a new variable 'mean_walk' equal to the mean of the variable 'walk'
            mean_bus = mean(bus)) # create a new variable 'mean_bus' equal to the mean of the variable 'bus'
## # A tibble: 1 × 3
##   mean_bike mean_walk mean_bus
##       <dbl>     <dbl>    <dbl>
## 1       0.4       0.4      0.2

These means correspond to the fraction of survey respondents who used each mode of transport: 40 percent biked, 40 percent walked, and 20 percent rode the bus. In general, the mean of a binary variable coded as 0 and 1 will give you the fraction of observations with a value of 1.
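The reason is simple arithmetic: the mean is the sum of the values divided by the number of observations, and the sum of a 0/1 variable is just the count of 1s. A minimal base R sketch, with variable names chosen here for illustration:

```r
# the binary 'bike' variable from the example above
bike <- c(1, 1, 0, 0, 0)

n_ones <- sum(bike == 1) # number of observations equal to 1
n_obs <- length(bike)    # total number of observations

share_by_count <- n_ones / n_obs # fraction computed by counting
share_by_mean <- mean(bike)      # fraction computed as a mean

# the same fraction, computed directly from the text responses:
# R treats TRUE as 1 and FALSE as 0 when taking a mean
text <- c("Bike", "Bike", "Walk", "Walk", "Bus")
share_from_text <- mean(text == "Bike")
```

All three quantities equal 0.4, the share of respondents who biked.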