5.3 Group comparisons

5.3.1 Summary statistics by group

Comparing outcomes across groups can reveal important patterns. We conduct such comparisons using the group_by() function, here in combination with summarize(). The group_by() function takes as arguments a comma-separated list of variables. R will then evaluate all subsequent piped commands separately for every combination of values that appears in the listed variables. Here’s an example that calculates the mean and standard deviation of math proficiency (ProfMath) separately for elementary, middle, and high schools (SchType).

  dcps %>%
    group_by(SchType) %>%  # separately for each value of SchType
    summarize(
      Avg = mean(ProfMath),  # calculate mean of ProfMath
      StDev = sd(ProfMath)   # calculate SD of ProfMath
    )
## # A tibble: 3 × 3
##   SchType      Avg StDev
##   <fct>      <dbl> <dbl>
## 1 Elementary  34.0  23.7
## 2 Middle      19.6  17.6
## 3 High        12.9  22.5
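
To illustrate what "every combination of values" means when more than one variable is listed, here is a minimal sketch on a small invented data set. The toy tibble and its Ward column are hypothetical and are not taken from dcps; only group_by(), summarize(), and n() are real dplyr functions. The .groups = "drop" argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Invented toy data: the column names mimic dcps, but Ward and all values
# are hypothetical and used only for illustration
toy <- tibble(
  SchType  = c("Elementary", "Elementary", "Elementary",
               "Middle", "Middle", "Middle"),
  Ward     = c("A", "A", "B", "B", "B", "B"),
  ProfMath = c(30, 40, 50, 10, 20, 30)
)

by_type_ward <- toy %>%
  group_by(SchType, Ward) %>%  # one group per observed SchType/Ward pair
  summarize(
    N   = n(),                 # number of rows in the group
    Avg = mean(ProfMath),      # group mean of ProfMath
    .groups = "drop"           # return an ungrouped tibble
  )

by_type_ward
```

The result has three rows rather than four, because no middle school in the toy data sits in Ward A: groups are formed only for combinations that actually occur in the data.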

Note: the arguments for group_by() should be categorical variables and not continuous variables. Why? What happens when you try using a continuous variable?
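
One way to explore this question is with a small invented data set. The sketch below groups first on a raw continuous variable and then on a binned version of it created with base R's cut(). The toy values, the ProfReading column, the MathBand name, and the break points are all assumptions made for illustration.

```r
library(dplyr)

# Invented toy data: ProfMath is continuous, so every value here is distinct.
# ProfReading is a hypothetical second column.
toy <- tibble(
  ProfMath    = c(12.5, 18.1, 33.3, 47.9, 60.2, 71.4),
  ProfReading = c(20, 25, 40, 45, 70, 65)
)

# Grouping on the raw continuous variable: one group per distinct value
by_value <- toy %>%
  group_by(ProfMath) %>%
  summarize(N = n(), AvgRead = mean(ProfReading))

# Binning with cut() first yields a few groups with several schools each;
# the break points are arbitrary choices for this example
by_bin <- toy %>%
  mutate(MathBand = cut(ProfMath, breaks = c(0, 25, 50, 100))) %>%
  group_by(MathBand) %>%
  summarize(N = n(), AvgRead = mean(ProfReading))

by_value
by_bin
```

In by_value every group contains a single row, so the "summary" simply repeats the data. In by_bin each band contains two rows and the group means become informative.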

5.3.2 Visualize group differences

Plotting distributions separately for different groups can offer insights into the relationships between variables. Chapter 9 introduces plotting generally and Section 9.2 covers techniques for visualizing group differences and relationships between variables.
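
As a brief preview, and only as a sketch under stated assumptions, side-by-side boxplots are one common way to compare a distribution across groups. The example uses the ggplot2 package and an invented toy data frame whose column names mimic dcps; the values are hypothetical, and this is not necessarily the approach taken in Chapter 9.

```r
library(ggplot2)

# Invented toy data; values are hypothetical and for illustration only
toy <- data.frame(
  SchType = factor(
    rep(c("Elementary", "Middle", "High"), each = 3),
    levels = c("Elementary", "Middle", "High")  # keep a sensible axis order
  ),
  ProfMath = c(30, 40, 50, 10, 20, 30, 5, 10, 25)
)

# One boxplot of ProfMath per school type
p <- ggplot(toy, aes(x = SchType, y = ProfMath)) +
  geom_boxplot() +
  labs(x = "School type", y = "Math proficiency")

p
```

Setting the factor levels explicitly controls the left-to-right order of the groups on the axis; otherwise they would appear alphabetically.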