The power of dplyr and logical vectors

July 5, 2020

Hadley Wickham’s dplyr package is an incredibly powerful R package for data analysis. A common data analysis technique, known as split-apply-combine, involves creating statistical summaries by groups within a data frame.

This post will introduce the power of using logical vectors within your dplyr code to create complex data summaries with ease.

Special Properties of Logical Vectors in R

Imagine we have data from a survey we recently conducted where 7 people responded and provided their age. This data is stored in the age vector below.

age <- c(23, 31, 27, 41, 54, 34, 25)

age
## [1] 23 31 27 41 54 34 25

What if we would like to know the number of people who are 30 or older and what percentage of the total respondents this group represents?

We can answer this question by first using R’s >= operator to find where values stored in the age vector are greater than or equal to the value 30. Anytime we use one of R’s comparison operators (>, >=, <, <=, ==) on a vector, we get a logical vector of TRUE/FALSE values indicating where our condition was met.

For example, running the code below produces a sequence of TRUE/FALSE values that test where our respondents are 30 or older in the age vector.

age >= 30
## [1] FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE

Two Important Operations on Logical Vectors in R

To answer our question above, we can use the following properties of logical vectors in R:

  • the sum of a logical vector returns the number of TRUE values
  • the mean of a logical vector returns the proportion of TRUE values

We see from the output below that 4 people in our survey were 30 years or older and that this represents 57% of the total respondents.

sum(age >= 30)
## [1] 4
mean(age >= 30)
## [1] 0.5714286
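
Both properties follow from how R coerces logical values in arithmetic: TRUE is treated as 1 and FALSE as 0. The short sketch below makes that coercion explicit and shows one caveat the survey example does not run into, namely missing values. The age_with_na vector is a hypothetical variant of age, introduced here only for illustration.

```r
age <- c(23, 31, 27, 41, 54, 34, 25)

# Logical values coerce to integers: TRUE becomes 1, FALSE becomes 0
as.integer(age >= 30)
## [1] 0 1 0 1 1 1 0

# Hypothetical variant of the survey with one missing response
age_with_na <- c(23, 31, NA, 41, 54, 34, 25)

# Comparing NA yields NA, so sum() returns NA by default
sum(age_with_na >= 30)
## [1] NA

# na.rm = TRUE drops the missing comparison before counting
sum(age_with_na >= 30, na.rm = TRUE)
## [1] 4
mean(age_with_na >= 30, na.rm = TRUE)
## [1] 0.6666667
```

Note that with na.rm = TRUE the proportion is computed over the 6 non-missing responses, not all 7, which may or may not be the denominator you want.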

How Can This Help When Using dplyr?

Let’s go through a simple example where using these two properties can help with performing complex statistical summaries with dplyr.

We will be working with a subset of the mpg dataset, which ships with ggplot2 and becomes available once the tidyverse package is loaded.

This dataset contains the fuel efficiency and other interesting properties of 234 cars.

library(tidyverse)

mpg_df <- mpg %>%
          select(manufacturer, model, drv, hwy)

mpg_df %>% head()
## # A tibble: 6 x 4
##   manufacturer model drv     hwy
##   <chr>        <chr> <chr> <int>
## 1 audi         a4    f        29
## 2 audi         a4    f        29
## 3 audi         a4    f        31
## 4 audi         a4    f        30
## 5 audi         a4    f        26
## 6 audi         a4    f        26

A Simple Example

Using the split-apply-combine technique with dplyr usually involves taking a data frame, forming subsets with the group_by() function, applying a summary function to the groups, and collecting the results into a single data frame.

A simple example would be to answer the following questions about our subset of mpg:

How many cars are there by manufacturer? What is the average highway fuel efficiency by manufacturer?

The code below answers these questions with ease.

mpg_df %>% group_by(manufacturer) %>% 
           summarise(n_cars = n(),
                     avg_hwy = mean(hwy))
## # A tibble: 15 x 3
##    manufacturer n_cars avg_hwy
##    <chr>         <int>   <dbl>
##  1 audi             18    26.4
##  2 chevrolet        19    21.9
##  3 dodge            37    17.9
##  4 ford             25    19.4
##  5 honda             9    32.6
##  6 hyundai          14    26.9
##  7 jeep              8    17.6
##  8 land rover        4    16.5
##  9 lincoln           3    17  
## 10 mercury           4    18  
## 11 nissan           13    24.6
## 12 pontiac           5    26.4
## 13 subaru           14    25.6
## 14 toyota           34    24.9
## 15 volkswagen       27    29.2

A More Challenging Question

What if someone asked us the following questions:

How many cars have a highway fuel efficiency greater than 16, by manufacturer? What proportion of the total cars does this group represent within each manufacturer?

Without Using Logical Vectors

This question can be answered without the use of logical vectors, but it involves a surprising amount of work! The steps are listed below:

  • We must calculate the number of cars by manufacturer and store it in a new data frame
  • Next we calculate the number of cars by manufacturer that have a hwy value greater than 16 and store it in a separate data frame
  • Finally we join the data together into our final result and calculate the proportion

The R code below implements this logic.

# Counts by manufacturer
cars_by_manuf <- mpg_df %>% group_by(manufacturer) %>% 
                 summarise(n_cars = n())

# Counts by manufacturer for hwy > 16
cars_by_manuf_16 <- mpg_df %>% 
                    filter(hwy > 16) %>% 
                    group_by(manufacturer) %>% 
                    summarise(n_cars_16 = n())

# Combine into one data frame and compute proportion within each group
result <- cars_by_manuf %>% 
          left_join(cars_by_manuf_16, by = 'manufacturer') %>%
          mutate(prop_cars_16 = n_cars_16/n_cars)

# View results
result
## # A tibble: 15 x 4
##    manufacturer n_cars n_cars_16 prop_cars_16
##    <chr>         <int>     <int>        <dbl>
##  1 audi             18        18        1    
##  2 chevrolet        19        16        0.842
##  3 dodge            37        25        0.676
##  4 ford             25        22        0.88 
##  5 honda             9         9        1    
##  6 hyundai          14        14        1    
##  7 jeep              8         6        0.75 
##  8 land rover        4         2        0.5  
##  9 lincoln           3         2        0.667
## 10 mercury           4         4        1    
## 11 nissan           13        13        1    
## 12 pontiac           5         5        1    
## 13 subaru           14        14        1    
## 14 toyota           34        33        0.971
## 15 volkswagen       27        27        1

I wasn’t joking when I said that it was a surprising amount of work! Let’s see how logical vectors can come to our rescue.

Using Logical Vectors

Using the two properties of logical vectors from above, we can compute the results in a single dplyr expression.

mpg_df %>% group_by(manufacturer) %>% 
           summarise(n_cars = n(),
                     n_cars_16 = sum(hwy > 16), 
                     prop_cars_16 = mean(hwy > 16))
## # A tibble: 15 x 4
##    manufacturer n_cars n_cars_16 prop_cars_16
##    <chr>         <int>     <int>        <dbl>
##  1 audi             18        18        1    
##  2 chevrolet        19        16        0.842
##  3 dodge            37        25        0.676
##  4 ford             25        22        0.88 
##  5 honda             9         9        1    
##  6 hyundai          14        14        1    
##  7 jeep              8         6        0.75 
##  8 land rover        4         2        0.5  
##  9 lincoln           3         2        0.667
## 10 mercury           4         4        1    
## 11 nissan           13        13        1    
## 12 pontiac           5         5        1    
## 13 subaru           14        14        1    
## 14 toyota           34        33        0.971
## 15 volkswagen       27        27        1
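
The two approaches agree on mpg_df because every manufacturer has at least one car with hwy above 16. They can differ when a group has no matching rows: filter() drops that group entirely, so the join leaves NA, while sum() on the logical vector returns 0. The sketch below illustrates this on a small made-up data frame; toy and its values are hypothetical and not taken from mpg.

```r
library(dplyr)

# Hypothetical toy data: no car from manufacturer "b" exceeds the threshold
toy <- tibble(
  manufacturer = c("a", "a", "b", "b"),
  hwy          = c(20, 15, 14, 12)
)

# Join-based approach: filter() removes every "b" row, so the join yields NA
n_all <- toy %>% group_by(manufacturer) %>%
         summarise(n_cars = n())
n_16  <- toy %>% filter(hwy > 16) %>%
         group_by(manufacturer) %>%
         summarise(n_cars_16 = n())
via_join <- n_all %>%
            left_join(n_16, by = "manufacturer") %>%
            mutate(prop_cars_16 = n_cars_16 / n_cars)

# Logical-vector approach: "b" stays in the result with a count of 0
via_logical <- toy %>% group_by(manufacturer) %>%
               summarise(n_cars = n(),
                         n_cars_16 = sum(hwy > 16),
                         prop_cars_16 = mean(hwy > 16))

via_join$n_cars_16
## [1]  1 NA
via_logical$n_cars_16
## [1] 1 0
```

With the join-based version you would need an extra step to replace the NA values with 0, so the logical-vector version is arguably both shorter and safer here.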

Who knew that logical vectors were the secret to simple and efficient dplyr code? As Alex has mentioned before, small wins aren’t life changing, but if you find enough of them, things start to feel a lot easier.
