Hadley Wickham’s dplyr package is an incredibly powerful R package for data analysis. A common data analysis technique, known as split-apply-combine, involves creating statistical summaries by groups within a data frame.
This post will introduce the power of using logical vectors within your dplyr code to create complex data summaries with ease.
Imagine we have data from a survey we recently conducted where 7 people responded and provided their age. This data is stored in the age vector below.
age <- c(23, 31, 27, 41, 54, 34, 25)
age
## [1] 23 31 27 41 54 34 25
What if we would like to know the number of people who are 30 or older and what percentage of the total respondents this group represents?
We can answer this question by first using R’s >= operator to find where values stored in the age vector are greater than or equal to the value 30. Anytime we use R’s comparison operators (>, >=, <, <=, ==) on a vector, we will get a logical vector consisting of TRUE/FALSE values indicating where our condition was met.
For example, running the code below produces a sequence of TRUE/FALSE values that test where our respondents are 30 or older in the age vector.
age >= 30
## [1] FALSE TRUE FALSE TRUE TRUE TRUE FALSE
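The other comparison operators behave the same way, and logical vectors can be combined element by element with & and |. Here is a minimal sketch using the same age vector; the variable names and the 30-to-49 range are arbitrary choices for illustration.

```r
age <- c(23, 31, 27, 41, 54, 34, 25)

# Each comparison returns one TRUE/FALSE value per element
under_30 <- age < 30
is_31    <- age == 31

# Combine conditions element-wise with & (and) or | (or)
in_30s_or_40s <- age >= 30 & age < 50

under_30
is_31
in_30s_or_40s
```

Each result has the same length as age, so it lines up with the original responses one for one.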
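To answer our question above, we can use the following properties of logical vectors in R:
1. Passing a logical vector to sum() returns the number of TRUE values
2. Passing a logical vector to mean() returns the proportion of TRUE values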
We see from the output below that 4 people in our survey were 30 years or older and that this represents 57% of the total respondents.
sum(age >= 30)
## [1] 4
mean(age >= 30)
## [1] 0.5714286
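These results follow from how R coerces logical values to numbers: TRUE becomes 1 and FALSE becomes 0. The sketch below makes that coercion explicit. The age_with_na vector is a made-up addition, not part of the survey data, included only to show what happens when a response is missing.

```r
age <- c(23, 31, 27, 41, 54, 34, 25)

# Coercion made explicit: TRUE -> 1, FALSE -> 0
as.integer(age >= 30)

# sum() counts the TRUEs; mean() is that count divided by the length
count_30_plus <- sum(age >= 30)
prop_30_plus  <- sum(age >= 30) / length(age)

# Hypothetical vector with one missing response: comparing NA gives NA,
# so sum() and mean() return NA unless na.rm = TRUE is supplied
age_with_na <- c(age, NA)
sum(age_with_na >= 30)
sum(age_with_na >= 30, na.rm = TRUE)
mean(age_with_na >= 30, na.rm = TRUE)
```

With na.rm = TRUE the proportion is computed over the known responses only, which may or may not be the denominator you want.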
Let’s go through a simple example where using these two properties can help with performing complex statistical summaries with dplyr.
We will be working with a subset of the mpg dataset, which ships with ggplot2 and is therefore available as soon as the tidyverse package is loaded in R.
This dataset contains the fuel efficiency and other interesting properties of 234 cars.
library(tidyverse)
mpg_df <- mpg %>%
  select(manufacturer, model, drv, hwy)
mpg_df %>% head()
## # A tibble: 6 x 4
## manufacturer model drv hwy
## <chr> <chr> <chr> <int>
## 1 audi a4 f 29
## 2 audi a4 f 29
## 3 audi a4 f 31
## 4 audi a4 f 30
## 5 audi a4 f 26
## 6 audi a4 f 26
Using the split-apply-combine technique with dplyr usually involves taking a data frame, forming subsets with the group_by() function, applying a summary function to the groups, and collecting the results into a single data frame.
A simple example would be to answer the following questions about our subset of mpg:
How many cars are there by manufacturer? What is the average highway fuel efficiency by manufacturer?
The code below answers these questions with ease.
mpg_df %>%
  group_by(manufacturer) %>%
  summarise(n_cars = n(),
            avg_hwy = mean(hwy))
## # A tibble: 15 x 3
## manufacturer n_cars avg_hwy
## <chr> <int> <dbl>
## 1 audi 18 26.4
## 2 chevrolet 19 21.9
## 3 dodge 37 17.9
## 4 ford 25 19.4
## 5 honda 9 32.6
## 6 hyundai 14 26.9
## 7 jeep 8 17.6
## 8 land rover 4 16.5
## 9 lincoln 3 17
## 10 mercury 4 18
## 11 nissan 13 24.6
## 12 pontiac 5 26.4
## 13 subaru 14 25.6
## 14 toyota 34 24.9
## 15 volkswagen 27 29.2
What if someone asked us the following questions:
How many cars have a highway fuel efficiency greater than 16, by manufacturer? What proportion of the total cars does this group represent within each manufacturer?
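This question can be answered without the use of logical vectors, but it involves a surprising amount of work! The steps are listed below:
1. Count the cars by manufacturer
2. Count the cars by manufacturer that have a hwy value greater than 16 into a separate data frame
3. Join the two data frames and compute the proportion within each manufacturer
The R code below implements this logic.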
# Counts by manufacturer
cars_by_manuf <- mpg_df %>%
  group_by(manufacturer) %>%
  summarise(n_cars = n())
# Counts by manufacturer for hwy > 16
cars_by_manuf_16 <- mpg_df %>%
  filter(hwy > 16) %>%
  group_by(manufacturer) %>%
  summarise(n_cars_16 = n())
# Combine into one data frame and compute proportion within each group
result <- cars_by_manuf %>%
  left_join(cars_by_manuf_16, by = 'manufacturer') %>%
  mutate(prop_cars_16 = n_cars_16/n_cars)
# View results
result
## # A tibble: 15 x 4
## manufacturer n_cars n_cars_16 prop_cars_16
## <chr> <int> <int> <dbl>
## 1 audi 18 18 1
## 2 chevrolet 19 16 0.842
## 3 dodge 37 25 0.676
## 4 ford 25 22 0.88
## 5 honda 9 9 1
## 6 hyundai 14 14 1
## 7 jeep 8 6 0.75
## 8 land rover 4 2 0.5
## 9 lincoln 3 2 0.667
## 10 mercury 4 4 1
## 11 nissan 13 13 1
## 12 pontiac 5 5 1
## 13 subaru 14 14 1
## 14 toyota 34 33 0.971
## 15 volkswagen 27 27 1
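One caveat with this multi-step approach is worth sketching on a small made-up data frame (toy_df and its values are assumptions for illustration, not part of mpg). A group with no rows passing the filter disappears from the filtered counts, so the join leaves NA rather than 0 for that group. Every manufacturer in mpg has at least one car with hwy above 16, so the issue does not show up in the table above.

```r
library(dplyr)

# Hypothetical data: manufacturer "b" has no cars with hwy > 16
toy_df <- tibble(
  manufacturer = c("a", "a", "a", "b", "b"),
  hwy          = c(20L, 15L, 30L, 12L, 16L)
)

totals <- toy_df %>%
  group_by(manufacturer) %>%
  summarise(n_cars = n())

over_16 <- toy_df %>%
  filter(hwy > 16) %>%
  group_by(manufacturer) %>%
  summarise(n_cars_16 = n())

# "b" is absent from over_16, so its n_cars_16 is NA after the join
joined <- totals %>%
  left_join(over_16, by = "manufacturer")

# coalesce() replaces the NA with an explicit zero before dividing
fixed <- joined %>%
  mutate(n_cars_16 = coalesce(n_cars_16, 0L),
         prop_cars_16 = n_cars_16 / n_cars)

fixed
```

The extra coalesce() step is one more thing to remember when counts are built from a filtered copy of the data.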
I wasn’t joking when I said that it was a surprising amount of work! Let’s see how logical vectors can come to our rescue.
Using the two properties of logical vectors from above, we can compute the results in a single dplyr expression.
mpg_df %>%
  group_by(manufacturer) %>%
  summarise(n_cars = n(),
            n_cars_16 = sum(hwy > 16),
            prop_cars_16 = mean(hwy > 16))
## # A tibble: 15 x 4
## manufacturer n_cars n_cars_16 prop_cars_16
## <chr> <int> <int> <dbl>
## 1 audi 18 18 1
## 2 chevrolet 19 16 0.842
## 3 dodge 37 25 0.676
## 4 ford 25 22 0.88
## 5 honda 9 9 1
## 6 hyundai 14 14 1
## 7 jeep 8 6 0.75
## 8 land rover 4 2 0.5
## 9 lincoln 3 2 0.667
## 10 mercury 4 4 1
## 11 nissan 13 13 1
## 12 pontiac 5 5 1
## 13 subaru 14 14 1
## 14 toyota 34 33 0.971
## 15 volkswagen 27 27 1
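The same pattern extends to several conditions in one summarise() call. Below is a small sketch on made-up data; toy_df, its values, the column names in the summary, and the thresholds are all assumptions for illustration.

```r
library(dplyr)

# Hypothetical data with the same column names as mpg_df
toy_df <- tibble(
  manufacturer = c("a", "a", "a", "b", "b"),
  drv          = c("f", "4", "f", "4", "4"),
  hwy          = c(20L, 15L, 30L, 12L, 16L)
)

summary_df <- toy_df %>%
  group_by(manufacturer) %>%
  summarise(n_cars     = n(),
            # count of cars above the threshold
            n_hwy_16   = sum(hwy > 16),
            # two conditions combined with & before counting
            n_4wd_low  = sum(drv == "4" & hwy <= 16),
            # proportion expressed as a rounded percentage
            pct_hwy_16 = round(100 * mean(hwy > 16), 1))

summary_df
```

Because every group is summarised directly, a manufacturer with no cars meeting a condition gets a count of 0 without any extra handling.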
Who knew that logical vectors were the secret to simple and efficient dplyr code? As Alex has mentioned before, small wins aren’t life changing, but if you find enough of them, things start to feel a lot easier.