Text data can be tricky to work with. Add a separator like a comma, semicolon, or space, and it becomes even more challenging to manipulate into a format that is ready for analysis.
Although tools like Excel and Google Sheets provide UI to disentangle such text records, most only handle the simple use case of breaking the string into multiple columns in its original order.
Let’s a look at some sample data with results from a multiple choice survey question: What are your favorite colors?
Respondents can select as many colors as they desire from a list of seven. Five respondents take the survey and raw data is downloaded from the survey platform and imported into an R dataframe named survey
.
respondent | response |
---|---|
1 | blue;green;red |
2 | brown |
3 | orange;pink |
4 | black;red |
5 | pink;red |
The Text To Columns… feature in Excel is similar to separate() from dplyr in R, which needs to know names for which columns that results should be placed. I’ve manually added 1:7
because there are only seven distinct colors in the survey and therefore any given respondent could select seven at most.
library(tidyverse)
survey %>%
separate(col = response, sep = ";", into= str_c('color',1:7 )) # color1 through color7 column names
respondent | color1 | color2 | color3 | color4 | color5 | color6 | color7 |
---|---|---|---|---|---|---|---|
1 | blue | green | red | NA | NA | NA | NA |
2 | brown | NA | NA | NA | NA | NA | NA |
3 | orange | pink | NA | NA | NA | NA | NA |
4 | black | red | NA | NA | NA | NA | NA |
5 | pink | red | NA | NA | NA | NA | NA |
Wouldn’t it be better to have results from each unique response option grouped automatically? Enter the library splitstackshape by Ananda Mahto.
Its core function cSplit() would achieve the same result as separate() with less effort by running cSplit(indt = survey, splitCols = "response", sep = ";")
.
But to solve our problem we want to use the function cSplit_e(), passing in the following parameters.
library(splitstackshape)
cSplit_e(data = survey, split.col = "response", sep=";", type = "character")
respondent | response | response_black | response_blue | response_brown | response_green | response_orange | response_pink | response_red |
---|---|---|---|---|---|---|---|---|
1 | blue;green;red | NA | 1 | NA | 1 | NA | NA | 1 |
2 | brown | NA | NA | 1 | NA | NA | NA | NA |
3 | orange;pink | NA | NA | NA | NA | 1 | 1 | NA |
4 | black;red | 1 | NA | NA | NA | NA | NA | 1 |
5 | pink;red | NA | NA | NA | NA | NA | 1 | 1 |
The function automatically splits the data into as many unique values are contained within the column - regardless of what position they might are reported! If a given respondent did not mention a specific color, the cell becomes NA
. We could remove the original column by adding drop=TRUE
and change missing values to zero with fill=0
.
clean <- cSplit_e(data = survey, split.col = "response", sep=";", drop = TRUE, type = "character", fill = 0)
clean
respondent | response_black | response_blue | response_brown | response_green | response_orange | response_pink | response_red |
---|---|---|---|---|---|---|---|
1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
5 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
From here we can head back to the tidyverse to explore the survey results.
results <- clean %>%
select(contains("response_")) %>% # select the columns with color detail
pivot_longer(cols = everything()) %>% # make it long format
group_by(name) %>%
summarise(selected = sum(value)) %>% # count colors selected
mutate(percent = selected / nrow(survey)) %>% # calculate percent of total
arrange(desc(selected)) %>%
mutate(name = str_replace_all(name, "response_", "")) # get rid of prefix
results
Now we know how many people selected each color with 60 percent of respondents indicating that red
, for example, is one of their favorite colors.
name | selected | percent |
---|---|---|
red | 3 | 0.6 |
pink | 2 | 0.4 |
black | 1 | 0.2 |
blue | 1 | 0.2 |
brown | 1 | 0.2 |
green | 1 | 0.2 |
orange | 1 | 0.2 |
Finally, we turn to ggplot to make a simple visualization showcasing the relative popularity of the colors red
and pink
among our survey respondents.
ggplot(results, aes(x = reorder(name, desc(percent)), y = percent )) + geom_col(fill='#1FA187') +
geom_text(aes(label=scales::percent(percent)),nudge_y = .01) +
theme_minimal() +
scale_y_continuous(labels=percent) +
labs(title='Percent of Survey Respondents Selecting a Given Color',
x = NULL, y = NULL)
In just a few lines of code we can convert a messy text column with inconsistently delimited values into a tidy dataframe, calculate summary results, and generate a clean visualization to convey our findings.
Not bad at all.
library(tidyverse)
library(splitstackshape)
library(DT)
library(scales)
survey <- tibble(respondent = 1:5,
response = c('blue;green;red',
'brown',
'orange;pink',
'black;red',
'pink;red'
))
clean <- cSplit_e(data = survey, split.col = "response", sep=";", drop = TRUE, type = "character", fill = 0)
results <- clean %>% select(contains("response_")) %>%
pivot_longer(cols = everything()) %>%
group_by(name) %>%
summarise(selected = sum(value)) %>%
mutate(percent = selected / nrow(survey)) %>%
arrange(desc(selected)) %>%
mutate(name = str_replace_all(name, "response_", ""))
ggplot(results, aes(x = reorder(name, desc(percent)), y = percent )) + geom_col(fill='#1FA187') +
geom_text(aes(label=scales::percent(percent)),nudge_y = .01) +
theme_minimal() +
scale_y_continuous(labels=percent) +
labs(title='Percent of Respondents Selecting a Given Color',
x = NULL, y = NULL)