aggregating columns based on commas

I have the following data frame, in which the Details column holds comma-separated names. I'm trying to split on the commas, turn each distinct name into its own column, and record whether that name is present for each ID (1 = Yes, 0 = No). Any help would be appreciated! Thanks!
ID<- c(1,2,3,4,5,6)
Details<- c("V1,V2", "V1,V3", "V1", "V2", "V3,V4", "V2,V3" )
data.frame <- data.frame(ID, Details, stringsAsFactors=FALSE)
DESIRED OUTPUT:
ID<-c(1,2,3,4,5,6)
V1<-c(1,1,1,0,0,0)
V2<-c(1,0,0,1,0,1)
V3<-c(0,1,0,0,1,1)
V4<-c(0,0,0,0,1,0)
data.frame1<-data.frame(ID, V1, V2, V3, V4, stringsAsFactors=FALSE)
5 Answers
A solution using the tidyverse package. dat is your example data frame (called data.frame in the question); dat2 is the final data frame.
library(tidyverse)
dat2 <- dat %>%
  separate_rows(Details) %>%
  mutate(Value = 1L) %>%
  spread(Details, Value, fill = 0L)
dat2
# ID V1 V2 V3 V4
# 1 1 1 1 0 0
# 2 2 1 0 1 0
# 3 3 1 0 0 0
# 4 4 0 1 0 0
# 5 5 0 0 1 1
# 6 6 0 1 1 0
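In current tidyr releases, spread() is documented as superseded by pivot_wider(). A minimal sketch of the same pipeline with pivot_wider(), assuming tidyr 1.0.0 or later is installed. dat is rebuilt here from the question's vectors so the block runs on its own, and wide is just an illustrative name:

```r
library(dplyr)
library(tidyr)

# Rebuild the question's data so the example is self-contained
dat <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6),
  Details = c("V1,V2", "V1,V3", "V1", "V2", "V3,V4", "V2,V3"),
  stringsAsFactors = FALSE
)

wide <- dat %>%
  separate_rows(Details, sep = ",") %>%        # one row per ID/name pair
  mutate(Value = 1L) %>%                       # mark each pair as present
  pivot_wider(names_from = Details,
              values_from = Value,
              values_fill = list(Value = 0L))  # absent pairs become 0
```

The result is a tibble rather than a plain data frame; wrapping it in as.data.frame() converts it if that matters downstream.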
One option with mtabulate from qdapTools:
library(qdapTools)
cbind.data.frame(ID, # or data.frame$ID
mtabulate(strsplit(as.character(data.frame$Details), ",")))
# output
ID V1 V2 V3 V4
1 1 1 1 0 0
2 2 1 0 1 0
3 3 1 0 0 0
4 4 0 1 0 0
5 5 0 0 1 1
6 6 0 1 1 0
Here is a base R solution. I have renamed your data.frames data1 and data2.
data1 <- data.frame(ID, Details, stringsAsFactors=FALSE)
data2 <- data.frame(ID, V1, V2, V3, V4, stringsAsFactors=FALSE)
nms <- unique(unlist(strsplit(data1$Details, ",")))
data3 <- cbind.data.frame(ID, sapply(nms, grepl, data1$Details))
data3[-1] <- lapply(data3[-1], as.integer)
Now compare data3 with your expected result data2.
all.equal(data2, data3)
#[1] TRUE
Note, however, that
identical(data2, data3)
#[1] FALSE
This is because I have used as.integer and the values in data2 are of class "numeric". If this makes a difference, you can change the lapply instruction above to use as.numeric.
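The difference between the two comparisons can be seen on a pair of small vectors. This is only an illustration; x_int and x_dbl are made-up names, not part of the answer's code:

```r
x_int <- c(1L, 0L, 1L)   # integer, as produced by as.integer()
x_dbl <- c(1, 0, 1)      # double ("numeric"), as in the expected result

same_type   <- identical(x_int, x_dbl)               # FALSE: storage types differ
same_values <- isTRUE(all.equal(x_int, x_dbl))       # TRUE: values agree
after_cast  <- identical(as.numeric(x_int), x_dbl)   # TRUE once the types match
```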
Using base R:
xtabs(val~.,cbind.data.frame(ID=rep(ID,lengths(s<-strsplit(Details,","))),Details=unlist(s),val=1))
Details
ID V1 V2 V3 V4
1 1 1 0 0
2 1 0 1 0
3 1 0 0 0
4 0 1 0 0
5 0 0 1 1
6 0 1 1 0
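The one-liner above is dense, so here is a sketch of the same idea split into steps. The intermediate names long, tab and wide are illustrative, and the last step (turning the table into a data frame with an ID column) is an optional extra that the one-liner does not do:

```r
ID <- c(1, 2, 3, 4, 5, 6)
Details <- c("V1,V2", "V1,V3", "V1", "V2", "V3,V4", "V2,V3")

# Split each string into its comma-separated names
s <- strsplit(Details, ",")

# Long format: repeat each ID once per name it carries
long <- data.frame(
  ID = rep(ID, lengths(s)),
  Details = unlist(s),
  val = 1
)

# Cross-tabulate: sum val for every ID/name combination
tab <- xtabs(val ~ ID + Details, data = long)

# Optional: convert the table to a plain data frame with an ID column
wide <- data.frame(ID = as.numeric(rownames(tab)),
                   as.data.frame.matrix(tab),
                   row.names = NULL)
```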
The most straightforward way I see would be to build a data.frame for each of the vectors hidden in the strings and bind them together. purrr can help make this quite compact. Note that the ID column isn't needed; I'll work on Details directly.
library(purrr)
df <- map_dfr(strsplit(Details, ","),
~data.frame(t(setNames(rep(1, length(.x)), .x))))
df[is.na(df)] <- 0
# V1 V2 V3 V4
# 1 1 1 0 0
# 2 1 0 1 0
# 3 1 0 0 0
# 4 0 1 0 0
# 5 0 0 1 1
# 6 0 1 1 0
You could also split and unlist to get distinct values, and then look them up in the original vector:
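unique_v <- unique(unlist(strsplit(Details, ",")))
# set_names() makes the column names come from unique_v itself rather than from auto-naming
map_dfc(set_names(unique_v), ~as.numeric(grepl(.x, Details)))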
# # A tibble: 6 x 4
# V1 V2 V3 V4
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 0 0
# 2 1 0 1 0
# 3 1 0 0 0
# 4 0 1 0 0
# 5 0 0 1 1
# 6 0 1 1 0
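One caveat with grepl(): it does substring matching, so a pattern such as V1 would also match a name like V10. That does not happen in the question's data, but a sketch that compares whole names instead could look like this. Details2 and has_name are illustrative names, and the V10 value is invented purely to show the difference:

```r
# Hypothetical data where one name is a prefix of another
Details2 <- c("V1,V2", "V10", "V2,V10")

parts <- strsplit(Details2, ",")
unique_v <- unique(unlist(parts))   # "V1" "V2" "V10"

# For each name, test whether it is one of the split pieces of each row
has_name <- sapply(unique_v, function(nm) {
  as.integer(vapply(parts, function(p) nm %in% p, logical(1)))
})

has_name
```

Here the V1 column is 1 only for the first row, whereas grepl("V1", Details2) would also flag the rows that contain V10.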
We could also do some dirty string evaluation, if you know the number of columns:
m <- as.data.frame(matrix(0,ncol=4,nrow=6))
eval(parse(text=paste0("m[",ID,", c(",gsub("V","",Details),")] <- 1")))
# V1 V2 V3 V4
# 1 1 1 0 0
# 2 1 0 1 0
# 3 1 0 0 0
# 4 0 1 0 0
# 5 0 0 1 1
# 6 0 1 1 0
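The same fill can be done without eval(parse()) by indexing a matrix with a two-column matrix of (row, column) pairs. A sketch, assuming the column set is derived from the data rather than fixed in advance and that rows are addressed by position rather than by ID; nms, idx and m2 are illustrative names:

```r
Details <- c("V1,V2", "V1,V3", "V1", "V2", "V3,V4", "V2,V3")

s <- strsplit(Details, ",")
nms <- sort(unique(unlist(s)))          # column names found in the data

# One (row, column) pair per name occurrence
idx <- cbind(rep(seq_along(s), lengths(s)),
             match(unlist(s), nms))

m2 <- matrix(0, nrow = length(s), ncol = length(nms),
             dimnames = list(NULL, nms))
m2[idx] <- 1                            # set every listed cell to 1
m2 <- as.data.frame(m2)
```

This avoids building code as strings and does not depend on the names having the form V followed by a number.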