Column manipulation R

Column manipulation R

I'm pretty new to working with R and I was hoping someone could point me in the right way. I have a data set in the form of a tibble, and I need to go through each row.
In each row I have to check columns in sets of 3.
i.e if one of the columns value=0 I have to delete all three columns and evaluate the next 3 columns.

data_set <- as.data.frame(matrix(nrow=2)) data_set$Basket1<- c(45,35) data_set$Type1 <- c("Normal","Premium") data_set$Amount1 <- c(4,5) data_set$Basket2 <- c(4,98) data_set$Type2 <- c("Normal","Normal") data_set$Amount2 <- c(0,4) #when Type is "Premium" I want to remove the values for #Basket1,Type1,Amount1 #and shift the next 3 cells to the left

Please include a reproducible example showing some example data, the code you have tried so far, and the output you're expecting.
– Marius
Jul 3 at 0:04

when you say "delete all three columns" do you mean what it sounds like? eg if any row of the column has a zero, all three columns are removed from the table altogether? Or do you just mean the values in that one row become NA?
– Peter Ellis
Jul 3 at 0:07

@Zoe, please post samples in your posts, comments are NOT meant for that, let us know once you do so.
– RavinderSingh13
Jul 3 at 0:19

Sorry, just updated the post. Not sure how to format the data to a table in the post. First time using this
– Zoe
Jul 3 at 0:22

@Zoe for reference on creating a reproducible example in R you can check out this post.
– Mihai Chelaru
Jul 3 at 0:29

4 Answers
4

Based on the discussion in the comments to my previous answer, here is another way of doing it that is robust to not knowing the number of columns in advance. In this situation it always pays to first convert the data into a known shape, and the only shape it can be in this situation is a long thin one. Once you've got it in that condition, you can group it by customer id and do whatever manipulations you need that way:

library(tidyverse) data_set %>% # add a unique identifier for the original shape: mutate(customer = 1:n()) %>% # turn into a long thin format: gather(variable, value, -customer) %>% # remove any NA values from the unbalanced table (and the empty column at the beginning of the original) filter(!is.na(value)) %>% # extract the numbers from the original column names: mutate(column_set = as.numeric(str_extract(variable, "[0-9]+")), variable = gsub("[0-9]+", "", variable)) %>% # limit the data to the first "normal" column set for each customer group_by(customer) %>% mutate(best_column_set = min(column_set[value == "Normal"])) %>% filter(column_set == best_column_set) %>% # drop the columns we don't need and return to wide format: select(-column_set, -best_column_set) %>% spread(variable, value) %>% # convert from characters back to numbers mutate(Basket = as.numeric(Basket), Amount = as.numeric(Amount))

This returns:

# A tibble: 2 x 4 # Groups: customer [2] customer Amount Basket Type <int> <dbl> <dbl> <chr> 1 1 4 45 Normal 2 2 4 98 Normal

This method depends on their being quite some structure in how the original columns are labelled; for example that they are numbered by column set, and that there aren't any other numbers in the column names.

Is there an email address I can contact you or is that frowned upon on here? I feel like I have confused everyone and I can explain it better with an excel file
– Zoe
Jul 3 at 2:30

sure. peter.ellis2013nz at gmail dot com
– Peter Ellis
Jul 3 at 2:40

Thanks. Sent you an email.
– Zoe
Jul 3 at 2:55

Please see my update at the end based on the new example you added.

#I have a data set in the form of a tibble data_set <- as.data.frame(matrix(nrow=8)) data_set$column1_set1 <- c(1,1,1,1,1,1,1,1) data_set$column2_set1 <- c(1,1,1,1,0,1,1,1) data_set$column3_set1 <- c(1,1,1,1,1,1,1,1) data_set$column1_set2 <- c(1,1,1,1,1,1,1,1) data_set$column2_set2 <- c(1,1,1,1,1,1,1,1) data_set$column3_set2 <- c(1,1,1,1,1,1,1,1) data_set$V1 <- NULL data_set <- as.tibble(data_set) # In each row I have to check columns in sets of 3. # i.e if one of the columns value=0 I have to delete all three columns # and evaluate the next 3 columns.

You can do so like this:

cn <- colnames(data_set) for(i in seq(1,length(cn),3)){ if(any(colSums(data_set[,i:(i+2)]) < nrow(data_set))){ data_set <- data_set[,!colnames(data_set) %in% cn[i:(i+2)]] } else{ next } }

In the new example we have some non-numeric columns. The only change we have to make is to check if they are numeric first.

cn <- colnames(data_set) for(i in seq(1,length(cn),3)){ cn_tmp <- cn[i:(i+2)] cn_tmp <- ifelse(class(data_set[,colnames(data_set) %in% cn_tmp])=="numeric", cn_tmp, cn_tmp[!cn_tmp==cn_tmp[i]]) cn_tmp <- ifelse(class(data_set[,colnames(data_set) %in% cn_tmp])=="numeric", cn_tmp, cn_tmp[!cn_tmp==cn_tmp[i+1]]) cn_tmp <- ifelse(class(data_set[,colnames(data_set) %in% cn_tmp])=="numeric", cn_tmp, cn_tmp[!cn_tmp==cn_tmp[i+2]]) if(any(colSums(data_set[,colnames(data_set) %in% cn_tmp]) < nrow(data_set))){ data_set <- data_set[,!colnames(data_set) %in% cn[i:(i+2)]] } else{ next } }

Thank you I updated the post with formatted code and comments
– Zoe
Jul 3 at 0:46

@Zoe OK based on your example I will edit my solution to check if the column is numeric
– Hack-R
Jul 3 at 0:49

I would strongly recommend not thinking in terms of shifting values leftwards or rightwards - it's not making the most of R's data frame objects, where the columns should be seen as having integrity. So instead of shifting and deleting cells, I think you should add new columns to the right based on the logic you need and then (if necessary) delete all the original columns. Here's one way to do this, creating new amount_n, basket_n and type_n columns and discarding the rest at the end:

library(dplyr) data_set <- data_set %>% mutate( basket_n = case_when( # If Type1 is Normal we use its basket: Type1 == "Normal" ~ Basket1, # If not, then see if Type2 is normal and we can use that (and so on): Type2 == "Normal" ~ Basket2 ), amount_n = case_when( Type1 == "Normal" ~ Amount1, Type2 == "Normal" ~ Amount2 ), type_n = "Normal" ) %>% select(type_n, basket_n, amount_n)

How do I change this so that I don't have to refer to the column name? This is an unbalanced table so each row can have different multiples of 3 column sets. For ex: if each row is a customer the 3 column set could be different baskets they bought. Different customers can have different number of baskets.
– Zoe
Jul 3 at 1:55

You've lost me I'm afraid. Are you saying you don't know how many columns you have in advance? (because it is defined dynamically and you are trying to automate this)?
– Peter Ellis
Jul 3 at 2:03

Yes, I wouldn't know how many column columns each row has. I'm trying to run association rules so the end goal is to have only the values for Basket for each record. So for data cleaning I have to remove the "baskets" where Type=Premium or Type=Normal and Amount=0
– Zoe
Jul 3 at 2:03

I would take a different approach, more like a database developer, and normalise the data into a long thin version. I'll do a separate answer.
– Peter Ellis
Jul 3 at 2:04

data=read.table(text=" Basket1 Type1 Amount1 Basket2 Type2 Amount2 Basket3 Type3 Amount3 1 Normal 1 10 Normal 3 12 Premium 0 2 Normal 0 3 Normal 3 45 Premium 3 3 Normal 1 3 Normal 3 3 Premium 67 4 Normal 1 5 Normal 3 54 Premium 45",h=T)

to shift left, you can do:

data%>% rownames_to_column%>% reshape(matrix(2:ncol(.),3),idvar = 1,dir="long")%>% filter(!rowSums(.==0)>0)%>% group_by(rowname)%>% mutate(time=1:n())%>% arrange(time,rowname)%>% data.frame()%>% reshape(timevar = "time",idvar = "rowname",dir="wide")%>% select(-rowname)%>% rename_all(~sub("d.","",.x)) Basket1 Type1 Amount1 Basket2 Type2 Amount2 Basket3 Type3 Amount3 1 1 Normal 1 10 Normal 3 NA <NA> NA 2 3 Normal 3 45 Premium 3 NA <NA> NA 3 3 Normal 1 3 Normal 3 3 Premium 67 4 4 Normal 1 5 Normal 3 54 Premium 45

EDIT::

For the data that you provided later:

data_set[-1]%>% rownames_to_column%>% reshape(matrix(2:ncol(.),3),idvar = 1,dir="long")%>% filter(!rowSums(.==0)>0)%>% group_by(rowname)%>%mutate(time=1:n())%>% arrange(time,rowname)%>% data.frame()%>% reshape(timevar = "time",idvar = "rowname",dir="wide")%>% select(-rowname)%>% rename_all(~sub("d.","",.x)) Basket1 Type1 Amount1 Basket2 Type2 Amount2 1 45 Normal 4 NA <NA> NA 2 35 Premium 5 98 Normal 4

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

OpZ BzcRM 0tGJvtg dBY7j640DYPNGdMdKvhGKyDvjEVAIJjvlZ3 mu Cdd MG4IWn83Wu3nDOMO

搜尋此網誌

Fjhtyj