Count the repeat pattern in SAS or R

Multi tool use
Count the repeat pattern in SAS or R
I have a dataset which has all the clicks done on the website in 1 column. I want to find the pattern which gets repeated in the whole data and the data contains more than 1 Million rows and has 17000 different patterns. I also want to know the average time spend on each click for each pattern. I have written a code in SAS which groups each pattern and also finds the time difference between each click but I am not getting the output how I want. Also, if a pattern contains the consecutive keyword "one" then I want it to be merged and considered as single keyword "one".
For example, according to my code, I am getting this output:
Clicks Group Time(Seconds)
A 1 6
B 1 2
C 1 4
one 1 0
D 2 12
E 2 5
F 2 0
A 3 9
B 3 6
C 3 7
one 3 6
one 3 0
H 4 8
I 4 9
J 4 0
Output expected:
Clicks Average Time Count
ABCone A-7.5,B-4,C-0,one-2 2
DEF D-12,E-5,F-0 1
HIJ H-8,I-9,J-0 1
3 Answers
3
The following reproduces your expected output.
df %>%
group_by(Clicks) %>%
mutate(`Average Time` = paste(sprintf("%s-%2.1f", Clicks, mean(Time.Seconds.)))) %>%
group_by(Group) %>%
mutate(
Clicks = paste(Clicks, collapse = ""),
`Average Time` = paste(`Average Time`, collapse = ",")) %>%
slice(1) %>%
ungroup() %>%
select(-Group, -Time.Seconds.) %>%
count(Clicks, `Average Time`)
## A tibble: 3 x 3
# Clicks `Average Time` n
# <chr> <chr> <int>
#1 ABC A-7.5,B-4.0,C-0.0 2
#2 DEF D-12.0,E-5.0,F-0.0 1
#3 HIJ H-8.0,I-9.0,J-0.0 1
It's a fairly straightforward matter of different (re-)grouping and paste
ing entries.
paste
df <- read.table(text =
"Clicks Group Time(Seconds)
A 1 6
B 1 2
C 1 0
D 2 12
E 2 5
F 2 0
A 3 9
B 3 6
C 3 0
H 4 8
I 4 9
J 4 0 ", header = T)
For your updated data (note there is a mistake in your expected output for the average of C
)
C
df %>%
group_by(Clicks) %>% # Do the averaging
mutate(`Average Time` = paste(sprintf("%s-%2.1f", Clicks, mean(Time.Seconds.)))) %>%
group_by(Clicks, Group) %>% # Deal with duplicates per Clicks+Group
slice(1) %>%
group_by(Group) %>% # Paste entries
mutate(
Clicks = paste(Clicks, collapse = ""),
`Average Time` = paste(`Average Time`, collapse = ",")) %>%
slice(1) %>%
ungroup() %>% # Ungroup to prepare for counting
select(-Group, -Time.Seconds.) %>%
count(Clicks, `Average Time`)
## A tibble: 3 x 3
# Clicks `Average Time` n
# <chr> <chr> <int>
#1 ABCone A-7.5,B-4.0,C-5.5,one-2.0 2
#2 DEF D-12.0,E-5.0,F-0.0 1
#3 HIJ H-8.0,I-9.0,J-0.0 1
And the updated data
df <- read.table(text =
"Clicks Group Time(Seconds)
A 1 6
B 1 2
C 1 4
one 1 0
D 2 12
E 2 5
F 2 0
A 3 9
B 3 6
C 3 7
one 3 6
one 3 0
H 4 8
I 4 9
J 4 0 ", header = T)
Thank you the code works perfectly for me! And I am really sorry when I was running your code at that time I realize there was one more additional thing. I have edited the question so can you please help me with the edited part.
– vrushank shah
Jul 3 at 0:40
@vrushankshah I have updated my answer to include a solution for your revised data.
– Maurits Evers
Jul 3 at 1:47
You'll get more help if you post the data in a way we can copy and paste to work with it. I'm thinking that dplyr
would help here.
dplyr
Edit:
Someone edited the OP to be more parseable. I was able to get you close, but the Average Time column isn't quite what you wanted.
test %>%
group_by(Group) %>%
mutate(Click_Order = paste0(Clicks, collapse = "")) %>%
group_by(Click_Order) %>%
summarise(Average_Time = mean(Time), Count = n()/3) %>%
arrange(desc(Count))
# A tibble: 3 x 3
Click_Order Average_Time Count
<chr> <dbl> <dbl>
1 ABC 3.83 2.
2 DEF 5.67 1.
3 HIJ 5.67 1.
In SAS
Proc MEANS
CLASS
Sample code
data have; input
Clicks $ Group Time; datalines;
A 1 6
B 1 2
C 1 0
D 2 12
E 2 5
F 2 0
A 3 9
B 3 6
C 3 0
H 4 8
I 4 9
J 4 0
run;
* presume no clicks value contains pipe (|) character;
data have2 / view=have2;
length pattern $30;
pattern = '|'; * prepare for bounded token search via INDEX();
do _n_ = 1 by 1 until (last.group);
set have;
by group;
* use this line if all items in group are known to be distinct ;
* pattern = cats(pattern,clicks);
* track observed clicks by searching the growing pattern of the group;
bounded_token = cats( '|', clicks, '|' );
if index (pattern, trim(bounded_token) ) = 0 then
pattern = cats (pattern, clicks, '|');
end;
if length (pattern) = lengthc(pattern) then do;
put 'WARNING: pattern needs more length';
stop;
end;
* remove token bounders;
pattern = compress(pattern,'|');
do _n_ = 1 to _n_;
set have;
output;
end;
run;
proc means noprint data=have2;
class pattern clicks;
var time;
ways 2;
output out=have_means mean=mean ;
run;
data want (keep=pattern time_summary _freq_);
do until (last.pattern);
set have_means;
by pattern;
length time_summary $100;
time_summary = catx(',',time_summary,catx('-',clicks,mean));
end;
run;
Thank you the code works perfectly for me! And I am really sorry when I was running your code at that time I realize there was one more additional thing. I have edited the question so can you please help me with the edited part.
– vrushank shah
Jul 3 at 1:16
The computation requirement for the pattern appears to have changed to concatenation of distinct clicks values. This can still be done in serial DOW loops, but would require 'clicks' value tracking in an array or hash in the first loop. Do you know apriori the largest number of distinct items in a group ?
– Richard
Jul 3 at 1:22
Review the pattern computation -- it checks for pre-observed clicks values within group by using bounded tokens that are searched for during iteration in order to ensure distinctness within pattern value
– Richard
Jul 3 at 1:39
What if I want to compute the average of whole pattern as well ? Is there any easy way to do it ? I wanted to show the average time of whole pattern and then the average time for each click. I am sorry to bug you but I keep on getting new things up
– vrushank shah
Jul 3 at 21:15
The
Proc MEANS
in the example code has WAYS 2
. WAYS
controls the variety of combinations of the class variables that are processed. Likewise, the MEANS statement TYPES
can also control the variables defining what is combined for the summary statistics computations. Try replacing WAYS 2
with TYPES pattern*clicks pattern
. Be sure to read the PROC MEANS documentation for a better understanding.– Richard
Jul 4 at 18:56
Proc MEANS
WAYS 2
WAYS
TYPES
WAYS 2
TYPES pattern*clicks pattern
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Show the SAS code you wrote
– Richard
Jul 3 at 0:13