Complex SQL query, no answer anywhere :(


Complex SQL query, no answer anywhere :(



Assume thet I have in SQL database events with properties date and user_id. I have 10 records in table event:


date


user_id


event


1. user_id=1, date=2018.04.10
2. user_id=1, date=2018.04.11
3. user_id=1, date=2018.04.13
4. user_id=1, date=2018.04.17
5. user_id=1, date=2018.04.18
6. user_id=2, date=2018.04.12
7. user_id=2, date=2018.04.12
8. user_id=2, date=2018.04.13
9. user_id=2, date=2018.04.15
10. user_id=2, date=2018.04.16



Is it possible to write query using standard SQL syntax that will show me only records for every user, that date is at least 2 days difference.
So:


1. user_id=1, date=2018.04.10 will be in result
2. user_id=1, date=2018.04.11 not in result, only 1 day difference
3. user_id=1, date=2018.04.12 will be in result, 2 days dif from record nbr 1.
4. user_id=1, date=2018.04.17 will be in result, 5 days dif from record nbr 3.
5. user_id=1, date=2018.04.18 no in result, only 1 day dif from record nbr4.
6. user_id=2, date=2018.04.12 will be in result
7. user_id=2, date=2018.04.12 not in result, 0 day difference
8. user_id=2, date=2018.04.13 not in result, only 1 day dif from record nbr. 6.
9. user_id=2, date=2018.04.15 will be in result, 2 days dif from record nbr 8.
10. user_id=2, date=2018.04.16 not in result, only 1 day dif from record nbr. 9.



Please, help guys, no one in my office cant help me with it :(
I will use this query in Google BigQuery





You must use a self join correlating instances by nbr = nbr+1
– jean
Jul 2 at 13:13






Could you JOIN the table to itself and add a WHERE DATEDIFF(DAY, table1.date, table2.date) >= 2 and see what happens?
– Captain Kenpachi
Jul 2 at 13:13





Your question is ill-defined, because you have ties in the date column. SQL tables represent unordered sets. The problem is specifying what the previous value is when there are ties.
– Gordon Linoff
Jul 2 at 13:13


date





You have a low rate. Important on SO - you can mark accepted answer by using the tick on the left of the posted answer, below the voting. See meta.stackexchange.com/questions/5234/… for why it is important! Also important to vote on answer. Vote up answers that are helpful. ... You can check about what to do when someone answers your question - stackoverflow.com/help/someone-answers. Following these simple rules you increase your own reputation score and at the same time you keep us motivated to answer your questions :o) please consider!
– Mikhail Berlyant
Jul 5 at 21:46


mark accepted answer


vote on answer




3 Answers
3



Just use lag():


lag()


select e.*
from (select e.*,
lag(prev_date) over (partition by user_id order by date) as prev_date
from events e
) e
where prev_date is null or
date > date_add(prev_date, interval 2 day);



Note: This may be a little tricky with the tied dates. It would be better if you added a second column to the order by so the ordering is stable.


order by



If you have only these two columns, then you can do a distinct at the lowest level to address this issue:


select e.*
from (select e.*,
lag(prev_date) over (partition by user_id order by date) as prev_date
from (select distinct e.* from events e) e
) e
where prev_date is null or
date > date_add(prev_date, interval 2 day);



As Gordon Linoff's comment stated : "SQL tables represent unordered sets" and it is crucial to realize that when writing queries to address this kind of problem.



Another way to approach the case is to step back and consider what the reason might be for not including some given row in the result set. Going by the data sample you provided, it seems to be the existence of another row (i.e. distinct from the given one) that has the same user_id and a date that is either equal or one day before the date from the given row. That translates trivially to a standard correlated subquery using WHERE NOT EXISTS. And it is much more precise than your vague "that date is at least 2 days difference", which begs the question of "difference with what ?" .


SELECT USER_ID,DATE
FROM EVENT E
WHERE NOT EXISTS (SELECT * FROM EVENT E2
WHERE E2.USER_ID = E.USER_ID
AND
<appropriate comparison here between E2.DATE and E.DATE>
AND
<appropriate comparison here to ascertain only distinct rows are processed>);



Using the window functions may address your problem correctly as well, but as Gordon Linoff has pointed out, one must be rather vigilant with how they behave over ties, and over the "first" and "last" rows of any group.



EDIT



There also seems to be an issue why you are saying for line 3. "will be in result, 2 days dif from record nbr 1." Why are you not comparing here with line 2 ??? Because line 2 has not been retained for the result set and you want the comparison to always be with the "last row retained" ??? That makes the problem/solution inherently recursive and makes both mine and Gordon's solution inapplicable.



Below is for BigQuery Standard SQL



As Erwin mentioned in EDIT on his answer - problem/solution inherently recursive and makes both mine and Gordon's solution inapplicable so below solution addresses recursiveness. Also it properly deals with your date fields by parsing it into DATE type and after all calc done - formats it back to your notation. etc.


EDIT


problem/solution inherently recursive and makes both mine and Gordon's solution inapplicable


#standardSQL
CREATE TEMPORARY FUNCTION qualified_entries(arr ARRAY<DATE>)
RETURNS ARRAY<DATE>
LANGUAGE js AS """
var result = ; prev = null; day = 1000*60*60*24;
for (i = 0; i < arr.length; i++) {
if (i == 0 || Math.round((arr[i].getTime() - prev)/day) > 2) {
result.push(arr[i]);
prev = arr[i].getTime();
}
};
return result;
""";
SELECT user_id, FORMAT_DATE('%Y.%m.%d', dt) dt FROM (
SELECT user_id, qualified_entries(ARRAY_AGG(PARSE_DATE('%Y.%m.%d', dt)))dt
FROM `project.dataset.table`
GROUP BY user_id
), UNNEST(dt) dt



You can test / play with above using dummy data from your question as below


#standardSQL
CREATE TEMPORARY FUNCTION qualified_entries(arr ARRAY<DATE>)
RETURNS ARRAY<DATE>
LANGUAGE js AS """
var result = ; prev = null; day = 1000*60*60*24;
for (i = 0; i < arr.length; i++) {
if (i == 0 || Math.round((arr[i].getTime() - prev)/day) > 2) {
result.push(arr[i]);
prev = arr[i].getTime();
}
};
return result;
""";
WITH `project.dataset.table` AS (
SELECT 1 user_id, '2018.04.10' dt UNION ALL
SELECT 1, '2018.04.11' UNION ALL
SELECT 1, '2018.04.13' UNION ALL
SELECT 1, '2018.04.17' UNION ALL
SELECT 1, '2018.04.18' UNION ALL
SELECT 2, '2018.04.12' UNION ALL
SELECT 2, '2018.04.12' UNION ALL
SELECT 2, '2018.04.13' UNION ALL
SELECT 2, '2018.04.15' UNION ALL
SELECT 2, '2018.04.16'
)
SELECT user_id, FORMAT_DATE('%Y.%m.%d', dt) dt FROM (
SELECT user_id, qualified_entries(ARRAY_AGG(PARSE_DATE('%Y.%m.%d', dt)))dt
FROM `project.dataset.table`
GROUP BY user_id
), UNNEST(dt) dt
-- ORDER BY user_id, dt



with result as


Row user_id dt
1 1 2018.04.10
2 1 2018.04.13
3 1 2018.04.17
4 2 2018.04.12
5 2 2018.04.15






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

api-platform.com Unable to generate an IRI for the item of type

How to set up datasource with Spring for HikariCP?

Display dokan vendor name on Woocommerce single product pages