diff --git a/docs/challenging-sql-problems/challenging-sql-problems.md b/docs/challenging-sql-problems/challenging-sql-problems.md index 93b571c..f1993e7 100644 --- a/docs/challenging-sql-problems/challenging-sql-problems.md +++ b/docs/challenging-sql-problems/challenging-sql-problems.md @@ -36,15 +36,15 @@ These require a bit more thinking. 5. [Region precipitation](problems/silver/region-precipitation.md) 6. [Predicting values](problems/silver/predicting-values.md) 7. [Mandelbrot set](problems/silver/mandelbrot-set.md) -8. [Encoding datelist ints](problems/silver/encoding-datelist-ints.md) -9. [Customer sales running totals](problems/silver/customer-sales-running-totals.md) +8. [Customer sales running totals](problems/silver/customer-sales-running-totals.md) ### 🟡 Gold Tier Expect to spend a bit of time on these. -1. [Loan repayment schedules](problems/gold/loan-repayment-schedule.md) -2. [Travel plans](problems/gold/travel-plans.md) +1. [Encoding datelist ints](problems/gold/encoding-datelist-ints.md) +2. [Loan repayment schedules](problems/gold/loan-repayment-schedule.md) +3. [Travel plans](problems/gold/travel-plans.md) --- diff --git a/docs/challenging-sql-problems/problems/gold/encoding-datelist-ints--sample-input.sql b/docs/challenging-sql-problems/problems/gold/encoding-datelist-ints--sample-input.sql new file mode 100644 index 0000000..4c6f423 --- /dev/null +++ b/docs/challenging-sql-problems/problems/gold/encoding-datelist-ints--sample-input.sql @@ -0,0 +1,17 @@ +```sql +with events(event_id, user_id, event_datetime, event_type) as ( + values + (1, 1, '2024-01-01 01:03:00'::timestamp, 'login'), + (2, 1, '2024-01-04 01:02:00'::timestamp, 'login'), + (3, 1, '2024-01-05 01:01:00'::timestamp, 'login'), + (4, 1, '2024-01-06 01:00:00'::timestamp, 'logout'), + (5, 1, '2024-01-07 01:05:00'::timestamp, 'logout'), + (6, 1, '2024-01-07 01:06:00'::timestamp, 'logout'), + (7, 2, '2024-01-08 01:07:00'::timestamp, 'login'), + (8, 2, '2024-01-09 01:08:00'::timestamp, 'login'), + (9, 2, '2024-01-10 01:09:00'::timestamp, 'login'), + (10, 2, '2024-01-10 01:10:00'::timestamp, 'logout'), + (11, 2, '2024-01-11 01:11:00'::timestamp, 'logout'), + (12, 2, '2024-01-12 01:12:00'::timestamp, 'logout') +) +``` diff --git a/docs/challenging-sql-problems/problems/silver/encoding-datelist-ints--sample-output.sql b/docs/challenging-sql-problems/problems/gold/encoding-datelist-ints--sample-output.sql similarity index 50% rename from docs/challenging-sql-problems/problems/silver/encoding-datelist-ints--sample-output.sql rename to docs/challenging-sql-problems/problems/gold/encoding-datelist-ints--sample-output.sql index 89a0745..0f03b9b 100644 --- a/docs/challenging-sql-problems/problems/silver/encoding-datelist-ints--sample-output.sql +++ b/docs/challenging-sql-problems/problems/gold/encoding-datelist-ints--sample-output.sql @@ -1,7 +1,7 @@ ```sql solution(user_id, last_update, activity_history) as ( values - (1, '2024-01-09'::date, 496), - (2, '2024-01-09'::date, 7) + (1, '2024-01-12'::date, 3520), + (2, '2024-01-12'::date, 28) ) ``` diff --git a/docs/challenging-sql-problems/problems/gold/encoding-datelist-ints.md b/docs/challenging-sql-problems/problems/gold/encoding-datelist-ints.md new file mode 100644 index 0000000..7030c20 --- /dev/null +++ b/docs/challenging-sql-problems/problems/gold/encoding-datelist-ints.md @@ -0,0 +1,149 @@ +# Encoding datelist ints 🔐 + +> [!SUCCESS] Scenario +> +> Inspired by the modelling approach used by the social media platform from the [customer churn](../bronze/customer-churn.md)/[decoding datelist ints](../silver/decoding-datelist-ints.md) problems, the company from the [suspicious login activity](../bronze/suspicious-login-activity.md)/[bannable login activity](../silver/bannable-login-activity.md) problems want to implement the "datelist integer" for their users' login history. +> +> This company has also given us more information about how their system works: logins expire after 24 hours, and only _manual_ user logouts are recorded in the events table. +> +> This means that: +> +> - A consecutive login within 24 hours (inclusive) of a previous login keeps the user logged in. +> - A logout event when the user is already logged out does nothing; it can be ignored. + +> [!QUESTION] +> +> Encode the login history for each user into a datelist integer, where a day is flagged as `1` if the user was logged in at any point on that day, `0` otherwise. +> +> The encoding should be relative to the day of the latest event in the table. +> +> The output should have a single row per user with the columns: +> +> - `user_id` +> - `last_update` as the date of the latest event (the date that the encoding is relative to) +> - `activity_history` as the datelist integer +> +> Order the output by `user_id`. + +
+Expand for the DDL +--8<-- "docs/challenging-sql-problems/problems/bronze/suspicious-login-activity.sql" +
+ +The solution can be found at: + +- [encoding-datelist-ints.md](../../solutions/gold/encoding-datelist-ints.md) + +A worked example is provided below to help illustrate the encoding. + +--- + + +>? INFO: **Sample input** +> +> | event_id | user_id | event_datetime | event_type | +> |---------:|--------:|:--------------------|:-----------| +> | 1 | 1 | 2024-01-01 01:03:00 | login | +> | 2 | 1 | 2024-01-04 01:02:00 | login | +> | 3 | 1 | 2024-01-05 01:01:00 | login | +> | 4 | 1 | 2024-01-06 01:00:00 | logout | +> | 5 | 1 | 2024-01-07 01:05:00 | logout | +> | 6 | 1 | 2024-01-07 01:06:00 | logout | +> | 7 | 2 | 2024-01-08 01:07:00 | login | +> | 8 | 2 | 2024-01-09 01:08:00 | login | +> | 9 | 2 | 2024-01-10 01:09:00 | login | +> | 10 | 2 | 2024-01-10 01:10:00 | logout | +> | 11 | 2 | 2024-01-11 01:11:00 | logout | +> | 12 | 2 | 2024-01-12 01:12:00 | logout | +> +--8<-- "docs/challenging-sql-problems/problems/gold/encoding-datelist-ints--sample-input.sql" + + +>? INFO: **Sample output** +> +> | user_id | last_update | activity_history | +> |--------:|:------------|-----------------:| +> | 1 | 2024-01-12 | 3520 | +> | 2 | 2024-01-12 | 28 | +> +--8<-- "docs/challenging-sql-problems/problems/gold/encoding-datelist-ints--sample-output.sql" + + +>? TIP: **Hint 1** +> +> First construct a table with one row per day per user, and a column which flags if the user was logged in at any point on that day. + + +>? TIP: **Hint 2** +> +> Multiply the flag column by a power of 2 to get the value component for that day, then sum them up to get the datelist integer. +> +> Note that the power of 2 should be the number of days between the current row's day and the latest event's day. + +--- + +### Worked example + +To help illustrate the encoding, consider the following events: + +| event_id | user_id | event_datetime | event_type | +| -------: | ------: | :------------------ | :--------- | +| 1 | 1 | 2024-01-01 01:03:00 | login | +| 2 | 1 | 2024-01-04 01:02:00 | login | +| 3 | 1 | 2024-01-05 01:01:00 | login | +| 4 | 1 | 2024-01-06 01:00:00 | logout | +| 5 | 1 | 2024-01-07 01:05:00 | logout | +| 6 | 1 | 2024-01-07 01:06:00 | logout | +| 7 | 2 | 2024-01-08 01:07:00 | login | +| 8 | 2 | 2024-01-09 01:08:00 | login | +| 9 | 2 | 2024-01-10 01:09:00 | login | +| 10 | 2 | 2024-01-10 01:10:00 | logout | +| 11 | 2 | 2024-01-11 01:11:00 | logout | +| 12 | 2 | 2024-01-12 01:12:00 | logout | + +We'll walk through each of the events and how they contribute to different sessions. + +1. User 1 logs in at `2024-01-01 01:03:00`, starting the first session. Their next event is a login at `2024-01-04 01:02:00` which is over 24 hours; so the first session expires at `2024-01-02 01:03:00`. +2. User 1 logs in at `2024-01-04 01:02:00`, starting the second session. Their next event is a login at `2024-01-05 01:01:00` which is under 24 hours; so the second session continues. +3. User 1 logs in at `2024-01-05 01:01:00`, continuing the second session. Their next event is a logout at `2024-01-06 01:00:00` which is under 24 hours; so the second session ends at `2024-01-06 01:00:00`. +4. User 1 logs out at `2024-01-06 01:00:00`, which we've already accounted for. +5. User 1 logs out at `2024-01-07 01:05:00`, which does nothing. +6. User 1 logs out at `2024-01-07 01:06:00`, which does nothing. +7. User 2 logs in at `2024-01-08 01:07:00`, starting the third session. Their next event is a login at `2024-01-09 01:08:00` which is under 24 hours; so the third session continues. +8. User 2 logs in at `2024-01-09 01:08:00`, continuing the third session. Their next event is a login at `2024-01-10 01:09:00` which is under 24 hours; so the third session continues. +9. User 2 logs in at `2024-01-10 01:09:00`, continuing the third session. Their next event is a logout at `2024-01-10 01:10:00` which is under 24 hours; so the third session ends at `2024-01-10 01:10:00`. +10. User 2 logs out at `2024-01-10 01:10:00`, which we've already accounted for. +11. User 2 logs out at `2024-01-11 01:11:00`, which does nothing. +12. User 2 logs out at `2024-01-12 01:12:00`, which does nothing. + +This gives us the following sessions: + +| user_id | login_date | logout_date | +| :------ | :------------------ | :------------------ | +| 1 | 2024-01-01 01:03:00 | 2024-01-02 01:03:00 | +| 1 | 2024-01-04 01:02:00 | 2024-01-06 01:00:00 | +| 2 | 2024-01-08 01:07:00 | 2024-01-09 01:07:00 | +| 2 | 2024-01-09 01:08:00 | 2024-01-10 01:08:00 | +| 2 | 2024-01-10 01:09:00 | 2024-01-10 01:10:00 | + +The earliest and latest events in the `events` table are on `2024-01-01` and `2024-01-12` respectively, so the encoding is relative to the `2024-01-12`. We can plot the sessions on a timeline: + +| active_date | user_1_is_active | user_2_is_active | +| :---------- | ---------------: | ---------------: | +| 2024-01-01 | 1 | 0 | +| 2024-01-02 | 1 | 0 | +| 2024-01-03 | 0 | 0 | +| 2024-01-04 | 1 | 0 | +| 2024-01-05 | 1 | 0 | +| 2024-01-06 | 1 | 0 | +| 2024-01-07 | 0 | 0 | +| 2024-01-08 | 0 | 1 | +| 2024-01-09 | 0 | 1 | +| 2024-01-10 | 0 | 1 | +| 2024-01-11 | 0 | 0 | +| 2024-01-12 | 0 | 0 | + +Since the latest event is on `2024-01-12`, the encoding for `2024-01-12` uses 20, the encoding for `2024-01-11` uses 21, and so on. This gives us the following datelist integers: + +- User 1: 2048 + 1024 + 256 + 128 + 64 = 3520 +- User 2: 16 + 8 + 4 = 28 diff --git a/docs/challenging-sql-problems/problems/silver/encoding-datelist-ints--sample-input.sql b/docs/challenging-sql-problems/problems/silver/encoding-datelist-ints--sample-input.sql deleted file mode 100644 index 33d30d6..0000000 --- a/docs/challenging-sql-problems/problems/silver/encoding-datelist-ints--sample-input.sql +++ /dev/null @@ -1,17 +0,0 @@ -```sql -with events(event_id, user_id, event_datetime, event_type) as ( - values - (1, 1, '2024-01-01 01:03:00'::timestamp, 'login'), - (2, 1, '2024-01-03 01:02:00'::timestamp, 'login'), - (3, 1, '2024-01-04 01:01:00'::timestamp, 'login'), - (4, 1, '2024-01-05 01:00:00'::timestamp, 'logout'), - (5, 1, '2024-01-06 01:05:00'::timestamp, 'logout'), - (6, 1, '2024-01-06 01:06:00'::timestamp, 'logout'), - (7, 2, '2024-01-07 01:07:00'::timestamp, 'login'), - (8, 2, '2024-01-08 01:08:00'::timestamp, 'login'), - (9, 2, '2024-01-09 01:09:00'::timestamp, 'login'), - (10, 2, '2024-01-09 01:10:00'::timestamp, 'logout'), - (11, 2, '2024-01-10 01:11:00'::timestamp, 'logout'), - (12, 2, '2024-01-11 01:12:00'::timestamp, 'logout') -) -``` diff --git a/docs/challenging-sql-problems/problems/silver/encoding-datelist-ints.md b/docs/challenging-sql-problems/problems/silver/encoding-datelist-ints.md deleted file mode 100644 index 0bb39ba..0000000 --- a/docs/challenging-sql-problems/problems/silver/encoding-datelist-ints.md +++ /dev/null @@ -1,79 +0,0 @@ -# Encoding datelist ints 🔐 - -> [!SUCCESS] Scenario -> -> Inspired by the modelling approach used by the social media platform from the [customer churn](../bronze/customer-churn.md)/[decoding datelist ints](decoding-datelist-ints.md) problems, the company from the [suspicious login activity](../bronze/suspicious-login-activity.md)/[bannable login activity](bannable-login-activity.md) problems want to implement the "datelist integer" for their users' login history. -> -> This company has also given us more information about how their system works: logins expire after 24 hours, and only _manual_ user logouts are recorded in the events table. -> -> This means that: -> -> - A consecutive login within 24 hours (inclusive) of a previous login keeps the user logged in. -> - A logout event when the user is already logged out does nothing; it can be ignored. - -> [!QUESTION] -> -> Encode the login history for each user into a datelist integer, where a day is flagged as `1` if the user was logged in at any point on that day, `0` otherwise. -> -> The encoding should be relative to the day of the latest event in the table. -> -> The output should have a single row per user with the columns: -> -> - `user_id` -> - `last_update` as the date of the latest event (the date that the encoding is relative to) -> - `activity_history` as the datelist integer -> -> Order the output by `user_id`. - -
-Expand for the DDL ---8<-- "docs/challenging-sql-problems/problems/bronze/suspicious-login-activity.sql" -
- -The solution can be found at: - -- [encoding-datelist-ints.md](../../solutions/silver/encoding-datelist-ints.md) - ---- - - ->? INFO: **Sample input** -> -> | event_id | user_id | event_datetime | event_type | -> |---------:|--------:|:--------------------|:-----------| -> | 1 | 1 | 2024-01-01 01:03:00 | login | -> | 2 | 1 | 2024-01-03 01:02:00 | login | -> | 3 | 1 | 2024-01-04 01:01:00 | login | -> | 4 | 1 | 2024-01-05 01:00:00 | logout | -> | 5 | 1 | 2024-01-06 01:05:00 | logout | -> | 6 | 1 | 2024-01-06 01:06:00 | logout | -> | 7 | 2 | 2024-01-07 01:07:00 | login | -> | 8 | 2 | 2024-01-08 01:08:00 | login | -> | 9 | 2 | 2024-01-09 01:09:00 | login | -> | 10 | 2 | 2024-01-09 01:10:00 | logout | -> | 11 | 2 | 2024-01-10 01:11:00 | logout | -> | 12 | 2 | 2024-01-11 01:12:00 | logout | -> ---8<-- "docs/challenging-sql-problems/problems/silver/encoding-datelist-ints--sample-input.sql" - - ->? INFO: **Sample output** -> -> | user_id | last_update | activity_history | -> |--------:|:------------|-----------------:| -> | 1 | 2024-01-09 | 496 | -> | 2 | 2024-01-09 | 7 | -> ---8<-- "docs/challenging-sql-problems/problems/silver/encoding-datelist-ints--sample-output.sql" - - ->? TIP: **Hint 1** -> -> First construct a table with one row per day per user, and a column which flags if the user was logged in at any point on that day. - - ->? TIP: **Hint 2** -> -> Multiply the flag column by a power of 2 to get the value component for that day, then sum them up to get the datelist integer. -> -> Note that the power of 2 should be the number of days between the current row's day and the latest event's day. diff --git a/docs/challenging-sql-problems/solutions/silver/encoding-datelist-ints--duckdb.sql b/docs/challenging-sql-problems/solutions/gold/encoding-datelist-ints--duckdb.sql similarity index 58% rename from docs/challenging-sql-problems/solutions/silver/encoding-datelist-ints--duckdb.sql rename to docs/challenging-sql-problems/solutions/gold/encoding-datelist-ints--duckdb.sql index f213788..104017a 100644 --- a/docs/challenging-sql-problems/solutions/silver/encoding-datelist-ints--duckdb.sql +++ b/docs/challenging-sql-problems/solutions/gold/encoding-datelist-ints--duckdb.sql @@ -1,30 +1,40 @@ ```sql -with recursive +with -event_groups as ( +session_datetimes as ( select event_id, user_id, - event_type, event_datetime as login_datetime, - (0 - + row_number() over (partition by user_id order by event_datetime) - - row_number() over (partition by user_id, event_type order by event_datetime) - ) as event_group, coalesce( ( select min(event_datetime) from events as innr where 1=1 - and events.event_type = 'login' - and innr.event_type = 'logout' and events.user_id = innr.user_id - and events.event_datetime < innr.event_datetime + and innr.event_datetime > events.event_datetime and innr.event_datetime <= events.event_datetime + interval '1 day' + and innr.event_type = 'logout' ), events.event_datetime + interval '1 day' - ) as logout_datetime + ) as logout_datetime, from events + where event_type = 'login' +), + +event_groups as ( + select + *, + sum(is_new_session::int) over (order by login_datetime) as session_id + from ( + select + *, + login_datetime >= lag(logout_datetime, 1, login_datetime) over ( + partition by user_id + order by login_datetime + ) as is_new_session + from session_datetimes + ) ), sessions as ( @@ -33,14 +43,13 @@ sessions as ( min(login_datetime)::date as login_date, max(logout_datetime)::date as logout_date from event_groups - where event_type = 'login' - group by user_id, event_group + group by user_id, session_id ), dates(active_date) as ( select unnest(generate_series( - (select min(login_date) from sessions), - (select max(logout_date) from sessions), + (select min(event_datetime)::date from events), + (select max(event_datetime)::date from events), interval '1 day' )) ), @@ -49,17 +58,20 @@ activity(user_id, active_date, is_active) as ( select users.user_id, dates.active_date::date as active_date, - if(sessions.user_id is null, 0, 1) as is_active, + exists( + select * + from sessions + where 1=1 + and users.user_id = sessions.user_id + and dates.active_date between sessions.login_date + and sessions.logout_date + )::int as is_active, row_number() over ( partition by users.user_id order by dates.active_date desc ) as step from (select distinct user_id from sessions) as users cross join dates - left join sessions - on users.user_id = sessions.user_id - and dates.active_date between sessions.login_date - and sessions.logout_date ) select diff --git a/docs/challenging-sql-problems/solutions/silver/encoding-datelist-ints.md b/docs/challenging-sql-problems/solutions/gold/encoding-datelist-ints.md similarity index 52% rename from docs/challenging-sql-problems/solutions/silver/encoding-datelist-ints.md rename to docs/challenging-sql-problems/solutions/gold/encoding-datelist-ints.md index 42c5cab..03282aa 100644 --- a/docs/challenging-sql-problems/solutions/silver/encoding-datelist-ints.md +++ b/docs/challenging-sql-problems/solutions/gold/encoding-datelist-ints.md @@ -4,7 +4,7 @@ > > Solution to the following problem: > -> - [encoding-datelist-ints.md](../../problems/silver/encoding-datelist-ints.md) +> - [encoding-datelist-ints.md](../../problems/gold/encoding-datelist-ints.md) ## Result Set @@ -12,13 +12,13 @@ Regardless of the database, the result set should look like: | user_id | last_update | activity_history | | ------: | :---------- | ---------------: | -| 1 | 2024-01-26 | 41943064 | -| 2 | 2024-01-26 | 14694414 | -| 3 | 2024-01-26 | 3210303 | +| 1 | 2024-02-01 | 2684356096 | +| 2 | 2024-02-01 | 940442496 | +| 3 | 2024-02-01 | 204672192 |
Expand for the DDL ---8<-- "docs/challenging-sql-problems/solutions/silver/encoding-datelist-ints.sql" +--8<-- "docs/challenging-sql-problems/solutions/gold/encoding-datelist-ints.sql"
## Solution @@ -28,4 +28,4 @@ Some SQL solutions per database are provided below. > SUCCESS: **DuckDB** > ---8<-- "docs/challenging-sql-problems/solutions/silver/encoding-datelist-ints--duckdb.sql" +--8<-- "docs/challenging-sql-problems/solutions/gold/encoding-datelist-ints--duckdb.sql" diff --git a/docs/challenging-sql-problems/solutions/gold/encoding-datelist-ints.sql b/docs/challenging-sql-problems/solutions/gold/encoding-datelist-ints.sql new file mode 100644 index 0000000..58b7230 --- /dev/null +++ b/docs/challenging-sql-problems/solutions/gold/encoding-datelist-ints.sql @@ -0,0 +1,8 @@ +```sql +solution(user_id, last_update, activity_history) as ( + values + (1, '2024-02-01'::date, 2684356096), + (2, '2024-02-01'::date, 940442496), + (3, '2024-02-01'::date, 204672192) +) +``` diff --git a/docs/challenging-sql-problems/solutions/silver/encoding-datelist-ints.sql b/docs/challenging-sql-problems/solutions/silver/encoding-datelist-ints.sql deleted file mode 100644 index 83be240..0000000 --- a/docs/challenging-sql-problems/solutions/silver/encoding-datelist-ints.sql +++ /dev/null @@ -1,8 +0,0 @@ -```sql -solution(user_id, last_update, activity_history) as ( - values - (1, '2024-01-26'::date, 41943064), - (2, '2024-01-26'::date, 14694414), - (3, '2024-01-26'::date, 3210303) -) -``` diff --git a/mkdocs.yml b/mkdocs.yml index c65dffd..9e1866e 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -92,9 +92,9 @@ nav: - challenging-sql-problems/problems/silver/region-precipitation.md # unpivot and rollup - challenging-sql-problems/problems/silver/predicting-values.md # unpivot, regr, and pivot - challenging-sql-problems/problems/silver/mandelbrot-set.md # recursive CTE - - challenging-sql-problems/problems/silver/encoding-datelist-ints.md # window functions (gaps and islands), correlated subquery, custom axis - challenging-sql-problems/problems/silver/customer-sales-running-totals.md # window functions - 🟡 Gold Tier: + - challenging-sql-problems/problems/gold/encoding-datelist-ints.md # window functions (gaps and islands), correlated subquery, custom axis - challenging-sql-problems/problems/gold/loan-repayment-schedule.md # recursive CTE - challenging-sql-problems/problems/gold/travel-plans.md # recursive CTEs - Solutions: @@ -115,9 +115,9 @@ nav: - challenging-sql-problems/solutions/silver/region-precipitation.md - challenging-sql-problems/solutions/silver/predicting-values.md - challenging-sql-problems/solutions/silver/mandelbrot-set.md - - challenging-sql-problems/solutions/silver/encoding-datelist-ints.md - challenging-sql-problems/solutions/silver/customer-sales-running-totals.md - 🟡 Gold Tier: + - challenging-sql-problems/solutions/gold/encoding-datelist-ints.md - challenging-sql-problems/solutions/gold/loan-repayment-schedule.md - challenging-sql-problems/solutions/gold/travel-plans.md