Skip to content

Commit

Permalink
Merge pull request #29
Browse files Browse the repository at this point in the history
Improve datelist encoding solution
  • Loading branch information
Bilbottom authored Jun 25, 2024
2 parents 18a6ec0 + ca16556 commit 68013a4
Show file tree
Hide file tree
Showing 11 changed files with 220 additions and 138 deletions.
8 changes: 4 additions & 4 deletions docs/challenging-sql-problems/challenging-sql-problems.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,15 +36,15 @@ These require a bit more thinking.
5. [Region precipitation](problems/silver/region-precipitation.md)
6. [Predicting values](problems/silver/predicting-values.md)
7. [Mandelbrot set](problems/silver/mandelbrot-set.md)
8. [Encoding datelist ints](problems/silver/encoding-datelist-ints.md)
9. [Customer sales running totals](problems/silver/customer-sales-running-totals.md)
8. [Customer sales running totals](problems/silver/customer-sales-running-totals.md)

### 🟡 Gold Tier

Expect to spend a bit of time on these.

1. [Loan repayment schedules](problems/gold/loan-repayment-schedule.md)
2. [Travel plans](problems/gold/travel-plans.md)
1. [Encoding datelist ints](problems/gold/encoding-datelist-ints.md)
2. [Loan repayment schedules](problems/gold/loan-repayment-schedule.md)
3. [Travel plans](problems/gold/travel-plans.md)

---

Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
```sql
with events(event_id, user_id, event_datetime, event_type) as (
values
(1, 1, '2024-01-01 01:03:00'::timestamp, 'login'),
(2, 1, '2024-01-04 01:02:00'::timestamp, 'login'),
(3, 1, '2024-01-05 01:01:00'::timestamp, 'login'),
(4, 1, '2024-01-06 01:00:00'::timestamp, 'logout'),
(5, 1, '2024-01-07 01:05:00'::timestamp, 'logout'),
(6, 1, '2024-01-07 01:06:00'::timestamp, 'logout'),
(7, 2, '2024-01-08 01:07:00'::timestamp, 'login'),
(8, 2, '2024-01-09 01:08:00'::timestamp, 'login'),
(9, 2, '2024-01-10 01:09:00'::timestamp, 'login'),
(10, 2, '2024-01-10 01:10:00'::timestamp, 'logout'),
(11, 2, '2024-01-11 01:11:00'::timestamp, 'logout'),
(12, 2, '2024-01-12 01:12:00'::timestamp, 'logout')
)
```
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
```sql
solution(user_id, last_update, activity_history) as (
values
(1, '2024-01-09'::date, 496),
(2, '2024-01-09'::date, 7)
(1, '2024-01-12'::date, 3520),
(2, '2024-01-12'::date, 28)
)
```
149 changes: 149 additions & 0 deletions docs/challenging-sql-problems/problems/gold/encoding-datelist-ints.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,149 @@
# Encoding datelist ints 🔐

> [!SUCCESS] Scenario
>
> Inspired by the modelling approach used by the social media platform from the [customer churn](../bronze/customer-churn.md)/[decoding datelist ints](../silver/decoding-datelist-ints.md) problems, the company from the [suspicious login activity](../bronze/suspicious-login-activity.md)/[bannable login activity](../silver/bannable-login-activity.md) problems want to implement the "datelist integer" for their users' login history.
>
> This company has also given us more information about how their system works: logins expire after 24 hours, and only _manual_ user logouts are recorded in the events table.
>
> This means that:
>
> - A consecutive login within 24 hours (inclusive) of a previous login keeps the user logged in.
> - A logout event when the user is already logged out does nothing; it can be ignored.
> [!QUESTION]
>
> Encode the login history for each user into a datelist integer, where a day is flagged as `1` if the user was logged in at any point on that day, `0` otherwise.
>
> The encoding should be relative to the day of the latest event in the table.
>
> The output should have a single row per user with the columns:
>
> - `user_id`
> - `last_update` as the date of the latest event (the date that the encoding is relative to)
> - `activity_history` as the datelist integer
>
> Order the output by `user_id`.
<details>
<summary>Expand for the DDL</summary>
--8<-- "docs/challenging-sql-problems/problems/bronze/suspicious-login-activity.sql"
</details>

The solution can be found at:

- [encoding-datelist-ints.md](../../solutions/gold/encoding-datelist-ints.md)

A worked example is provided below to help illustrate the encoding.

---

<!-- prettier-ignore -->
>? INFO: **Sample input**
>
> | event_id | user_id | event_datetime | event_type |
> |---------:|--------:|:--------------------|:-----------|
> | 1 | 1 | 2024-01-01 01:03:00 | login |
> | 2 | 1 | 2024-01-04 01:02:00 | login |
> | 3 | 1 | 2024-01-05 01:01:00 | login |
> | 4 | 1 | 2024-01-06 01:00:00 | logout |
> | 5 | 1 | 2024-01-07 01:05:00 | logout |
> | 6 | 1 | 2024-01-07 01:06:00 | logout |
> | 7 | 2 | 2024-01-08 01:07:00 | login |
> | 8 | 2 | 2024-01-09 01:08:00 | login |
> | 9 | 2 | 2024-01-10 01:09:00 | login |
> | 10 | 2 | 2024-01-10 01:10:00 | logout |
> | 11 | 2 | 2024-01-11 01:11:00 | logout |
> | 12 | 2 | 2024-01-12 01:12:00 | logout |
>
--8<-- "docs/challenging-sql-problems/problems/gold/encoding-datelist-ints--sample-input.sql"

<!-- prettier-ignore -->
>? INFO: **Sample output**
>
> | user_id | last_update | activity_history |
> |--------:|:------------|-----------------:|
> | 1 | 2024-01-12 | 3520 |
> | 2 | 2024-01-12 | 28 |
>
--8<-- "docs/challenging-sql-problems/problems/gold/encoding-datelist-ints--sample-output.sql"

<!-- prettier-ignore -->
>? TIP: **Hint 1**
>
> First construct a table with one row per day per user, and a column which flags if the user was logged in at any point on that day.
<!-- prettier-ignore -->
>? TIP: **Hint 2**
>
> Multiply the flag column by a power of 2 to get the value component for that day, then sum them up to get the datelist integer.
>
> Note that the power of 2 should be the number of days between the current row's day and the latest event's day.
---

### Worked example

To help illustrate the encoding, consider the following events:

| event_id | user_id | event_datetime | event_type |
| -------: | ------: | :------------------ | :--------- |
| 1 | 1 | 2024-01-01 01:03:00 | login |
| 2 | 1 | 2024-01-04 01:02:00 | login |
| 3 | 1 | 2024-01-05 01:01:00 | login |
| 4 | 1 | 2024-01-06 01:00:00 | logout |
| 5 | 1 | 2024-01-07 01:05:00 | logout |
| 6 | 1 | 2024-01-07 01:06:00 | logout |
| 7 | 2 | 2024-01-08 01:07:00 | login |
| 8 | 2 | 2024-01-09 01:08:00 | login |
| 9 | 2 | 2024-01-10 01:09:00 | login |
| 10 | 2 | 2024-01-10 01:10:00 | logout |
| 11 | 2 | 2024-01-11 01:11:00 | logout |
| 12 | 2 | 2024-01-12 01:12:00 | logout |

We'll walk through each of the events and how they contribute to different sessions.

1. User 1 logs in at `2024-01-01 01:03:00`, starting the first session. Their next event is a login at `2024-01-04 01:02:00` which is over 24 hours; so the first session expires at `2024-01-02 01:03:00`.
2. User 1 logs in at `2024-01-04 01:02:00`, starting the second session. Their next event is a login at `2024-01-05 01:01:00` which is under 24 hours; so the second session continues.
3. User 1 logs in at `2024-01-05 01:01:00`, continuing the second session. Their next event is a logout at `2024-01-06 01:00:00` which is under 24 hours; so the second session ends at `2024-01-06 01:00:00`.
4. User 1 logs out at `2024-01-06 01:00:00`, which we've already accounted for.
5. User 1 logs out at `2024-01-07 01:05:00`, which does nothing.
6. User 1 logs out at `2024-01-07 01:06:00`, which does nothing.
7. User 2 logs in at `2024-01-08 01:07:00`, starting the third session. Their next event is a login at `2024-01-09 01:08:00` which is under 24 hours; so the third session continues.
8. User 2 logs in at `2024-01-09 01:08:00`, continuing the third session. Their next event is a login at `2024-01-10 01:09:00` which is under 24 hours; so the third session continues.
9. User 2 logs in at `2024-01-10 01:09:00`, continuing the third session. Their next event is a logout at `2024-01-10 01:10:00` which is under 24 hours; so the third session ends at `2024-01-10 01:10:00`.
10. User 2 logs out at `2024-01-10 01:10:00`, which we've already accounted for.
11. User 2 logs out at `2024-01-11 01:11:00`, which does nothing.
12. User 2 logs out at `2024-01-12 01:12:00`, which does nothing.

This gives us the following sessions:

| user_id | login_date | logout_date |
| :------ | :------------------ | :------------------ |
| 1 | 2024-01-01 01:03:00 | 2024-01-02 01:03:00 |
| 1 | 2024-01-04 01:02:00 | 2024-01-06 01:00:00 |
| 2 | 2024-01-08 01:07:00 | 2024-01-09 01:07:00 |
| 2 | 2024-01-09 01:08:00 | 2024-01-10 01:08:00 |
| 2 | 2024-01-10 01:09:00 | 2024-01-10 01:10:00 |

The earliest and latest events in the `events` table are on `2024-01-01` and `2024-01-12` respectively, so the encoding is relative to the `2024-01-12`. We can plot the sessions on a timeline:

| active_date | user_1_is_active | user_2_is_active |
| :---------- | ---------------: | ---------------: |
| 2024-01-01 | 1 | 0 |
| 2024-01-02 | 1 | 0 |
| 2024-01-03 | 0 | 0 |
| 2024-01-04 | 1 | 0 |
| 2024-01-05 | 1 | 0 |
| 2024-01-06 | 1 | 0 |
| 2024-01-07 | 0 | 0 |
| 2024-01-08 | 0 | 1 |
| 2024-01-09 | 0 | 1 |
| 2024-01-10 | 0 | 1 |
| 2024-01-11 | 0 | 0 |
| 2024-01-12 | 0 | 0 |

Since the latest event is on `2024-01-12`, the encoding for `2024-01-12` uses 2<sup>0</sup>, the encoding for `2024-01-11` uses 2<sup>1</sup>, and so on. This gives us the following datelist integers:

- User 1: 2048 + 1024 + 256 + 128 + 64 = 3520
- User 2: 16 + 8 + 4 = 28

This file was deleted.

This file was deleted.

Original file line number Diff line number Diff line change
@@ -1,30 +1,40 @@
```sql
with recursive
with
event_groups as (
session_datetimes as (
select
event_id,
user_id,
event_type,
event_datetime as login_datetime,
(0
+ row_number() over (partition by user_id order by event_datetime)
- row_number() over (partition by user_id, event_type order by event_datetime)
) as event_group,
coalesce(
(
select min(event_datetime)
from events as innr
where 1=1
and events.event_type = 'login'
and innr.event_type = 'logout'
and events.user_id = innr.user_id
and events.event_datetime < innr.event_datetime
and innr.event_datetime > events.event_datetime
and innr.event_datetime <= events.event_datetime + interval '1 day'
and innr.event_type = 'logout'
),
events.event_datetime + interval '1 day'
) as logout_datetime
) as logout_datetime,
from events
where event_type = 'login'
),
event_groups as (
select
*,
sum(is_new_session::int) over (order by login_datetime) as session_id
from (
select
*,
login_datetime >= lag(logout_datetime, 1, login_datetime) over (
partition by user_id
order by login_datetime
) as is_new_session
from session_datetimes
)
),
sessions as (
Expand All @@ -33,14 +43,13 @@ sessions as (
min(login_datetime)::date as login_date,
max(logout_datetime)::date as logout_date
from event_groups
where event_type = 'login'
group by user_id, event_group
group by user_id, session_id
),
dates(active_date) as (
select unnest(generate_series(
(select min(login_date) from sessions),
(select max(logout_date) from sessions),
(select min(event_datetime)::date from events),
(select max(event_datetime)::date from events),
interval '1 day'
))
),
Expand All @@ -49,17 +58,20 @@ activity(user_id, active_date, is_active) as (
select
users.user_id,
dates.active_date::date as active_date,
if(sessions.user_id is null, 0, 1) as is_active,
exists(
select *
from sessions
where 1=1
and users.user_id = sessions.user_id
and dates.active_date between sessions.login_date
and sessions.logout_date
)::int as is_active,
row_number() over (
partition by users.user_id
order by dates.active_date desc
) as step
from (select distinct user_id from sessions) as users
cross join dates
left join sessions
on users.user_id = sessions.user_id
and dates.active_date between sessions.login_date
and sessions.logout_date
)
select
Expand Down
Loading

0 comments on commit 68013a4

Please sign in to comment.