Merge pull request #20

Refine SQL questions and add one more
Bilbottom · Jun 10, 2024 · 56d8447 · 56d8447
2 parents 65eedfb + c9c5790
commit 56d8447
Show file tree

Hide file tree

Showing 14 changed files with 273 additions and 22 deletions.
diff --git a/README.md b/README.md
@@ -6,9 +6,9 @@
 [![GitHub last commit](https://img.shields.io/github/last-commit/Bilbottom/sql-learning-materials)](https://shields.io/badges/git-hub-last-commit-by-committer)
 
 [![SQL Server](https://img.shields.io/badge/SQL%20Server-2022-teal.svg)](https://www.microsoft.com/en-gb/sql-server/sql-server-downloads)
-[![PostgreSQL](https://img.shields.io/badge/PostgreSQL-15.4-teal.svg)](https://www.postgresql.org/download/)
-[![SQLite](https://img.shields.io/badge/SQLite-3.43-teal.svg)](https://www.sqlite.org/index.html)
-[![DuckDB](https://img.shields.io/badge/DuckDB-0.9-teal.svg)](https://duckdb.org/)
+[![PostgreSQL](https://img.shields.io/badge/PostgreSQL-16.2-teal.svg)](https://www.postgresql.org/download/)
+[![SQLite](https://img.shields.io/badge/SQLite-3.45-teal.svg)](https://www.sqlite.org/index.html)
+[![DuckDB](https://img.shields.io/badge/DuckDB-1.0-teal.svg)](https://duckdb.org/)
 [![Metabase](https://img.shields.io/badge/Metabase-0.47-teal.svg)](https://www.metabase.com/)
 
 </div>
@@ -21,7 +21,7 @@ SQL scripts that demonstrate various features and concepts.
 
 This project contains a bunch of SQL learning materials aimed at different levels of experience and covering a variety of topics. It focuses on just writing `SELECT` statements so there will be very few resources for anything else.
 
-Jump into [`docs/index.md`](docs/index.md) to see the summary of what's covered in this project, and continue below for instructions on how to set up the databases.
+Jump into [https://bilbottom.github.io/sql-learning-materials/](https://bilbottom.github.io/sql-learning-materials/) to see the summary of what's covered in this project, and continue below for instructions on how to set up the databases.
 
 ## Acknowledgements
 

diff --git a/docs/challenging-sql-problems/challenging-sql-problems.md b/docs/challenging-sql-problems/challenging-sql-problems.md
@@ -4,6 +4,12 @@
 >
 > These questions are not for people new to SQL! These expect you to use advanced SQL techniques that most people don't know.
 
+> [!NOTE]
+>
+> The database versions used for the solutions can be found at the top of the following page:
+>
+> - https://github.com/Bilbottom/sql-learning-materials/blob/main/README.md
+
 ## Problems
 
 ### 🟤 Bronze Tier
@@ -24,7 +30,8 @@ These require a bit more thinking.
 2. [Bannable login activity](problems/silver/bannable-login-activity.md)
 3. [Bus routes](problems/silver/bus-routes.md)
 4. [Region precipitation](problems/silver/region-precipitation.md)
-5. [Customer sales running totals](problems/silver/customer-sales-running-totals.md)
+5. [Predicting values](problems/silver/predicting-values.md)
+6. [Customer sales running totals](problems/silver/customer-sales-running-totals.md)
 
 ### 🟡 Gold Tier
 

diff --git a/docs/challenging-sql-problems/problems/silver/funnel-analytics.md b/docs/challenging-sql-problems/problems/silver/funnel-analytics.md
@@ -47,17 +47,17 @@ The solution can be found at:
 <!-- prettier-ignore -->
 >? INFO: **Sample output**
 >
-| cohort  | stage                | mortgages | step_rate | total_rate |
-|:--------|:---------------------|----------:|----------:|-----------:|
-| 2024-01 | full application     |         4 |    100.00 |     100.00 |
-| 2024-01 | decision             |         4 |    100.00 |     100.00 |
-| 2024-01 | documentation        |         3 |     75.00 |      75.00 |
-| 2024-01 | valuation inspection |         3 |    100.00 |      75.00 |
-| 2024-01 | valuation made       |         3 |    100.00 |      75.00 |
-| 2024-01 | valuation submitted  |         3 |    100.00 |      75.00 |
-| 2024-01 | solicitation         |         1 |     33.33 |      25.00 |
-| 2024-01 | funds released       |         1 |    100.00 |      25.00 |
-| ...     | ...                  |       ... |       ... |        ... |
+> | cohort  | stage                | mortgages | step_rate | total_rate |
+> |:--------|:---------------------|----------:|----------:|-----------:|
+> | 2024-01 | full application     |         4 |    100.00 |     100.00 |
+> | 2024-01 | decision             |         4 |    100.00 |     100.00 |
+> | 2024-01 | documentation        |         3 |     75.00 |      75.00 |
+> | 2024-01 | valuation inspection |         3 |    100.00 |      75.00 |
+> | 2024-01 | valuation made       |         3 |    100.00 |      75.00 |
+> | 2024-01 | valuation submitted  |         3 |    100.00 |      75.00 |
+> | 2024-01 | solicitation         |         1 |     33.33 |      25.00 |
+> | 2024-01 | funds released       |         1 |    100.00 |      25.00 |
+> | ...     | ...                  |       ... |       ... |        ... |
 
 <!-- prettier-ignore -->
 >? TIP: **Hint 1**

diff --git a/docs/challenging-sql-problems/problems/silver/predicting-values.md b/docs/challenging-sql-problems/problems/silver/predicting-values.md
@@ -0,0 +1,52 @@
+# Predicting values 🎱
+
+> [!SUCCESS] Scenario
+>
+> Some students are studying [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) and have been asked to predict the `y` values for a given set of `x` values for each of the four datasets using [linear regression](https://en.wikipedia.org/wiki/Linear_regression).
+
+> [!QUESTION]
+>
+> For each of the four datasets in Anscombe's quartet, use linear regression to predict the `y` values for `x` values `16`, `17`, and `18`.
+>
+> The output should have a row for each `x` value (`16`, `17`, `18`), with the columns:
+>
+> - `x`
+> - `dataset_1` as the predicted value for dataset 1, rounded to 1 decimal place
+> - `dataset_2` as the predicted value for dataset 2, rounded to 1 decimal place
+> - `dataset_3` as the predicted value for dataset 3, rounded to 1 decimal place
+> - `dataset_4` as the predicted value for dataset 4, rounded to 1 decimal place
+>
+> Order the output by `x`.
+
+<details>
+<summary>Expand for the DDL</summary>
+--8<-- "docs/challenging-sql-problems/problems/silver/predicting-values.sql"
+</details>
+
+There are plenty of resources online that walk through the maths behind linear regression, such as:
+
+- [https://www.youtube.com/watch?v=GAmzwIkGFgE](https://www.youtube.com/watch?v=GAmzwIkGFgE)
+
+The solution can be found at:
+
+- [predicting-values.md](../../solutions/silver/predicting-values.md)
+
+---
+
+<!-- prettier-ignore -->
+>? INFO: **Sample output**
+>
+> |   x | dataset_1 | dataset_2 | dataset_3 | dataset_4 |
+> |----:|----------:|----------:|----------:|----------:|
+> |  16 |      11.0 |      11.0 |      11.0 |      11.0 |
+> | ... |       ... |       ... |       ... |       ... |
+
+<!-- prettier-ignore -->
+>? TIP: **Hint 1**
+>
+> Unpivot the datasets so that you have a table with headers `dataset`, `x`, and `y`, then apply the linear regression, and finally pivot the results back.
+
+<!-- prettier-ignore -->
+>? TIP: **Hint 2**
+>
+> For databases that support them, use the `regr_slope` and `regr_intercept` functions (or equivalent) to calculate the slope and intercept of the regression line. Otherwise, you'll need to calculate these manually 😄
diff --git a/docs/challenging-sql-problems/problems/silver/predicting-values.sql b/docs/challenging-sql-problems/problems/silver/predicting-values.sql
@@ -0,0 +1,26 @@
+```sql
+create table anscombes_quartet (
+    dataset_1__x int,
+    dataset_1__y decimal(5, 2),
+    dataset_2__x int,
+    dataset_2__y decimal(5, 2),
+    dataset_3__x int,
+    dataset_3__y decimal(5, 2),
+    dataset_4__x int,
+    dataset_4__y decimal(5, 2),
+);
+insert into anscombes_quartet
+values
+    (10,  8.04, 10, 9.14, 10,  7.46,  8,  6.58),
+    ( 8,  6.95,  8, 8.14,  8,  6.77,  8,  5.76),
+    (13,  7.58, 13, 8.74, 13, 12.74,  8,  7.71),
+    ( 9,  8.81,  9, 8.77,  9,  7.11,  8,  8.84),
+    (11,  8.33, 11, 9.26, 11,  7.81,  8,  8.47),
+    (14,  9.96, 14, 8.10, 14,  8.84,  8,  7.04),
+    ( 6,  7.24,  6, 6.13,  6,  6.08,  8,  5.25),
+    ( 4,  4.26,  4, 3.10,  4,  5.39, 19, 12.50),
+    (12, 10.84, 12, 9.13, 12,  8.15,  8,  5.56),
+    ( 7,  4.82,  7, 7.26,  7,  6.42,  8,  7.91),
+    ( 5,  5.68,  5, 4.74,  5,  5.73,  8,  6.89)
+;
+```
diff --git a/docs/challenging-sql-problems/solutions/bronze/personalised-customer-emails.md b/docs/challenging-sql-problems/solutions/bronze/personalised-customer-emails.md
@@ -35,4 +35,6 @@ Some SQL solutions per database are provided below.
 <!-- prettier-ignore -->
 > SUCCESS: **SQL Server**
 >
+> This SQL Server solution uses the Soundex differences with a 3 (out of 4) match threshold, but this isn't the only way to solve this problem.
+>
 --8<-- "docs/challenging-sql-problems/solutions/bronze/personalised-customer-emails--sql-server.sql"
diff --git a/docs/challenging-sql-problems/solutions/bronze/temperature-anomaly-detection--duckdb.sql b/docs/challenging-sql-problems/solutions/bronze/temperature-anomaly-detection--duckdb.sql
@@ -4,8 +4,7 @@ with temperatures as (
         site_id,
         reading_datetime,
         temperature,
-        avg(temperature) over rows_around_site_reading as average_temperature,
-        count(temperature) over rows_around_site_reading as count_of_rows
+        avg(temperature) over rows_around_site_reading as average_temperature
     from readings
     window rows_around_site_reading as (
         partition by site_id
@@ -14,6 +13,7 @@ with temperatures as (
                  and 2 following
              exclude current row
     )
+    qualify 4 = count(*) over rows_around_site_reading
 )
 
 select
@@ -23,9 +23,7 @@ select
     round(average_temperature, 4) as average_temperature,
     round(100.0 * (temperature - average_temperature) / average_temperature, 4) as percentage_increase
 from temperatures
-where 1=1
-    and count_of_rows = 4
-    and (temperature - average_temperature) / average_temperature > 0.1
+where percentage_increase > 10
 order by
     site_id,
     reading_datetime

diff --git a/docs/challenging-sql-problems/solutions/bronze/temperature-anomaly-detection--sqlite.sql b/docs/challenging-sql-problems/solutions/bronze/temperature-anomaly-detection--sqlite.sql
@@ -0,0 +1,32 @@
+```sql
+with temperatures as (
+    select
+        site_id,
+        reading_datetime,
+        temperature,
+        avg(temperature) over rows_around_site_reading as average_temperature,
+        count(*) over rows_around_site_reading as count_of_rows
+    from readings
+    window rows_around_site_reading as (
+        partition by site_id
+        order by reading_datetime
+        rows between 2 preceding
+                 and 2 following
+             exclude current row
+    )
+)
+
+select
+    site_id,
+    reading_datetime,
+    temperature,
+    round(average_temperature, 4) as average_temperature,
+    round(100.0 * (temperature - average_temperature) / average_temperature, 4) as percentage_increase
+from temperatures
+where 1=1
+    and count_of_rows = 4
+    and (temperature - average_temperature) / average_temperature > 0.1
+order by
+    site_id,
+    reading_datetime
+```
diff --git a/docs/challenging-sql-problems/solutions/bronze/temperature-anomaly-detection.md b/docs/challenging-sql-problems/solutions/bronze/temperature-anomaly-detection.md
@@ -26,10 +26,15 @@ Regardless of the database, the result set should look like:
 Some SQL solutions per database are provided below.
 
 <!-- prettier-ignore -->
-> SUCCESS: **DuckDB, SQLite, PostgreSQL**
+> SUCCESS: **DuckDB**
 >
 --8<-- "docs/challenging-sql-problems/solutions/bronze/temperature-anomaly-detection--duckdb.sql"
 
+<!-- prettier-ignore -->
+> SUCCESS: **SQLite, PostgreSQL**
+>
+--8<-- "docs/challenging-sql-problems/solutions/bronze/temperature-anomaly-detection--sqlite.sql"
+
 <!-- prettier-ignore -->
 > SUCCESS: **Snowflake**
 >

diff --git a/docs/challenging-sql-problems/solutions/silver/predicting-values--duckdb--manual.sql b/docs/challenging-sql-problems/solutions/silver/predicting-values--duckdb--manual.sql
@@ -0,0 +1,42 @@
+```sql
+with
+
+unpivoted as (
+    unpivot anscombes_quartet
+    on
+        (dataset_1__x, dataset_1__y) as dataset_1,
+        (dataset_2__x, dataset_2__y) as dataset_2,
+        (dataset_3__x, dataset_3__y) as dataset_3,
+        (dataset_4__x, dataset_4__y) as dataset_4,
+    into
+        name dataset
+        value x, y
+),
+
+coefficients as (
+    select
+        dataset,
+        avg(x) as avg_x,
+        avg(y) as avg_y,
+        avg(x * x) as avg_xx,
+        avg(x * y) as avg_xy,
+        (avg_x * avg_y - avg_xy) / (avg_x * avg_x - avg_xx) as m,
+        avg_y - m * avg_x as c,
+    from unpivoted
+    group by dataset
+),
+
+predictions as (
+    select
+        dataset,
+        x,
+        round(m * x + c, 1) as y
+    from coefficients
+        cross join (values (16), (17), (18)) as v(x)
+)
+
+pivot predictions
+on dataset
+using any_value(y)
+order by x
+```
diff --git a/docs/challenging-sql-problems/solutions/silver/predicting-values--duckdb--regr.sql b/docs/challenging-sql-problems/solutions/silver/predicting-values--duckdb--regr.sql
@@ -0,0 +1,38 @@
+```sql
+with
+
+unpivoted as (
+    unpivot anscombes_quartet
+    on
+        (dataset_1__x, dataset_1__y) as dataset_1,
+        (dataset_2__x, dataset_2__y) as dataset_2,
+        (dataset_3__x, dataset_3__y) as dataset_3,
+        (dataset_4__x, dataset_4__y) as dataset_4,
+    into
+        name dataset
+        value x, y
+),
+
+coefficients as (
+    select
+        dataset,
+        regr_slope(y, x) as m,
+        regr_intercept(y, x) as c,
+    from unpivoted
+    group by dataset
+),
+
+predictions as (
+    select
+        dataset,
+        x,
+        round(m * x + c, 1) as y
+    from coefficients
+        cross join (values (16), (17), (18)) as v(x)
+)
+
+pivot predictions
+on dataset
+using any_value(y)
+order by x
+```
diff --git a/docs/challenging-sql-problems/solutions/silver/predicting-values.md b/docs/challenging-sql-problems/solutions/silver/predicting-values.md
@@ -0,0 +1,39 @@
+# Predicting values 🎱
+
+> [!TIP]
+>
+> Solution to the following problem:
+>
+> - [predicting-values.md](../../problems/silver/predicting-values.md)
+
+## Result Set
+
+Regardless of the database, the result set should look like:
+
+|   x | dataset_1 | dataset_2 | dataset_3 | dataset_4 |
+| --: | --------: | --------: | --------: | --------: |
+|  16 |      11.0 |      11.0 |      11.0 |      11.0 |
+|  17 |      11.5 |      11.5 |      11.5 |      11.5 |
+|  18 |      12.0 |      12.0 |      12.0 |      12.0 |
+
+This is one of the interesting things about Anscombe's quartet (and is the reason Anscombe created it): the four datasets have the same summary statistics, but look very different when plotted!
+
+<details>
+<summary>Expand for the DDL</summary>
+--8<-- "docs/challenging-sql-problems/solutions/silver/predicting-values.sql"
+</details>
+
+## Solution
+
+Some SQL solutions per database are provided below.
+
+<!-- prettier-ignore -->
+> SUCCESS: **DuckDB**
+>
+> Here's a solution using the `regr_slope` and `regr_intercept` functions:
+>
+--8<-- "docs/challenging-sql-problems/solutions/silver/predicting-values--duckdb--regr.sql"
+>
+> ...and one doing this manually:
+>
+--8<-- "docs/challenging-sql-problems/solutions/silver/predicting-values--duckdb--manual.sql"
diff --git a/docs/challenging-sql-problems/solutions/silver/predicting-values.sql b/docs/challenging-sql-problems/solutions/silver/predicting-values.sql
@@ -0,0 +1,8 @@
+```sql
+select *
+from values
+    (16, 11.0, 11.0, 11.0, 11.0),
+    (17, 11.5, 11.5, 11.5, 11.5),
+    (18, 12.0, 12.0, 12.0, 12.0)
+as solution(x, dataset_1, dataset_2, dataset_3, dataset_4)
+```
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -85,6 +85,7 @@ nav:
               - challenging-sql-problems/problems/silver/bannable-login-activity.md #  window functions (gaps and islands, range between)
               - challenging-sql-problems/problems/silver/bus-routes.md #  recursive CTE
               - challenging-sql-problems/problems/silver/region-precipitation.md #  unpivot and rollup
+              - challenging-sql-problems/problems/silver/predicting-values.md #  unpivot, regr, and pivot
               - challenging-sql-problems/problems/silver/customer-sales-running-totals.md #  window functions
           - 🟡 Gold Tier:
               - challenging-sql-problems/problems/gold/loan-repayment-schedule.md #  recursive CTE
@@ -101,6 +102,7 @@ nav:
               - challenging-sql-problems/solutions/silver/bannable-login-activity.md
               - challenging-sql-problems/solutions/silver/bus-routes.md
               - challenging-sql-problems/solutions/silver/region-precipitation.md
+              - challenging-sql-problems/solutions/silver/predicting-values.md
               - challenging-sql-problems/solutions/silver/customer-sales-running-totals.md
           - 🟡 Gold Tier:
               - challenging-sql-problems/solutions/gold/loan-repayment-schedule.md