[FEAT] read_sql #1943
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
##             main    #1943      +/-   ##
==========================================
- Coverage   84.67%   82.70%   -1.98%
==========================================
  Files          58       62       +4
  Lines        6363     6615     +252
==========================================
+ Hits         5388     5471      +83
- Misses        975     1144     +169
docs/source/api_docs/creation.rst
Outdated
Refactored our creation docs as a drive-by; before, we had arrow, pandas, and file paths in a separate in-memory section.
Nice work! Just some nits :)
rows = result.fetchall()
pydict = {column_name: [row[i] for row in rows] for i, column_name in enumerate(result.keys())}

return pa.Table.from_pydict(pydict)
You also want to pass the schema into from_pydict; otherwise we are relying on PyArrow's type inference.
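For illustration, a minimal standalone sketch of passing an explicit schema (the column names and types here are made up, not from the PR):

import pyarrow as pa

# Explicit schema instead of relying on PyArrow's inference; in the real code
# this would come from a mapping of DBAPI type codes (hypothetical for now).
pydict = {"id": [1, 2, 3], "name": ["a", "b", None]}
schema = pa.schema([("id", pa.int64()), ("name", pa.string())])
table = pa.Table.from_pydict(pydict, schema=schema)
assert table.schema == schema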
Made an issue to create type mappings from DBAPI type codes.
daft/sql/sql_scan.py
Outdated
pa_table = SQLReader(self.sql, self.url, limit=1).read()
schema = Schema.from_pyarrow_schema(pa_table.schema)
return schema
except Exception:
Avoid catch-all exception handling! What errors are you expecting here?
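A sketch of what narrowing this might look like; SQLReader and Schema are the PR's own classes, and RuntimeError is only a stand-in for whatever driver/dialect error is actually expected:

# Sketch only: catch the one failure mode we expect from the LIMIT 1 probe.
try:
    pa_table = SQLReader(self.sql, self.url, limit=1).read()
    return Schema.from_pyarrow_schema(pa_table.schema)
except RuntimeError:
    # stand-in exception type, e.g. the dialect lacks LIMIT support
    ...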
daft/sql/sql_scan.py
Outdated
schema = Schema.from_pyarrow_schema(pa_table.schema)
return schema
except Exception:
    # If limit fails, try to read the entire table
Prob should log a warning if running it again
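A minimal sketch of the warning-before-retry pattern; the helper name and the callables are hypothetical wrappers around the SQLReader calls shown above:

import logging

logger = logging.getLogger(__name__)

def infer_schema_with_fallback(read_limited, read_full):
    # Hypothetical helper: read_limited/read_full wrap the SQLReader calls
    # above; RuntimeError is a stand-in for the expected error type.
    try:
        return read_limited()
    except RuntimeError:
        logger.warning("Schema inference with LIMIT 1 failed; falling back to reading the entire table")
        return read_full()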
daft/sql/sql_scan.py
Outdated
self.url,
projection=["COUNT(*)"],
).read()
return pa_table.column(0)[0].as_py()
You should put some checks to ensure that there is 1 column and 1 row (raising an error otherwise) before indexing into it. Otherwise this would lead to hard-to-debug stack traces for the end user!
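Something like this sketch (error message wording is illustrative) would fail fast instead of surfacing an opaque IndexError:

pa_table = SQLReader(self.sql, self.url, projection=["COUNT(*)"]).read()
if pa_table.num_columns != 1 or pa_table.num_rows != 1:
    raise RuntimeError(
        f"Expected COUNT(*) to return exactly one cell, got "
        f"{pa_table.num_rows} rows x {pa_table.num_columns} columns"
    )
return pa_table.column(0)[0].as_py()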
daft/sql/sql_scan.py
Outdated
    for i, percentile in enumerate(percentiles)
],
).read()
bounds = [pa_table.column(i)[0].as_py() for i in range(num_scan_tasks - 1)]
Perform checks that raise errors to ensure the expected output shape before indexing into it.
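A sketch of the shape check for the percentile query above (message wording is illustrative):

# Validate the result shape before extracting the partition bounds.
expected_cols = num_scan_tasks - 1
if pa_table.num_rows != 1 or pa_table.num_columns != expected_cols:
    raise RuntimeError(
        f"Expected 1 row and {expected_cols} percentile columns, got "
        f"{pa_table.num_rows} rows x {pa_table.num_columns} columns"
    )
bounds = [pa_table.column(i)[0].as_py() for i in range(expected_cols)]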
Expr::Alias(inner, ..) => to_sql_inner(inner, buffer),
Expr::BinaryOp { op, left, right } => {
    to_sql_inner(left, buffer)?;
    let op = match op {
you can just write directly into the buffer rather than converting to a string first (causes heap allocation) and then writing to the buffer
    to_sql_inner(right, buffer)
}
Expr::Not(inner) => {
    write!(buffer, "NOT (")?;
You should be able to collapse this:
write!(buffer, "NOT ({})", to_sql_inner(inner, buffer)?)
    if_false,
    predicate,
} => {
    write!(buffer, "CASE WHEN ")?;
This should be able to be a single write! macro.
src/daft-dsl/src/lit.rs
Outdated
@@ -212,6 +212,36 @@ impl LiteralValue {
    };
    result
}

pub fn to_sql(&self) -> Option<String> {
No need to implement, but you could have this be display_sql and take in a formatter as well!
size_bytes,
metadata: num_rows.map(|n| TableMetadata { length: n as usize }),
partition_spec: None,
statistics: None,
if the table is partitioned by some column, we could leverage that for statistics
Will make an issue to do this as a follow-on.
Thanks! Addressed your feedback in the latest commit. I also added functionality to insert limit pushdowns into SQL, as well as some general refactoring.
🚀
Closes #1560

Adds a new read method: read_sql(sql: str, url: str), which executes a given SQL query against a given database URL and creates a DataFrame from the results (see the usage sketch below).

Drive-bys:

Features:
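For reference, a minimal usage sketch of the new method; the connection URL and query are placeholders, not from the PR:

import daft

# Placeholders: any supported connection URL and SQL query work here.
df = daft.read_sql(
    "SELECT * FROM lineitem WHERE l_quantity > 20",
    "sqlite:///tpch.db",
)
df.show()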