[FEAT] read_sql #1943
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
##             main    #1943      +/-   ##
==========================================
- Coverage   84.67%   82.70%   -1.98%
==========================================
  Files          58       62       +4
  Lines        6363     6615     +252
==========================================
+ Hits         5388     5471      +83
- Misses        975     1144     +169
docs/source/api_docs/creation.rst
Outdated
Refactored our creation docs as a drive-by; before, we had arrow, pandas, and file paths in a separate in-memory section.
Nice work! Just some nits :)
rows = result.fetchall()
pydict = {column_name: [row[i] for row in rows] for i, column_name in enumerate(result.keys())}

return pa.Table.from_pydict(pydict)
You also want to pass the schema into from_pydict; otherwise we are relying on PyArrow's type inference.
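For illustration, a minimal standalone sketch of passing an explicit schema (the column names and types here are made up, not from the PR):

import pyarrow as pa

# Explicit schema instead of relying on PyArrow's inference; in the real code
# this would come from a mapping of DBAPI type codes (hypothetical for now).
pydict = {"id": [1, 2, 3], "name": ["a", "b", None]}
schema = pa.schema([("id", pa.int64()), ("name", pa.string())])
table = pa.Table.from_pydict(pydict, schema=schema)
assert table.schema == schema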
Made an issue to create type mappings from DBAPI type codes.
daft/sql/sql_scan.py
Outdated
pa_table = SQLReader(self.sql, self.url, limit=1).read()
schema = Schema.from_pyarrow_schema(pa_table.schema)
return schema
except Exception:
Avoid catch-all exception handling! What errors are you expecting here?
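A sketch of what narrowing this might look like; SQLReader and Schema are the PR's own classes, and RuntimeError is only a stand-in for whatever driver/dialect error is actually expected:

# Sketch only: catch the one failure mode we expect from the LIMIT 1 probe.
try:
    pa_table = SQLReader(self.sql, self.url, limit=1).read()
    return Schema.from_pyarrow_schema(pa_table.schema)
except RuntimeError:
    # stand-in exception type, e.g. the dialect lacks LIMIT support
    ...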
daft/sql/sql_scan.py
Outdated
schema = Schema.from_pyarrow_schema(pa_table.schema)
return schema
except Exception:
    # If limit fails, try to read the entire table
Prob should log a warning if running it again
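A minimal sketch of the warning-before-retry pattern; the helper name and the callables are hypothetical wrappers around the SQLReader calls shown above:

import logging

logger = logging.getLogger(__name__)

def infer_schema_with_fallback(read_limited, read_full):
    # Hypothetical helper: read_limited/read_full wrap the SQLReader calls
    # above; RuntimeError is a stand-in for the expected error type.
    try:
        return read_limited()
    except RuntimeError:
        logger.warning("Schema inference with LIMIT 1 failed; falling back to reading the entire table")
        return read_full()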
daft/sql/sql_scan.py
Outdated
self.url,
projection=["COUNT(*)"],
).read()
return pa_table.column(0)[0].as_py()
You should put some checks to ensure that there is 1 column and 1 row (raising an error otherwise) before indexing into it. Otherwise this would lead to hard-to-debug stack traces for the end user!
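Something like this sketch (error message wording is illustrative) would fail fast instead of surfacing an opaque IndexError:

pa_table = SQLReader(self.sql, self.url, projection=["COUNT(*)"]).read()
if pa_table.num_columns != 1 or pa_table.num_rows != 1:
    raise RuntimeError(
        f"Expected COUNT(*) to return exactly one cell, got "
        f"{pa_table.num_rows} rows x {pa_table.num_columns} columns"
    )
return pa_table.column(0)[0].as_py()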
daft/sql/sql_scan.py
Outdated
    for i, percentile in enumerate(percentiles)
],
).read()
bounds = [pa_table.column(i)[0].as_py() for i in range(num_scan_tasks - 1)]
Perform checks that raise errors to ensure the expected output shape before indexing into it.
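A sketch of the shape check for the percentile query above (message wording is illustrative):

# Validate the result shape before extracting the partition bounds.
expected_cols = num_scan_tasks - 1
if pa_table.num_rows != 1 or pa_table.num_columns != expected_cols:
    raise RuntimeError(
        f"Expected 1 row and {expected_cols} percentile columns, got "
        f"{pa_table.num_rows} rows x {pa_table.num_columns} columns"
    )
bounds = [pa_table.column(i)[0].as_py() for i in range(expected_cols)]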
Expr::Alias(inner, ..) => to_sql_inner(inner, buffer),
Expr::BinaryOp { op, left, right } => {
    to_sql_inner(left, buffer)?;
    let op = match op {
you can just write directly into the buffer rather than converting to a string first (causes heap allocation) and then writing to the buffer
    to_sql_inner(right, buffer)
}
Expr::Not(inner) => {
    write!(buffer, "NOT (")?;
You should be able to collapse this:
write!(buffer, "NOT ({})", to_sql_inner(inner, buffer)?)
    if_false,
    predicate,
} => {
    write!(buffer, "CASE WHEN ")?;
This should be able to be a single write! macro.
src/daft-dsl/src/lit.rs
Outdated
@@ -212,6 +212,36 @@ impl LiteralValue {
    };
    result
}

pub fn to_sql(&self) -> Option<String> {
No need to implement, but you could have this be display_sql and take in a formatter as well!
size_bytes,
metadata: num_rows.map(|n| TableMetadata { length: n as usize }),
partition_spec: None,
statistics: None,
if the table is partitioned by some column, we could leverage that for statistics
Will make an issue to do this as a follow-on.
Thanks! Addressed your feedback in the latest commit. I also added functionality to insert limit pushdowns into SQL, as well as some general refactoring.
🚀
Closes #1560

Adds a new read method: read_sql(sql: str, url: str), which executes a given SQL query against a given database URL and creates a DataFrame from the results (see the usage sketch below).

Drive-bys:

Features:
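For reference, a minimal usage sketch of the new method; the connection URL and query are placeholders, not from the PR:

import daft

# Placeholders: any supported connection URL and SQL query work here.
df = daft.read_sql(
    "SELECT * FROM lineitem WHERE l_quantity > 20",
    "sqlite:///tpch.db",
)
df.show()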