Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

leverage ibis expression for getting readablerelations #2046

Open
wants to merge 13 commits into
base: devel
Choose a base branch
from

Conversation

sh-rp
Copy link
Collaborator

@sh-rp sh-rp commented Nov 10, 2024

Description

Implements support for using ibis expressions to generate sql statements for relations. To this end this PR implements a new type of readable relation which gets used that wraps an ibis unboundtable expression, but still accesses data the old way.

Todos:

  • Figure out typing

Copy link

netlify bot commented Nov 10, 2024

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit acd329f
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/6745ad7a7bf17200086e948e
😎 Deploy Preview https://deploy-preview-2046--dlt-hub-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@sh-rp sh-rp self-assigned this Nov 10, 2024
@sh-rp sh-rp linked an issue Nov 10, 2024 that may be closed by this pull request
@sh-rp sh-rp changed the title [Experiment] Leverage ibis expressions & sqlot do to the query building in our Readable Relations [Experiment] Leverage ibis expressions & sqlglot do to the query building in our Readable Relations Nov 10, 2024
@sh-rp
Copy link
Collaborator Author

sh-rp commented Nov 11, 2024

Another note to self, we probably need to run columns names through the normalizer. Or we assume the user will use normalized names as they are present in the schema when building these expressions.

df = items_table.df()
assert len(df.index) == total_records

df = double_items_table.df()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regular dlt dataset execution methods (df, arrow, iter_arrow...) work everywhere

Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really cool! IMO we should keep ibis as optional dependency (it only works with python 3.10+). so we have two options:

  • separate relation
  • enable the proxy behavior if ibis is found, if not we fallback to the current behavior

i'd probably go for the second option. I'm just a little bit worried about the typing

in both cases we should implement a few common expressions we already have in our existing relation (limit, head, column selection).

regarding the schema: column lineage you can do with sqlglot. it makes sense to invest a little bit of time to understand how it is done:

we can add sqlglot as a regular dependency. and use it everywhere we have sql SELECT statements.

@sh-rp sh-rp changed the title [Experiment] Leverage ibis expressions & sqlglot do to the query building in our Readable Relations leverage ibis expression for getting readablerelations Nov 13, 2024
@sh-rp sh-rp linked an issue Nov 19, 2024 that may be closed by this pull request
5 tasks

def select(self, *columns: str) -> "ReadableIbisRelation":
"""set which columns will be selected"""
return self._proxy_expression_method("select", *columns) # type: ignore
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the other ibis methods are not defined here yet. we'd need to add them to the interface and raise an error if they are called when the regular relation is returned I think. wdyt?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we just need to type dlt.dataset and pipeline.dataset properly (see my other comment). and create a Protocol for ibis expressions to type this class

@sh-rp sh-rp marked this pull request as ready for review November 26, 2024 11:23
@sh-rp sh-rp requested a review from rudolfix November 26, 2024 11:23
@sh-rp
Copy link
Collaborator Author

sh-rp commented Nov 26, 2024

@rudolfix the only thing still missing here is proper typing for this ibis wrapper. The "brute force" way of doing it would be to define all the methods of the ibis expression we can make use of on the SupportsReadableRelation Interface, forward those calls in the ibirelation like we do with limit etc, and raise on the regular relation. Alternatively I could do some kind of wildcard typing to make the linter shut up, but it would be less strict. Maybe you can give me your opinion, if you don't mind, then I'll just decide on one.

Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very cool! tldr;>

  • we can track schema changes in 90% of cases easily (or at least as good as in upstream object)
  • we can do better with typing but maybe in separate PR
  • tests need to be better
  • can we do without dynamic installation for ibis?

@@ -119,3 +137,37 @@ def create_ibis_backend(
con = ibis.duckdb.from_connection(duck)

return con


def create_unbound_ibis_table(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we move ibis module to helpers? you are importing modules that are not in common. so we really need to refer to ibis in common?

"dlt.destinations.filesystem": "duckdb", # works
"dlt.destinations.dremio": "presto", # works
# NOTE: can we discover the current dialect in sqlalchemy?
"dlt.destinations.sqlalchemy": "mysql", # may work
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by running configure() on factory with partial option. OFC we need the dialect part somewhere configured ie in partial connection string (without password)


def select(self, *columns: str) -> "ReadableIbisRelation":
"""set which columns will be selected"""
return self._proxy_expression_method("select", *columns) # type: ignore
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we just need to type dlt.dataset and pipeline.dataset properly (see my other comment). and create a Protocol for ibis expressions to type this class



# TODO: provide ibis expression typing for the readable relation
class ReadableIbisRelation(BaseReadableDBAPIRelation):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any protocol or base class in ibis that we can add as a base to get correct typing? otherwise we'd need to add all the methods ourselves.

from dlt.destinations.dataset.dataset import ReadableDBAPIDataset


def dataset(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want to improve typing in this PR? if so we could

  1. make SupportsReadableDataset generic where it is parametrized by relation type (bound to SupportsReadableRelation)
  2. use overload on the literal TDatasetType when in auto/ibis you return SupportsReadableDataset parametrized by ReadableIbisRelation, otherwise by just SupportsReadableRelation
  3. we can make (2) dynamic dependent if ibis is present


def __getitem__(self, columns: Union[str, Sequence[str]]) -> "ReadableIbisRelation":
# casefold column-names
columns = [self.sql_client.capabilities.casefold_identifier(col) for col in columns]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is column getter right? so you will receive ibis Column object(s) right? then maybe select them from the schema like we do in regular relation?

this will cover many cases already at make ibis backward compatible with base expression

def compute_columns_schema(self) -> TTableSchemaColumns:
"""provide schema columns for the cursor, may be filtered by selected columns"""
# TODO: provide column lineage tracing with sqlglot lineage
return None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we could try some simple implementation:

  1. we preserve table schema until first expression is made
  2. we preserve table schema for head limit and select method
  3. maybe we also recognize a few exceptions that do not modify schema ie filter and free form filter expressions as here: https://ibis-project.org/tutorials/ibis-for-pandas-users#filtering-rows

if attr is None:
raise AttributeError(f"'{self.__class__.__name__}' object has no attribute '{name}'")

if not callable(attr):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ibis supports a column getter with a dot notation. my take would be to

  • start supporting it upstream
  • modify the schema accordingly

# ensure ibis is installed for these tests
import subprocess

subprocess.check_call(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is hardcore. arent we able to use dependency group for ibis?

assert len(df.index) == 5

# check chained expression with join, column selection, order by and limit
joined_table = (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you do not need to test all the expressions but you should test various expression forms:

  1. those that return Table (like here)
  2. those that select a column
  3. a single column getter with . notation
  4. a free form filters:
expr = penguins.bill_length_mm > 37.0
  1. materializations like their dataframe and arrow getters
  2. there are also expressions that add and remove columns to schema
  3. expressions that return Expr (not Table - those do some weird things, I'm not sure SQL can be generated for them)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants