Rewrite the BFSStrategy using Pandas read_sql_query #25

Open
neumannjan opened this issue Sep 26, 2023 · 0 comments

The HeteroDataBuilder currently does the following:

  • loads each table (in full) using pd.read_sql
  • computes the edge_index for each relation using Pandas on top of the loaded tables
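
A minimal sketch of these two steps, not the actual HeteroDataBuilder code. The `customer`/`orders` schema and the `edge_index` helper are made up for illustration, and the stdlib `sqlite3` driver with raw SQL strings stands in for the SQLAlchemy setup so that the example stays self-contained:

```python
import sqlite3

import numpy as np
import pandas as pd

# Hypothetical toy schema: orders.customer_id references customer.id.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customer VALUES (10, 'a'), (20, 'b'), (30, 'c');
    INSERT INTO orders VALUES (1, 20), (2, 10), (3, 20);
    """
)

# Step 1: load each table in full.
customer = pd.read_sql("SELECT * FROM customer ORDER BY id", conn)
orders = pd.read_sql("SELECT * FROM orders ORDER BY id", conn)


def edge_index(src: pd.DataFrame, fk: str, dst: pd.DataFrame, pk: str) -> np.ndarray:
    """Return a 2 x E array of (source row, destination row) positions."""
    # Map every primary-key value of `dst` to its row position.
    dst_pos = pd.Series(np.arange(len(dst)), index=dst[pk])
    # Look up the row position for each foreign-key value (NaN if unmatched).
    matched = dst_pos.reindex(src[fk])
    mask = matched.notna().to_numpy()
    src_idx = np.arange(len(src))[mask]
    dst_idx = matched.to_numpy()[mask].astype(np.int64)
    return np.stack([src_idx, dst_idx])


# Step 2: compute the edge_index for the orders -> customer relation.
orders_to_customer = edge_index(orders, "customer_id", customer, "id")
```

Here each column of `orders_to_customer` pairs the row position of an order with the row position of the customer it references; rows with a missing or dangling foreign key are dropped.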

This is really fast. Since Pandas also supports pd.read_sql_query for any SQL query built using SQLAlchemy, I propose rewriting BFSStrategy on top of Pandas as well. I expect the benefits to be speed (hopefully), cleaner code, and results that are more consistent with HeteroDataBuilder (the new type converters already use Pandas too, also for speed reasons).

I think the new BFSStrategy could work as follows:

  • load the target table (or a batch from the target table) using a single call to pd.read_sql
  • then load the joined tables the same way, one query per relation followed during the BFS
  • then compute the edge_index at the end, similarly to how HeteroDataBuilder does it (hopefully)
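
A rough sketch of how the loading part of such a BFS could look, under several assumptions: the `RELATIONS` and `PRIMARY_KEYS` tables, the `fetch_where_in` and `bfs_load` helpers, and the toy schema are all hypothetical, each table is visited at most once, and stdlib `sqlite3` with raw SQL is used instead of SQLAlchemy queries to keep the example self-contained. The edge_index computation would then run on the returned DataFrames.

```python
import sqlite3
from collections import deque

import pandas as pd

# Hypothetical schema description: (table, fk column, referenced table, pk column).
RELATIONS = [
    ("orders", "customer_id", "customer", "id"),
    ("item", "order_id", "orders", "id"),
]
PRIMARY_KEYS = {"customer": "id", "orders": "id", "item": "id"}


def fetch_where_in(conn, table, column, values):
    """Load the rows of `table` whose `column` is in `values` with one query."""
    pk = PRIMARY_KEYS[table]
    if not values:
        # No keys to follow: return an empty frame with the right columns.
        return pd.read_sql_query(f"SELECT * FROM {table} WHERE 0", conn)
    placeholders = ",".join("?" * len(values))
    sql = f"SELECT * FROM {table} WHERE {column} IN ({placeholders}) ORDER BY {pk}"
    return pd.read_sql_query(sql, conn, params=values)


def bfs_load(conn, target, target_ids, max_depth):
    """Load a batch of the target table, then follow relations breadth-first."""
    loaded = {target: fetch_where_in(conn, target, PRIMARY_KEYS[target], target_ids)}
    queue = deque([(target, 0)])
    while queue:
        table, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for src, fk, dst, dst_pk in RELATIONS:
            if src == table and dst not in loaded:
                # Follow the foreign key to the referenced table.
                keys = loaded[table][fk].dropna().unique().tolist()
                loaded[dst] = fetch_where_in(conn, dst, dst_pk, keys)
                queue.append((dst, depth + 1))
            elif dst == table and src not in loaded:
                # Follow the relation backwards to the referencing table.
                keys = loaded[table][dst_pk].unique().tolist()
                loaded[src] = fetch_where_in(conn, src, fk, keys)
                queue.append((src, depth + 1))
    return loaded


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE item (id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO customer VALUES (10, 'a'), (20, 'b'), (30, 'c');
    INSERT INTO orders VALUES (1, 20), (2, 10), (3, 20), (4, 30);
    INSERT INTO item VALUES (100, 1), (101, 3), (102, 2), (103, 4);
    """
)

# Batch of one target row (customer 20), following relations two hops deep.
tables = bfs_load(conn, "customer", [20], max_depth=2)
```

Each relation costs exactly one query per batch, which is what makes this shape attractive compared with issuing queries row by row. A real implementation would also need to handle tables reachable through several paths and very large key lists, which this sketch ignores.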

Finally, we should probably merge HeteroDataBuilder with Dataset and find a clean way to expose the two approaches as different strategies for the dataset ("full strategy" vs "bfs strategy").
