Rewrite the BFSStrategy using Pandas read_sql_query #25

neumannjan · 2023-09-26T14:57:10Z

The HeteroDataBuilder currently does the following:

- loads each table (in full) using pd.read_sql
- computes the edge_index for each relation using Pandas on top of the loaded tables
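For illustration, the two steps above could look roughly like the sketch below. This is not the actual HeteroDataBuilder code: the `customer`/`orders` tables, the in-memory SQLite database, and the `edge_index` helper are all hypothetical, and the merge-based mapping is just one way to do this in Pandas.

```python
import sqlite3

import numpy as np
import pandas as pd

# Hypothetical toy database: orders.customer_id references customer.id.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customer VALUES (10, 'a'), (20, 'b'), (30, 'c');
    INSERT INTO orders VALUES (1, 20), (2, 10), (3, 20), (4, 99);
    """
)

# Step 1: load each table in full.
customer = pd.read_sql("SELECT * FROM customer", conn)
orders = pd.read_sql("SELECT * FROM orders", conn)


def edge_index(src, src_fk, dst, dst_pk):
    """Return a 2 x E array of (src row position, dst row position) pairs."""
    left = pd.DataFrame({"key": src[src_fk], "src_pos": np.arange(len(src))})
    left = left.dropna(subset=["key"])  # rows with a NULL foreign key have no edge
    right = pd.DataFrame({"key": dst[dst_pk], "dst_pos": np.arange(len(dst))})
    # The inner join drops foreign keys that point at a missing row (99 here).
    pairs = left.merge(right, on="key", how="inner")
    return pairs[["src_pos", "dst_pos"]].to_numpy().T


# Step 2: compute the edge_index of the orders -> customer relation.
ei = edge_index(orders, "customer_id", customer, "id")
```

The result holds row positions rather than key values, which is the form a `torch.tensor(ei)` call would need to become an `edge_index` in PyTorch Geometric.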

This is really fast. Since Pandas also supports pd.read_sql_query for any SQL query built with SQLAlchemy, I propose rewriting BFSStrategy on top of Pandas as well. I expect the benefits to be speed (hopefully), cleaner code, and results that are more consistent with HeteroDataBuilder, since the new type converters already use Pandas, also for speed reasons.

I think the new BFSStrategy could work as follows:

1. load the target table (or a batch from the target table) using a single call to pd.read_sql
2. then load the joined tables the same way within the BFS
3. then compute the edge_index at the end, similarly to how I do it in HeteroDataBuilder (hopefully)
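A minimal sketch of the first two steps, assuming SQLAlchemy 1.4+ and a Pandas version compatible with it. The `bfs_load` and `neighbours` helpers, the toy schema, and the "visit each table once" simplification are all hypothetical and not part of the existing BFSStrategy. The edge_index step is left out; it would run on the returned DataFrames.

```python
from collections import deque

import pandas as pd
import sqlalchemy as sa


def neighbours(table, meta):
    """Yield (other_table, other_column, this_column) for each FK link, both directions."""
    for fk in table.foreign_keys:  # this table references another table
        yield fk.column.table, fk.column, fk.parent
    for other in meta.tables.values():  # another table references this one
        for fk in other.foreign_keys:
            if fk.column.table is table:
                yield other, fk.parent, fk.column


def bfs_load(conn, meta, target, pk_values, max_depth=2):
    """Load a batch of the target table, then walk FK links breadth-first."""
    pk = list(target.primary_key.columns)[0]  # assumes a single-column primary key
    query = sa.select(target).where(pk.in_(pk_values))
    frames = {target.name: pd.read_sql_query(query, conn)}
    queue = deque([(target, 0)])
    while queue:
        table, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for other, other_col, this_col in neighbours(table, meta):
            if other.name in frames:  # simplification: each table is loaded once
                continue
            keys = frames[table.name][this_col.name].dropna().unique().tolist()
            # One query per join, restricted to the rows reachable from the batch.
            query = sa.select(other).where(other_col.in_(keys))
            frames[other.name] = pd.read_sql_query(query, conn)
            queue.append((other, depth + 1))
    return frames


# Hypothetical schema: item -> orders -> customer.
meta = sa.MetaData()
customer = sa.Table("customer", meta, sa.Column("id", sa.Integer, primary_key=True))
orders = sa.Table(
    "orders", meta,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("customer_id", sa.ForeignKey("customer.id")),
)
item = sa.Table(
    "item", meta,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("order_id", sa.ForeignKey("orders.id")),
)

engine = sa.create_engine("sqlite://")  # in-memory database
with engine.begin() as conn:
    meta.create_all(conn)
    conn.execute(sa.insert(customer), [{"id": 10}, {"id": 20}, {"id": 30}])
    conn.execute(sa.insert(orders), [
        {"id": 1, "customer_id": 20},
        {"id": 2, "customer_id": 10},
        {"id": 3, "customer_id": 20},
    ])
    conn.execute(sa.insert(item), [
        {"id": 100, "order_id": 1},
        {"id": 101, "order_id": 2},
        {"id": 102, "order_id": 3},
        {"id": 103, "order_id": 3},
    ])
    frames = bfs_load(conn, meta, customer, [20])
```

Passing the keys through `IN (...)` keeps the sketch short, but large batches could exceed the database's bound-parameter limit, so a real implementation might prefer a join against the batch query or chunked key lists.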

Finally, we should probably merge HeteroDataBuilder with Dataset and find a clean way to expose the two approaches as different strategies for the dataset ("full strategy" vs "bfs strategy").
