Do you need to import CSV files into a database but no one gave you the entity–relationship model to set up the primary and foreign keys?
Say no more.
The code performs the following steps:
- Connects to the database with JDBC
- Reads metadata about the tables and columns in the specified schema
- Estimates the primary keys
- Estimates the foreign keys
- Returns SQL `ALTER TABLE` statements to the user
The algorithm works exclusively on the table and column metadata that is accessible over JDBC. As a result, it is fast (it never reads the actual data), works with any database out of the box (provided a corresponding JDBC driver is available), and its estimates are not affected by data quality, which can be either an advantage or a disadvantage.
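As an illustration of what "metadata only" means in practice, the following is a minimal sketch of reading column metadata through the standard `java.sql.DatabaseMetaData` interface. The class names `ColumnInfo` and `MetadataReader` are hypothetical and are not taken from this project. Some drivers (MySQL, for example) expose what they call a database as a JDBC catalog rather than a schema, so the arguments may need adjusting.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for per-column metadata; not part of this project's API.
final class ColumnInfo {
    final String table;
    final String column;
    final int sqlType;   // a java.sql.Types constant
    final int size;      // COLUMN_SIZE as reported by the driver
    final int position;  // 1-based ORDINAL_POSITION within the table

    ColumnInfo(String table, String column, int sqlType, int size, int position) {
        this.table = table;
        this.column = column;
        this.sqlType = sqlType;
        this.size = size;
        this.position = position;
    }
}

final class MetadataReader {
    // Collects column metadata for the given schema without fetching any table rows.
    static List<ColumnInfo> readColumns(Connection connection, String schema) throws SQLException {
        DatabaseMetaData meta = connection.getMetaData();
        List<ColumnInfo> columns = new ArrayList<>();
        // A null catalog means "do not filter by catalog"; "%" matches every table and column name.
        try (ResultSet rs = meta.getColumns(null, schema, "%", "%")) {
            while (rs.next()) {
                columns.add(new ColumnInfo(
                        rs.getString("TABLE_NAME"),
                        rs.getString("COLUMN_NAME"),
                        rs.getInt("DATA_TYPE"),
                        rs.getInt("COLUMN_SIZE"),
                        rs.getInt("ORDINAL_POSITION")));
            }
        }
        return columns;
    }
}
```

Note that `getColumns` may also return columns of views, so a real implementation would probably filter by table type first.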
Accuracy: The algorithm correctly identifies ~98% of primary keys and ~90% of foreign keys as measured on 70 different databases.
Limitation: The algorithm was designed to work on databases that use surrogate PKs. Detection of composite PKs (and FKs that use them) is not supported.
Primary keys are identified by:
- Column position in the table (PKs are commonly at the beginning)
- Data type (e.g. integers are preferred over doubles)
- Presence of a keyword like "id" in the column name
- Similarity of the column and table names as measured with Levenshtein distance
- Repetition of the column name in other tables
Once these features are collected, they are passed to logistic regression to estimate the probability that the column is the primary key. Since each table can have at most a single PK, the column with the highest probability in the table is declared to be the primary key of the table.
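The scoring and selection step described above can be sketched as follows. This is only an illustration: the three binary features, the bias, and the weights are made-up values chosen for the example, not the fitted coefficients of this project, and the names `PkCandidate` and `PrimaryKeySketch` are hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical candidate: one column together with its feature vector.
final class PkCandidate {
    final String table;
    final String column;
    final double[] features; // {isFirstColumn, isIntegerType, nameContainsId}

    PkCandidate(String table, String column, double[] features) {
        this.table = table;
        this.column = column;
        this.features = features;
    }
}

final class PrimaryKeySketch {
    // Illustrative coefficients only; real values would be fitted on labelled schemas.
    static final double BIAS = -2.0;
    static final double[] WEIGHTS = {2.0, 1.5, 1.5};

    // Logistic regression: sigmoid of the weighted feature sum.
    static double probability(double[] features) {
        double z = BIAS;
        for (int i = 0; i < WEIGHTS.length; i++) {
            z += WEIGHTS[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // For each table, keeps the column with the highest estimated probability.
    static Map<String, String> pickPrimaryKeys(List<PkCandidate> candidates) {
        Map<String, String> bestColumn = new LinkedHashMap<>();
        Map<String, Double> bestProbability = new HashMap<>();
        for (PkCandidate c : candidates) {
            double p = probability(c.features);
            Double current = bestProbability.get(c.table);
            if (current == null || p > current) {
                bestProbability.put(c.table, p);
                bestColumn.put(c.table, c.column);
            }
        }
        return bestColumn;
    }
}
```

Because the selection is an argmax per table, every table receives exactly one PK estimate in this sketch, even when all of its columns score poorly.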
Foreign keys are identified by:
- Known PKs (relationships must be between a PK column and a non-PK column)
- Data types (relationships must be between agreeing data types)
- Data type properties (e.g. data type sizes should agree)
- Similarity of the column names as measured with Levenshtein distance
- Dissimilarity between the FK column name and the name of the table that contains it
Once again, probabilities are estimated with logistic regression. Since the count of foreign keys per table is not limited, all predictions above a threshold are declared to be foreign key constraints.
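A minimal sketch of two pieces of this step: the Levenshtein distance used as a name-similarity feature, and the thresholding that turns scored candidates into `ALTER TABLE` statements. The names `FkCandidate` and `ForeignKeySketch`, the threshold value used in the usage note, and the exact statement format are assumptions made for illustration; identifier quoting and constraint naming are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scored candidate: a non-PK column that may reference a known PK.
final class FkCandidate {
    final String fkTable, fkColumn, pkTable, pkColumn;
    final double probability; // assumed to come from the logistic regression model

    FkCandidate(String fkTable, String fkColumn, String pkTable, String pkColumn, double probability) {
        this.fkTable = fkTable;
        this.fkColumn = fkColumn;
        this.pkTable = pkTable;
        this.pkColumn = pkColumn;
        this.probability = probability;
    }
}

final class ForeignKeySketch {
    // Dynamic-programming Levenshtein distance; insert, delete, and substitute each cost 1.
    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + cost);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    // Keeps every candidate above the threshold; a table may end up with several FKs.
    static List<String> alterStatements(List<FkCandidate> candidates, double threshold) {
        List<String> statements = new ArrayList<>();
        for (FkCandidate c : candidates) {
            if (c.probability > threshold) {
                statements.add("ALTER TABLE " + c.fkTable
                        + " ADD FOREIGN KEY (" + c.fkColumn + ")"
                        + " REFERENCES " + c.pkTable + " (" + c.pkColumn + ")");
            }
        }
        return statements;
    }
}
```

For example, with a threshold of 0.5, a candidate `orders.customer_id -> customers.customer_id` scored at 0.9 would be emitted, while one scored at 0.2 would be dropped. Unlike the PK step, nothing here forces a table to have any FK at all.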