Estimate Primary & Foreign Keys in a Database

Do you need to import CSV files into a database but no one gave you the entity–relationship model to set up the primary and foreign keys?

Say no more.

Usage

The code performs the following steps:

Connects to the database with JDBC
Reads metadata about the tables and columns in the specified schema
Estimates the primary keys
Estimates the foreign keys
Returns SQL ALTER statements to the user
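
As an illustration of the last step, here is a minimal sketch of how estimated keys could be turned into ALTER statements. The class name `AlterStatements`, its method names, and the example table and column names are hypothetical and not taken from this project; the exact SQL that the tool emits may differ, and identifier quoting varies between databases.

```java
final class AlterStatements {
    // Quote an identifier with ANSI double quotes (assumption: some
    // databases expect backticks or square brackets instead).
    static String quote(String identifier) {
        return "\"" + identifier.replace("\"", "\"\"") + "\"";
    }

    // Statement declaring a single-column (surrogate) primary key.
    static String primaryKeySql(String table, String column) {
        return "ALTER TABLE " + quote(table)
                + " ADD PRIMARY KEY (" + quote(column) + ");";
    }

    // Statement declaring a single-column foreign key that references a primary key.
    static String foreignKeySql(String fkTable, String fkColumn,
                                String pkTable, String pkColumn) {
        return "ALTER TABLE " + quote(fkTable)
                + " ADD FOREIGN KEY (" + quote(fkColumn) + ")"
                + " REFERENCES " + quote(pkTable) + " (" + quote(pkColumn) + ");";
    }

    public static void main(String[] args) {
        // Hypothetical example schema: orders.customer_id points at customer.customer_id.
        System.out.println(primaryKeySql("customer", "customer_id"));
        System.out.println(foreignKeySql("orders", "customer_id", "customer", "customer_id"));
    }
}
```

The primary key statements should be run before the foreign key statements, because a foreign key can only reference a column that already has a primary key or unique constraint.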

Algorithm

The algorithm works exclusively on the metadata about tables and columns that is accessible over JDBC. That means that the algorithm is blazing fast (as it does not look at the actual data), works with any database out of the box (assuming a corresponding JDBC driver is provided), and produces estimates that are not affected by data quality (which can be either an advantage or a disadvantage).
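
To show what "metadata accessible over JDBC" looks like in practice, the following sketch reads column metadata with the standard `java.sql.DatabaseMetaData.getColumns` call. The class `MetadataReader`, the holder `ColumnMeta`, and the command-line arguments are assumptions made for this example, not the project's actual code; a JDBC driver for the target database must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

final class MetadataReader {
    // Hypothetical holder for the metadata of one column.
    static final class ColumnMeta {
        final String table;
        final String column;
        final int sqlType;   // a java.sql.Types constant
        final int size;      // column size as reported by the driver
        final int position;  // 1-based position of the column in its table

        ColumnMeta(String table, String column, int sqlType, int size, int position) {
            this.table = table;
            this.column = column;
            this.sqlType = sqlType;
            this.size = size;
            this.position = position;
        }

        @Override
        public String toString() {
            return table + "." + column + " (type=" + sqlType
                    + ", size=" + size + ", position=" + position + ")";
        }
    }

    // Reads the metadata of every column in the given schema; no table data is touched.
    static List<ColumnMeta> readColumns(Connection connection, String schema) throws SQLException {
        DatabaseMetaData meta = connection.getMetaData();
        List<ColumnMeta> columns = new ArrayList<>();
        // null catalog = do not filter by catalog; "%" matches all table and column names.
        try (ResultSet rs = meta.getColumns(null, schema, "%", "%")) {
            while (rs.next()) {
                columns.add(new ColumnMeta(
                        rs.getString("TABLE_NAME"),
                        rs.getString("COLUMN_NAME"),
                        rs.getInt("DATA_TYPE"),
                        rs.getInt("COLUMN_SIZE"),
                        rs.getInt("ORDINAL_POSITION")));
            }
        }
        return columns;
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 4) {
            System.err.println("usage: MetadataReader <jdbc-url> <user> <password> <schema>");
            return;
        }
        try (Connection connection = DriverManager.getConnection(args[0], args[1], args[2])) {
            for (ColumnMeta column : readColumns(connection, args[3])) {
                System.out.println(column);
            }
        }
    }
}
```

Note that `getColumns` treats the schema argument as a pattern and also lists the columns of views, so a real implementation would likely filter the result to base tables.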

Accuracy: The algorithm correctly identifies ~98% of primary keys and ~90% of foreign keys as measured on 70 different databases.

Limitation: The algorithm was designed to work on databases that use surrogate PKs. Detection of composite PKs (and FKs that use them) is not supported.

Primary Key

Primary keys are identified by:

Column position in the table (PKs are commonly at the beginning)
Data type (e.g. integers are preferred over doubles)
Presence of a keyword like "id" in the column name
Similarity of the column and table names as measured with Levenshtein distance
Repetition of the column name in other tables

Once these features are collected, they are passed to logistic regression to estimate the probability that the column is the primary key. Since each table can have at most a single PK, the column with the highest probability in the table is declared to be the primary key of the table.
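
The procedure above can be sketched as follows. This is an illustrative sketch only: the feature encoding, the class name `PrimaryKeySketch`, and especially the weights and bias are placeholders invented for the example, not the trained coefficients or the feature definitions used by this project.

```java
import java.sql.Types;

final class PrimaryKeySketch {
    // Levenshtein distance computed with two rolling rows of the DP table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // One feature per criterion in the list above (the encoding is an assumption).
    static double[] features(String table, String column, int position,
                             int sqlType, int otherTablesWithSameName) {
        String t = table.toLowerCase();
        String c = column.toLowerCase();
        boolean integerType = sqlType == Types.INTEGER
                || sqlType == Types.BIGINT || sqlType == Types.SMALLINT;
        int maxLen = Math.max(1, Math.max(t.length(), c.length()));
        return new double[] {
            1.0 / position,                              // earlier columns score higher (1-based)
            integerType ? 1 : 0,                         // integers preferred over e.g. doubles
            c.contains("id") ? 1 : 0,                    // naive keyword check
            1.0 - (double) levenshtein(t, c) / maxLen,   // column/table name similarity
            otherTablesWithSameName                      // repetition in other tables
        };
    }

    // Placeholder coefficients, NOT the values fitted by the project.
    static final double BIAS = -3.0;
    static final double[] WEIGHTS = {2.0, 1.0, 1.5, 1.0, 0.5};

    // Logistic regression: sigmoid of the weighted feature sum.
    static double probability(double[] x) {
        double z = BIAS;
        for (int i = 0; i < x.length; i++) z += WEIGHTS[i] * x[i];
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Index of the most probable column; that column is declared the PK of the table.
    static int argmax(double[] probabilities) {
        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            if (probabilities[i] > probabilities[best]) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical table "customer" with three columns.
        String[] names = {"customer_id", "name", "created"};
        double[] p = {
            probability(features("customer", "customer_id", 1, Types.INTEGER, 2)),
            probability(features("customer", "name", 2, Types.VARCHAR, 0)),
            probability(features("customer", "created", 3, Types.TIMESTAMP, 0))
        };
        System.out.println("Estimated PK: " + names[argmax(p)]);
    }
}
```

Taking the maximum per table, rather than applying a threshold, is what enforces the "at most one PK per table" constraint. It also means every table gets a PK estimate, even when no column looks like a strong candidate.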

Foreign keys

Foreign keys are identified by:

Known PKs (relationships must be between a PK column and a non-PK column)
Data types (relationships must be between agreeing data types)
Data type properties (e.g. data type sizes should agree)
Similarity of the column names as measured with Levenshtein distance
Dissimilarity of the FK name with the FK table name

Once again, probabilities are estimated with logistic regression. Since the count of foreign keys per table is not limited, all predictions above a threshold are declared to be foreign key constraints.
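
A minimal sketch of this step, under stated assumptions: PK/non-PK pairing and data type agreement are treated as hard filters, while size agreement and the two name-based criteria feed a logistic regression. The class `ForeignKeySketch`, the `Column` holder, the weights, the bias, and the threshold of 0.5 are all placeholders chosen for the example, not values from this project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ForeignKeySketch {
    // Hypothetical holder for one column, including whether it was estimated to be a PK.
    static final class Column {
        final String table, name;
        final int sqlType, size;
        final boolean primaryKey;

        Column(String table, String name, int sqlType, int size, boolean primaryKey) {
            this.table = table; this.name = name;
            this.sqlType = sqlType; this.size = size; this.primaryKey = primaryKey;
        }
    }

    // Levenshtein distance computed with two rolling rows of the DP table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Normalized name similarity in [0, 1]; 1 means identical (case-insensitive).
    static double similarity(String a, String b) {
        int maxLen = Math.max(1, Math.max(a.length(), b.length()));
        return 1.0 - (double) levenshtein(a.toLowerCase(), b.toLowerCase()) / maxLen;
    }

    // Placeholder coefficients and threshold, NOT the values fitted by the project.
    static final double BIAS = -5.5;
    static final double[] WEIGHTS = {1.0, 6.0, 1.0};
    static final double THRESHOLD = 0.5;

    static double probability(Column fk, Column pk) {
        double[] x = {
            fk.size == pk.size ? 1 : 0,            // data type sizes agree
            similarity(fk.name, pk.name),          // FK and PK column names are similar
            1.0 - similarity(fk.name, fk.table)    // FK name differs from its own table name
        };
        double z = BIAS;
        for (int i = 0; i < x.length; i++) z += WEIGHTS[i] * x[i];
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Returns every "fkTable.fkColumn -> pkTable.pkColumn" pair above the threshold.
    static List<String> estimate(List<Column> columns) {
        List<String> result = new ArrayList<>();
        for (Column pk : columns) {
            if (!pk.primaryKey) continue;
            for (Column fk : columns) {
                // Hard filters: the referencing column is not a PK and the types agree.
                if (fk.primaryKey || fk.sqlType != pk.sqlType) continue;
                if (probability(fk, pk) > THRESHOLD) {
                    result.add(fk.table + "." + fk.name + " -> " + pk.table + "." + pk.name);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical schema; 4 and 12 are java.sql.Types.INTEGER and VARCHAR.
        List<Column> columns = Arrays.asList(
                new Column("customer", "customer_id", 4, 10, true),
                new Column("orders", "order_id", 4, 10, true),
                new Column("orders", "customer_id", 4, 10, false),
                new Column("orders", "note", 12, 255, false));
        estimate(columns).forEach(System.out::println);
    }
}
```

Unlike the primary key step, no per-table maximum is taken here, so a table can end up with zero, one, or many foreign keys depending on how many pairs clear the threshold.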
