The ASyH is a software helping Clinics as holders of large quantities of highly restricted personal health data to provide the Medical Data Community with realistic datasets without the breach of privacy. It does this by synthesizing data with Machine Learning techniques which preserve data distribution and correlation while adding as much variation to the synthetic data, in order for it to have no resemblance to any of the original patient data entries.
For synthesis, metrics and quality assurance we will mainly use the Synthetic Data Vault (github).
Using pip, the easiest way to install/upgrade ASyH is
pip install --upgrade https://github.com/dieterich-lab/ASyH/tarball/v1.0.2
The most basic use case for ASyH is to create an ASyH Application object and call synthesize() to get a synthetic dataset from the best-performing SDV model/synthesizer (one of CopulaGAN, CTGAN, GaussianCopula, or TVAE [cf. the SDV documentation]). The input original dataset should be provided as a pandas DataFrame, the synthesized dataset is output as pandas DataFrame as well. For identification of numerical and categorical variables, a metadata file in JSON format needs to be provided (see below).
import ASyH
asyh = ASyH.Application()
synthetic_data = asyh.synthesize('original_data.csv', metadata_file='metadata.json')
# write the synthetic dataset to CSV file:
synthetic_data.to_csv(output_file, index=False)
Alternatively, you can specify an Excel file as first argument to asyh.synthesize(.,.)
Additionally, a report of the output data quality (in terms of similarity to the original data) can be generated with (appended to the above code, in the same script file)
import ASyH
import pandas
import json
# We will need the original dataset as pandas DataFrame
original_data = pandas.read_csv('input_data.csv')
# We also need the metadata as a dict:
with open('metadata.json', 'r', encoding='utf-8') as md_file:
metadata = json.load(md_file)
asyh = ASyH.Application()
synthetic_data = asyh.synthesize(input_data.csv', metadata_file='metadata.json')
# the following will create the md file
# report.md
# and, if an installation of TeXLive and pandoc is available
# report.pdf
report = ASyH.Report(original_data, synthetic_data, metadata)
report.generate('report', asyh.model.model_type)
you will find a zip archive with all images, the markdown file (if generated the PDF as well), and the synthetic data in a CSV file. Mind that the above code assumes that the metadata specifies the table name as 'data'.
ASyH uses SDV's metadata format (cf. 'Metadata' in the SDV documentation).
The skeleton of the JSON file should look like the following
{"columns":
{ ...column specifications...
},
"primary_key":...
}
Specifying a primary_key
is optional.
The column specifications
are of the form
"COLUMN_NAME": {"sdtype": "COLUMN_TYPE"}
or
"COLUMN_NAME": {"sdtype": "COLUMN_TYPE", "SPECIFIER": SPECIFIER_VALUE}
where COLUMN_NAME
is a column variable's name and COLUMN_TYPE
is on of (numerical, datetime, categorical, boolean, id)
. The SPECIFIER
/SPECIFIER_VALUE
pair to use depends on the sdtype
of the variable, it does not apply to boolean and categorical variables, otherwise, they are:
-
computer_representation
for numerical variables.
Allowed values are"Float"
,"Int8"
,"Int16"
,"Int32"
,"Int64"
,"UInt8"
,"UInt16"
,"UInt32"
,"UInt64"
-
regex_format
forid
variables.
The regex string should use Perl-style regular expression syntax (cf. also the Python documentation). -
datetime_format
is required for datetime type variables.
TheSPECIFIER_VALUE
for this specifier is a string in strftime format.
To do development on this software do this:
-
Check out the repository
-
Create a Python venv for the project
-
Activate the venv
-
Install the package editable (-e) with the test dependencies:
pip install -e '.[tests]'
To run the tests set the PYTHONPATH and execute pytest on the 'tests' folder:
export PYTHONPATH=$(pwd)
pytest tests
Release | Date |
---|---|
1.0.0 | 25/05/2023 |