DeepMicroCancer is a diagnostic model that uses transfer learning to classify multiple cancer types. It combines a Random Forest classifier with transfer learning. The predict.py module predicts the labels of input samples.
To run the predict module of DeepMicroCancer, use the following command:
python predict.py -i abundance.csv \
-l labels.csv \
-m model \
-t model_type \
-f fig_name \
-o output_directory
-i
: A CSV file containing the abundance of microbial communities in the sample. The rows represent the hosts and the columns represent the features. The abundance file should be generated by Kraken and preprocessed using Voom and SNM (supervised normalization) to reduce batch effects. More information on preprocessing can be found here. The format of the file should look like:
| | microbe1 | microbe2 | ... |
|---|---|---|---|
| host1 | 0.01 | 0.05 | ... |
| host2 | 0 | 0.02 | ... |
| ... | ... | ... | ... |
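As a quick sanity check before prediction, the layout above can be parsed with the standard library. This is a minimal sketch, not part of DeepMicroCancer: the function name `read_abundance` and the toy values are illustrative assumptions, and it only checks that every host row has one numeric value per feature.

```python
import csv
import io

def read_abundance(handle):
    """Parse an abundance table: first column is the host ID, the rest are features."""
    reader = csv.reader(handle)
    header = next(reader)
    features = header[1:]          # first header cell is the (possibly empty) index name
    table = {}
    for row in reader:
        if not row:
            continue               # skip blank lines
        host, values = row[0], row[1:]
        if len(values) != len(features):
            raise ValueError(f"{host}: expected {len(features)} values, got {len(values)}")
        table[host] = [float(v) for v in values]
    return features, table

# Toy data mirroring the table above (values are illustrative only).
example = io.StringIO(",microbe1,microbe2\nhost1,0.01,0.05\nhost2,0,0.02\n")
features, table = read_abundance(example)
print(features, table["host2"])
```

For a real file, pass an open file handle, for example `open("abundance.csv", newline="")`, instead of the in-memory example.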
-l
: Optional. A CSV file containing the label of each host. If provided, DeepMicroCancer will calculate the AUROC and plot the ROC curve. The first column contains the index of each host and the second column, named disease_type, contains the label of each host, for example:
SampleID | disease_type |
---|---|
host1 | status1 |
host2 | status2 |
host3 | status3 |
... | ... |
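Because the AUROC calculation needs a label for each host, it can help to confirm that the labels file and the abundance file refer to the same hosts. The sketch below is an assumption-laden illustration rather than DeepMicroCancer code: `read_labels` and `unmatched_hosts` are hypothetical helpers, and the first column is assumed to hold the host index as described above.

```python
import csv
import io

def read_labels(handle):
    """Return {host: disease_type}; the first column is taken as the host index."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    if "disease_type" not in fieldnames:
        raise ValueError("labels file needs a 'disease_type' column")
    index_col = fieldnames[0]
    return {row[index_col]: row["disease_type"] for row in reader}

def unmatched_hosts(abundance_hosts, labels):
    """Hosts found in only one file, as (missing_labels, missing_abundance)."""
    hosts = set(abundance_hosts)
    label_hosts = set(labels)
    return sorted(hosts - label_hosts), sorted(label_hosts - hosts)

# Toy labels in the format shown above (host and status names are placeholders).
labels = read_labels(io.StringIO("SampleID,disease_type\nhost1,status1\nhost2,status2\n"))
missing_labels, missing_abundance = unmatched_hosts(["host1", "host3"], labels)
print(missing_labels, missing_abundance)
```

Two empty lists mean every host appears in both files.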
-m
: A DeepMicroCancer model trained using build_model.py or transfer.py. Three models (tissue_model, blood_model, and tissue-to-blood_model) are available in the model directory. Choose the model that matches your sample type.
-t
: The type of model. Set this to independent if the model was trained using build_model.py, or to transfer if it was trained using transfer.py.
-f
: Optional. The file name for the saved AUROC plot.
-o
: The output directory for the prediction results. It will contain a CSV file with the predictions and, if labels are provided, an AUROC figure.
Using the test dataset and the tissue-to-blood_model as an example:
python predict.py -i data/blood/X_test.csv \
-l data/blood/y_test.csv \
-m models/tissue-blood_model \
-t transfer -f tissue-blood \
-o results/tissue-blood
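When predict.py is called from another Python program, building the argument list in code avoids shell quoting problems. The following is a minimal sketch using only the flags documented above. The helper `predict_command` is hypothetical, and the actual call is left commented out because it assumes the script is run from the repository root with the data and model in place.

```python
import subprocess
import sys

def predict_command(features, model, model_type, output, labels=None, fig_name=None):
    """Assemble the predict.py argument list; optional flags are added only when given."""
    if model_type not in ("independent", "transfer"):
        raise ValueError("model_type must be 'independent' or 'transfer'")
    cmd = [sys.executable, "predict.py", "-i", features, "-m", model,
           "-t", model_type, "-o", output]
    if labels is not None:
        cmd += ["-l", labels]
    if fig_name is not None:
        cmd += ["-f", fig_name]
    return cmd

cmd = predict_command("data/blood/X_test.csv", "models/tissue-blood_model", "transfer",
                      "results/tissue-blood", labels="data/blood/y_test.csv",
                      fig_name="tissue-blood")
print(" ".join(cmd[1:]))
# To actually run it from the repository root:
# subprocess.run(cmd, check=True)
```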
The project is divided into four scripts, each with its own function in the overall workflow.
To install the required packages, run the following command:
pip install -r requirements.txt
Unzip the data files in the data folder.
cat data/data.* > data/tmp.zip
unzip data/tmp.zip -d data
The split_dataset.py script splits the dataset into training and testing sets. It takes the paths to the features and labels files in CSV format as input and generates the training and testing datasets.
-x, --features: The path to a features file in CSV format; each row is a sample and each column is a feature
-y, --labels: The path to a labels file in CSV format; each row is a sample and the disease_type column is the label
-s, --test_size: The fraction of samples assigned to the test set (default = 0.3)
-o, --output: The path to save the output files
Split the tissue dataset and blood dataset into training and testing sets with a test size of 30% and 20% respectively.
python split_dataset.py -x data/tissue_snm.csv \
-s 0.3 \
-y data/tissue_meta.csv \
-o data/tissue
python split_dataset.py -x data/blood_snm.csv \
-s 0.2 \
-y data/blood_meta.csv \
-o data/blood
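To illustrate what a test size and a fixed seed mean in practice, here is a minimal standard-library sketch of a seeded hold-out split. It is not the implementation in split_dataset.py, which may use a library routine or stratify by disease_type. The function `split_hosts` and the host names are assumptions made for the example.

```python
import random

def split_hosts(hosts, test_size=0.3, seed=0):
    """Shuffle host IDs with a fixed seed and hold out a test_size fraction."""
    shuffled = list(hosts)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

hosts = [f"host{i}" for i in range(10)]
train, test = split_hosts(hosts, test_size=0.3)
print(len(train), len(test))
```

Calling `split_hosts` again with the same seed returns the same partition, which is the property that makes a seeded split reproducible.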
The build_model.py script builds the Random Forest classifier model. It takes the paths to the training features and labels files in CSV format as input and outputs a saved model. The saved model contains three files: model.joblib contains the model parameters, features.txt contains the features used to build the model, and label_encoder.joblib contains the label encoder used to encode the labels.
-x, --features: The path to a features file in CSV format; each row is a sample and each column is a feature
-y, --labels: The path to a labels file in CSV format; each row is a sample and the disease_type column is the label
-o, --output: The path to save the model
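Since a saved model consists of the three files named above, a short check can catch an incomplete model directory before it is passed to predict.py or transfer.py. This is a sketch only. The helper `missing_model_files` is hypothetical, and it checks file presence without loading or validating the contents.

```python
from pathlib import Path

EXPECTED_FILES = ("model.joblib", "features.txt", "label_encoder.joblib")

def missing_model_files(model_dir):
    """List the expected files that are absent from a saved-model directory."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

# An empty list means all three files are present.
print(missing_model_files("models/tissue_model"))
```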
Build the Random Forest classifier model for the tissue and blood datasets.
python build_model.py -x data/tissue/X_train.csv \
-y data/tissue/y_train.csv \
-o models/tissue_model
python build_model.py -x data/blood/X_train.csv \
-y data/blood/y_train.csv \
-o models/blood_model
We use seed 0 to split the dataset and seed 13 to build the model so that the results are reproducible. To change a seed, edit the seed variable in the split_dataset.py or build_model.py script.
The transfer.py script transfers a model from one dataset to another. It takes the path to the source model, the source features and labels files, and the target features and labels files in CSV format as input, and outputs a saved model. The source model should be built using the build_model.py script.
-s, --source_model: The path to the source model
-sf, --source_features: The path to a source features file in CSV format; each row is a sample and each column is a feature
-sl, --source_labels: The path to a source labels file in CSV format; each row is a sample and the disease_type column is the label
-tf, --target_features: The path to a target features file in CSV format; each row is a sample and each column is a feature
-tl, --target_labels: The path to a target labels file in CSV format; each row is a sample and the disease_type column is the label
-o, --output: The path to save the model
Transfer the tissue model to the blood dataset.
python transfer.py -s models/tissue_model \
-sf data/tissue/X_train.csv \
-sl data/tissue/y_train.csv \
-tf data/blood/X_train.csv \
-tl data/blood/y_train.csv \
-o models/tissue-blood_model
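Whether transfer.py requires the source and target tables to share exactly the same feature columns is not stated here, so comparing them beforehand is a cheap precaution. The sketch below assumes that the model's features are available as a list of names (for instance read from features.txt, assumed to hold one name per line). The function `compare_features` and the microbe names are illustrative only.

```python
def compare_features(model_features, input_columns):
    """Split model features into those present in and absent from the input columns."""
    available = set(input_columns)
    present = [f for f in model_features if f in available]
    absent = [f for f in model_features if f not in available]
    return present, absent

# Hypothetical feature names; in practice model_features could be read from
# features.txt (assumed here to hold one feature name per line).
model_features = ["microbe1", "microbe2", "microbe3"]
input_columns = ["microbe2", "microbe3", "microbe4"]
present, absent = compare_features(model_features, input_columns)
print(present, absent)
```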
The predict.py script predicts the labels of the testing dataset. It takes the following inputs: the path to the testing features file in CSV format; an optional labels file in CSV format (if provided, the script outputs the AUROC plot along with the predicted labels); a model built using the build_model.py or transfer.py script; the type of the model, either independent or transfer; an optional name for the saved AUROC figure; and the output path for the results.
-i, --input: The path to a features file in CSV format; each row is a sample and each column is a feature
-l, --labels: Optional. The path to a labels file in CSV format; each row is a sample and the disease_type column is the label
-m, --model: The path to the model
-t, --type: The type of the model, either independent or transfer
-f, --fig_name: Optional. The name of the figure to save
-o, --output: The path to save the results
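For readers unfamiliar with AUROC, the binary case can be computed directly as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below shows that definition in plain Python. It is not the code used by predict.py, whose handling of multiple cancer types (for example one-vs-rest averaging) is not described here.

```python
def binary_auroc(y_true, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Three of the four positive/negative pairs are ranked correctly.
print(binary_auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration but slower than rank-based library implementations on large test sets.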
Predict the labels of the testing dataset for the tissue using the tissue model.
python predict.py -i data/tissue/X_test.csv \
-l data/tissue/y_test.csv \
-m models/tissue_model \
-t independent \
-f tissue-tissue \
-o results/tissue-tissue
Predict the labels of the testing dataset for the blood using the blood model.
python predict.py -i data/blood/X_test.csv \
-l data/blood/y_test.csv \
-m models/blood_model \
-t independent \
-f blood-blood \
-o results/blood-blood
Predict the labels of the testing dataset for the blood using the tissue-blood model.
python predict.py -i data/blood/X_test.csv \
-l data/blood/y_test.csv \
-m models/tissue-blood_model \
-t transfer -f tissue-blood \
-o results/tissue-blood
The feature importances of the tissue model and the blood model for each cancer type are calculated using the feature_importances.py script. Each cancer type is treated as a binary classification problem, and the output is saved in the feature_importances folder.
Run the script using the following command:
python feature_importances.py
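Treating each cancer type as a binary problem amounts to relabeling the samples as "this type" versus "all others" once per type. The following sketch shows only that relabeling step. It is an illustration under that reading, not the code in feature_importances.py, and `one_vs_rest` and the status names are placeholders.

```python
def one_vs_rest(labels):
    """Map each class to a 0/1 vector marking which samples belong to it."""
    classes = sorted(set(labels))
    return {c: [int(label == c) for label in labels] for c in classes}

labels = ["status1", "status2", "status1", "status3"]
targets = one_vs_rest(labels)
for cancer_type, target in targets.items():
    print(cancer_type, target)
```

Each 0/1 vector could then serve as the target for one binary classifier, giving one set of feature importances per cancer type.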