In this work, we constructed Weighted Cell-Specific Networks (WCSN) based on highly variable genes, capturing both gene expression patterns and gene-gene interaction strengths. A graph neural network is then employed to extract features from the WCSN, enabling accurate cell type annotation. We term our model WCSGNet.
- ubuntu18.04
- RTX 4080 SUPER (16G)
Requirements | Release |
---|---|
CUDA | 12.7 |
Python | 3.8.18 |
torch | 1.12.1 |
torch_geometric | 2.5.1 |
numpy | 1.24.1 |
scikit-learn | 1.3.1 |
tqdm | 4.66.2 |
pandas | 2.0.3 |
matplotlib | 3.7.5 |
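To compare a local environment against the versions in the table, a small check such as the following may help. It is only a sketch: the helper name `check_versions` is hypothetical and not part of the repository, and CUDA and Python itself are not checked.

```python
from importlib import metadata

# Pinned versions copied from the requirements table above.
PINNED = {
    "torch": "1.12.1",
    "torch_geometric": "2.5.1",
    "numpy": "1.24.1",
    "scikit-learn": "1.3.1",
    "tqdm": "4.66.2",
    "pandas": "2.0.3",
    "matplotlib": "3.7.5",
}


def check_versions(pinned=PINNED):
    """Return {package: (expected, installed or None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        # torch wheels often carry a local suffix such as "+cu116"; ignore it.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_versions().items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every listed package matches the table; other versions may still work, but they were not the ones used here.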
This folder stores the code files.
-
DataPreprocessing
Jupyter notebooks for preprocessing individual datasets. The processed data generated by these notebooks is saved in the
dataset/pre_data/scRNAseq_datasets
directory.
Gene_interaction.ipynb
: Jupyter notebook for preprocessing high-confidence gene interaction data. The processed data is saved in the dataset/pre_data/Network
directory.
Mouse_genes_ncbi2.ipynb
: Jupyter notebook for preprocessing mouse genes.
-
data_partitioning.py
This script generates the five-fold cross-validation splits for the corresponding dataset. The generated files are stored in the
dataset/5fold_data
folder.
-
gene_filter.py
This script stores the indices of the highly variable genes (e.g., 2000) in the gene expression matrix for each dataset.
-
up_sample.py
This step performs up-sampling on cell types with fewer cells in the training set and generates the cell indices of the up-sampled training set. The result is saved as
*_train_index_imputed.npy
in the corresponding scRNA-seq dataset directory under dataset/5fold_data/.
-
wcsn_constr_train.py
This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold training set.
-
wcsn_constr_test.py
This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold testing set.
-
model.py
This file contains the code for the WCSGNet model.
-
datasets_wcsn.py
This file defines a custom PyTorch Geometric dataset class
MyDataset
, which is designed to handle graph data. It reads the WCSN data directly, with no additional processing of the WCSN required.
-
datasets_LT.py
This file defines a custom PyTorch Geometric dataset class
MyDataset2
, which is designed to handle graph data. Its main functionalities include applying a logarithmic transformation to the WCSN weights. -
datasets_BT.py
This file defines a custom PyTorch Geometric dataset class
MyDataset2
, which is designed to handle graph data. Its main functionality is a binary transformation that assigns a weight of 1 to all edges.
-
wcsn_classify_train.py
This step generates the 5-fold training set models using WCSN and saves them in
result/models
. -
wcsn_classify_test.py
This step generates the predicted results for the testing sets using WCSN and saves them in
result/preds
. -
LT_wcsn_classify_train.py
This step generates the 5-fold training set models using WCSN(logarithmic transformation) and saves them in
result/models_LT
. -
LT_wcsn_classify_test.py
This step generates the predicted results for the testing sets using WCSN(logarithmic transformation) and saves them in
result/preds_LT
. -
BT_wcsn_classify_train.py
This step generates the 5-fold training set models using WCSN(binary transformation) and saves them in
result/models_BT
. -
BT_wcsn_classify_test.py
This step generates the predicted results for the testing sets using WCSN(binary transformation) and saves them in
result/preds_BT
.
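The two edge-weight variants described for datasets_LT.py and datasets_BT.py can be illustrated on a plain weight array. This is a minimal sketch, not the repository code: the function names are hypothetical, and `log(1 + w)` is an assumption, since the exact logarithmic form used is not stated here.

```python
import numpy as np


def log_transform(edge_weight):
    """Logarithmic variant. Assumption: log(1 + w), which keeps zero weights at zero."""
    return np.log1p(np.asarray(edge_weight, dtype=float))


def binary_transform(edge_weight):
    """Binary variant: every existing edge gets weight 1, so only topology remains."""
    return np.ones_like(np.asarray(edge_weight, dtype=float))


if __name__ == "__main__":
    weights = [0.5, 2.0, 10.0]
    print(log_transform(weights))
    print(binary_transform(weights))
```

The logarithmic variant compresses large weights while preserving their order; the binary variant discards interaction strength entirely.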
Storage of Downloaded Raw Data
-
scRNAseq_Benchmark_datasets
The downloaded scRNA-seq datasets include: Muraro, Segerstolpe, Zheng 68k, Zhang_T, Kang, Baron, AMB, and TM.
Storing preprocessed data, five-fold splits of the running data, Entrez Gene IDs of genes, selected highly variable genes, generated high-confidence interaction subnetworks, WCSN data, etc.
-
pre_data
scRNAseq_datasets: Preprocessed scRNA-seq datasets generated using the
.ipynb
files located in the src/DataProcessing
directory.
-
5fold_data
Stores the data generated during the processing of each scRNA-seq dataset. This includes the five-fold splits of the dataset, the filtered list of highly variable genes, the indices obtained from up-sampling the training set, and the WCSNs generated for each fold of the training and testing sets.
-
Figures
Stores the result figures.
-
models
This folder contains the trained models and the models obtained from each fold of the cross validations by
src/wcsn_classify_train.py
-
models_LT
This folder contains the trained models and the models obtained from each fold of the cross validations by
src/LT_wcsn_classify_train.py
-
models_BT
This folder contains the trained models and the models obtained from each fold of the cross validations by
src/BT_wcsn_classify_train.py
-
preds
This folder contains the predicted results generated by
src/wcsn_classify_test.py
-
preds_LT
This folder contains the predicted results generated by
src/LT_wcsn_classify_test.py
. -
preds_BT
This folder contains the predicted results generated by
src/BT_wcsn_classify_test.py
.
Muraro, Segerstolpe, Zheng 68k, Baron, AMB, and TM: Available for direct download from Zenodo.
Zhang T: Accessible via GEO under accession number GSE108989.
Kang: Accessible via GEO under accession number GSE96583.
Save the downloaded files to the data/scRNAseq_Benchmark_datasets
directory.
-
Initial preprocessing
The scRNA-seq datasets downloaded in section 3.1.1 require initial preprocessing using the
ipynb
files in the src/DataProcessing
directory. The preprocessing steps include:
- Filtering out cell types with fewer than 10 cells and cells with unclear annotations.
- Filtering out genes expressed in fewer than 10 cells.
After preprocessing, the resulting data should be saved in the
dataset/pre_data/scRNAseq_datasets
directory.
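The two count-based filters listed above can be sketched with NumPy as follows. This is an illustration under assumptions, not the notebooks' code: the function name is hypothetical, cells are filtered before genes, a gene counts as "expressed" when its value is greater than zero, and the removal of cells with unclear annotations is dataset-specific and not shown.

```python
import numpy as np


def filter_cells_and_genes(expr, labels, min_cells_per_type=10, min_cells_per_gene=10):
    """expr: (n_cells, n_genes) count matrix; labels: (n_cells,) cell-type labels."""
    expr = np.asarray(expr)
    labels = np.asarray(labels)

    # Keep only cells whose type has at least `min_cells_per_type` members.
    types, counts = np.unique(labels, return_counts=True)
    kept_types = types[counts >= min_cells_per_type]
    cell_mask = np.isin(labels, kept_types)
    expr, labels = expr[cell_mask], labels[cell_mask]

    # Keep only genes with non-zero expression in at least `min_cells_per_gene` cells.
    gene_mask = (expr > 0).sum(axis=0) >= min_cells_per_gene
    return expr[:, gene_mask], labels, cell_mask, gene_mask
```

Returning both masks makes it possible to map filtered rows and columns back to the original cell and gene identifiers.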
-
Five-Fold Cross-Validation Splits
! python src/data_partitioning.py
Optional parameters
- -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz', Specify the scRNA-seq dataset.
- -outdir: Default='dataset/5fold_data/', Specify the output directory.
- --n_splits: Default=5, Indicates Five-fold cross-validation.
This step generates a seq_dict.npz
file for each dataset located in the dataset/5fold_data/
directory. These files are used to store the five-fold cross-validation splits for the corresponding dataset, ensuring consistent and reproducible training and evaluation.
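For intuition, a stratified five-fold split can be produced with a few lines of NumPy. This is a sketch under assumptions: `stratified_folds` is a hypothetical helper, the stratification by cell type and the fixed seed are illustrative choices, and the internal layout of seq_dict.npz is not reproduced here.

```python
import numpy as np


def stratified_folds(labels, n_splits=5, seed=0):
    """Return a list of (train_idx, test_idx) pairs, stratified by label."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for cell_type in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cell_type))
        # Deal the shuffled cells of this type round-robin across the folds.
        fold_of[idx] = np.arange(len(idx)) % n_splits
    return [
        (np.flatnonzero(fold_of != k), np.flatnonzero(fold_of == k))
        for k in range(n_splits)
    ]


if __name__ == "__main__":
    demo_labels = ["alpha"] * 10 + ["beta"] * 5
    for k, (train_idx, test_idx) in enumerate(stratified_folds(demo_labels), start=1):
        print(f"fold {k}: {len(train_idx)} train cells, {len(test_idx)} test cells")
```

Storing the indices once and reusing them in every later step is what keeps training and evaluation comparable across runs.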
-
Up-sampling
! python src/up_sample.py
Optional parameters
-
-expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz', Specify the scRNA-seq dataset.
-
-outdir: Default='dataset/5fold_data/', Specify the output directory.
-
--n_splits: Default=5, Indicates Five-fold cross-validation.
This step performs up-sampling on cell types with fewer cells in the training set and generates the cell indices of the up-sampled training set. The result is saved as
*_train_index_imputed.npy
in the corresponding scRNA-seq dataset directory under dataset/5fold_data/.
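One simple way to up-sample rare cell types is to repeat their training indices at random until each type reaches a minimum count. The sketch below assumes that strategy; the helper name, the `min_count` parameter, and the median-based default are all hypothetical, and the rule used by up_sample.py may differ.

```python
import numpy as np


def upsample_indices(train_idx, labels, min_count=None, seed=0):
    """Repeat indices of rare cell types until each has at least `min_count` cells."""
    train_idx = np.asarray(train_idx)
    train_labels = np.asarray(labels)[train_idx]
    rng = np.random.default_rng(seed)
    types, counts = np.unique(train_labels, return_counts=True)
    if min_count is None:
        # Assumption: bring every type up to the median class size.
        min_count = int(np.median(counts))
    parts = [train_idx]
    for cell_type, count in zip(types, counts):
        if count < min_count:
            pool = train_idx[train_labels == cell_type]
            # Sample with replacement, since the pool is smaller than the target.
            parts.append(rng.choice(pool, size=min_count - count, replace=True))
    return np.concatenate(parts)
```

Only indices are duplicated, so the expression matrix itself stays untouched and the test set is never affected.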
-
Selection of highly variable genes (HVGs)
! python src/gene_filter.py
Optional parameters
-
-expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz', Specify the scRNA-seq dataset.
-
-outdir: Default='dataset/5fold_data/', Specify the output directory.
-
-hvgs: Default=2000, Specify the number of HVGs.
This step generates a
.npy
file for each dataset. The file stores the indices of the selected HVGs (2000 by default) in the gene expression matrix.
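As an illustration of what "indices of the HVGs in the gene expression matrix" means, the sketch below ranks genes by the variance of log-transformed expression and returns the top column indices. The ranking criterion is an assumption made for this example; gene_filter.py may use a different variability measure, and the function name is hypothetical.

```python
import numpy as np


def top_variable_gene_indices(expr, n_hvgs=2000):
    """Return sorted column indices of the `n_hvgs` genes with the largest variance."""
    expr = np.asarray(expr, dtype=float)
    # Assumption: variance of log(1 + x) as the variability score.
    variances = np.log1p(expr).var(axis=0)
    n_hvgs = min(n_hvgs, expr.shape[1])
    # argsort is ascending, so the last n_hvgs entries are the most variable genes.
    top = np.argsort(variances, kind="stable")[-n_hvgs:]
    return np.sort(top)
```

An index array of this kind can be written with `np.save` and later used to slice the expression matrix as `expr[:, indices]`.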
! python src/wcsn_constr_train.py
Optional parameters
- -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
- -outdir: Default='dataset/5fold_data/'. Specify the output directory.
- -cuda: Default=True.
- -hvgs: Default=2000, The number of HVGs.
- -ca: Default=0.01, Significance level.
- --n_splits: Default=5, Indicates Five-fold cross-validation.
This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold training set. The graph for each training set cell is saved as a .pt
file in the processed
folder of the corresponding fold (e.g., train_f1
) within the WCSN_a0.01_hvgs2000
folder, which is located under the corresponding dataset folder in dataset/5fold_data/
.
! python src/wcsn_constr_test.py
Optional parameters
- -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
- -outdir: Default='dataset/5fold_data/'. Specify the output directory.
- -cuda: Default=True.
- -hvgs: Default=2000, The number of HVGs.
- -ca: Default=0.01, Significance level.
- --n_splits: Default=5, Indicates Five-fold cross-validation.
This step constructs a 5-fold WCSN based on highly variable genes for each scRNA-seq dataset's testing set. The graph for each testing set cell is saved as a .pt
file in the processed
folder of the corresponding fold (e.g., test_f1
) within the WCSN_a0.01_hvgs2000
folder, which is located under the corresponding dataset folder in dataset/5fold_data/
.
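The directory layout described for the training and testing graphs can be composed programmatically, which is handy when checking that a construction step finished. This is a sketch: `wcsn_processed_dir` is a hypothetical helper, and naming the dataset folder after the `.npz` file stem is an assumption.

```python
from pathlib import PurePosixPath


def wcsn_processed_dir(expr, outdir="dataset/5fold_data/", ca=0.01, hvgs=2000,
                       split="train", fold=1):
    """Compose the directory expected to hold the per-cell .pt graphs for one fold."""
    dataset = PurePosixPath(expr).stem      # assumption: folder named after the .npz file
    network = f"WCSN_a{ca}_hvgs{hvgs}"      # e.g. WCSN_a0.01_hvgs2000
    return PurePosixPath(outdir) / dataset / network / f"{split}_f{fold}" / "processed"


if __name__ == "__main__":
    expr_file = "dataset/pre_data/scRNAseq_datasets/Muraro.npz"
    for split in ("train", "test"):
        print(wcsn_processed_dir(expr_file, split=split, fold=1))
```

Changing `-ca` or `-hvgs` changes the WCSN folder name, so graphs built with different settings do not overwrite each other.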
Training
! python src/wcsn_classify_train.py
Optional parameters
-
-expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
-
-outdir: Default='result/models'. Specify the output directory.
-
-ca: Default=0.01, Significance level.
-
-hvgs: Default=2000, The number of HVGs.
-
-bs: Default=32, The batch size of this training.
This step generates the 5-fold training set models using WCSN and saves them in result/models
.
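The training options listed above map naturally onto `argparse`. The parser below is a sketch that mirrors the documented flags and defaults; the types and help strings are assumptions, and the actual script may define further options.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented training options."""
    parser = argparse.ArgumentParser(description="Sketch of the training CLI")
    parser.add_argument("-expr", default="dataset/pre_data/scRNAseq_datasets/Muraro.npz",
                        help="input scRNA-seq dataset")
    parser.add_argument("-outdir", default="result/models", help="output directory")
    parser.add_argument("-ca", type=float, default=0.01, help="significance level")
    parser.add_argument("-hvgs", type=int, default=2000, help="number of HVGs")
    parser.add_argument("-bs", type=int, default=32, help="batch size")
    return parser


if __name__ == "__main__":
    # parse_args() reads sys.argv, so any flag not given keeps its documented default.
    args = build_parser().parse_args()
    print(vars(args))
```

For example, passing `-expr dataset/pre_data/scRNAseq_datasets/Baron.npz -bs 64` would override only the dataset and the batch size.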
Testing
! python src/wcsn_classify_test.py
Optional parameters
-
-expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
-
-outdir: Default='result/models'. Specify the output directory.
-
-ca: Default=0.01, Significance level.
-
-hvgs: Default=2000, The number of HVGs.
-
-bs: Default=32.
This step generates the predicted results for the testing sets and saves them in result/preds
.
The results include:
*_Prediction.h5
: Contains the true labels and predicted labels for the test set cells, the probability matrix for each predicted cell type, and the cell embeddings for each cell.
*_F1.csv
: Includes the accuracy, Mean F1-Score, and the F1-Score for each cell type.
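The metrics reported in `*_F1.csv` can be recomputed from true and predicted labels. The sketch below assumes that Mean F1-Score is the unweighted mean of the per-type F1 scores over the cell types present in the true labels; the function name is hypothetical, and reading the labels out of `*_Prediction.h5` is not shown because its internal keys are not documented here.

```python
import numpy as np


def score_predictions(y_true, y_pred):
    """Return (accuracy, mean F1, {cell type: F1}) for two label arrays."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float((y_true == y_pred).mean())
    f1_per_type = {}
    for cell_type in np.unique(y_true):
        tp = np.sum((y_pred == cell_type) & (y_true == cell_type))
        fp = np.sum((y_pred == cell_type) & (y_true != cell_type))
        fn = np.sum((y_pred != cell_type) & (y_true == cell_type))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        f1_per_type[str(cell_type)] = float(2 * tp / denom) if denom else 0.0
    mean_f1 = float(np.mean(list(f1_per_type.values())))
    return accuracy, mean_f1, f1_per_type
```

Because the mean is unweighted, rare cell types influence Mean F1-Score as much as abundant ones, which is why it can differ noticeably from accuracy.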
Training
! python src/LT_wcsn_classify_train.py
The optional parameters are mostly the same as those in wcsn_classify_train.py.
This step generates the 5-fold training set models using WCSN(logarithmic transformation) and saves them in result/models_LT
.
Testing
! python src/LT_wcsn_classify_test.py
The optional parameters are mostly the same as those in wcsn_classify_test.py.
This step generates the predicted results for the testing sets using WCSN(logarithmic transformation) and saves them in result/preds_LT
. The results include *_Prediction.h5
and *_F1.csv
.
Training
! python src/BT_wcsn_classify_train.py
The optional parameters are mostly the same as those in wcsn_classify_train.py.
This step generates the 5-fold training set models using WCSN(binary transformation) and saves them in result/models_BT
.
Testing
! python src/BT_wcsn_classify_test.py
The optional parameters are mostly the same as those in wcsn_classify_test.py.
This step generates the predicted results for the testing sets using WCSN(binary transformation) and saves them in result/preds_BT
. The results include *_Prediction.h5
and *_F1.csv
.
All plotting code is in
src/Figures/
-
Figure 2
src/Figures/Figure2.ipynb
Sankey diagram of the different datasets under WCSGNet's 5-fold cross-validation.
-
Figure 3(A) and (B)
src/Figures/Figure3AB.ipynb
Performance of WCSGNet with different edge weight representation methods, including the original method, logarithmic transformation, and binary transformation.
-
Figure 4(A-N)
BaronHuman_analysis.ipynb
src/Figures/R/Figure4A-M.R
Top degree gene analysis of WCSNs for different cell types in the Baron Human dataset.
-
Figure 5(A-N)
BaronHuman_analysis.ipynb
src/Figures/R/Figure5A-M.R
Top high-weight edges analysis of WCSNs for different cell types in the Baron Human dataset.
-
Figure 6(A-N)
src/Figures/Figure6.py
T-SNE visualization and feature analysis of the Baron Human dataset using WCSGNet.
-
Figure 7(A-H)
src/Figures/AMB_analysis.ipynb
src/Figures/R/Figure7A-D.R
src/Figures/R/Figure7E-H.R
Analysis of top degree genes and high-weight edges in WCSN for Different Cell Types on the AMB Dataset.
-
Figure 8(A) and (B)
src/Figures/Figure7AB.ipynb
Mean F1-score and accuracy comparison of WCSGNet on Muraro and Baron Mouse datasets curated with varying numbers of HVGs.
The following factors may result in slight differences in the Mean F1-score and Accuracy for cell type classification when reproducing the results, compared to those reported in the paper.
- The DataLoader applies a shuffle operation on the training dataset during model training, leading to some randomness in the input sequence of the training data.
- The use of the Dropout mechanism in the model introduces variability in the trained models across different runs.
- Parameter initialization also produces some randomness.
However, these differences do not have a disruptive impact on the conclusions of the paper.
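To reduce run-to-run variation from the sources listed above, the random number generators can be seeded before training. This is a sketch and not part of the repository: `set_seed` is a hypothetical helper, and some GPU operations may remain non-deterministic even with these settings, so results may still not match the reported numbers exactly.

```python
import random

import numpy as np


def set_seed(seed=0):
    """Seed Python, NumPy and (if installed) PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
    except ImportError:
        return  # torch is not installed; nothing more to seed
    torch.manual_seed(seed)               # affects parameter initialization and dropout
    torch.cuda.manual_seed_all(seed)      # no effect when CUDA is unavailable
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Calling `set_seed` once at the start of a training script, before the model and the DataLoader are created, covers shuffling, dropout, and initialization.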