Yi-ellen/WCSGNet

WCSGNet

In this work, we constructed Weighted Cell-Specific Networks (WCSN) based on highly variable genes, capturing both gene expression patterns and gene-gene interaction strengths. A graph neural network is then employed to extract features from the WCSN, enabling accurate cell type annotation. We term our model WCSGNet.

1. Platform and Dependency

1.1 Platform

  • Ubuntu 18.04
  • RTX 4080 SUPER (16G)

1.2 Dependency

Requirements Release
CUDA 12.7
Python 3.8.18
torch 1.12.1
torch_geometric 2.5.1
numpy 1.24.1
scikit-learn 1.3.1
tqdm 4.66.2
pandas 2.0.3
matplotlib 3.7.5

2. Project Catalog Structure

2.1 src

This folder stores the code files.

  • DataPreprocessing

    Jupyter notebooks for preprocessing the individual datasets. The processed data generated by these notebooks is saved in the dataset/pre_data/scRNAseq_datasets directory.

    Gene_interaction.ipynb: Jupyter notebook for preprocessing high-confidence gene interaction data. The processed data is saved in the dataset/pre_data/Network directory.

    Mouse_genes_ncbi2.ipynb: Jupyter notebook for preprocessing mouse genes.

  • data_partitioning.py

    This script generates the five-fold cross-validation splits for the corresponding dataset. The generated files are stored in the dataset/5fold_data folder.

  • gene_filter.py

    This script stores the indices of the highly variable genes (e.g., 2000) in the gene expression matrix for each dataset.

  • up_sample.py

    This step performs up-sampling on cell types with fewer cells in the training set and generates the cell indices of the up-sampled training set. The result is saved as *_train_index_imputed.npy in the corresponding scRNA-seq dataset directory under dataset/5fold_data/.

  • wcsn_constr_train.py

    This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold training set.

  • wcsn_constr_test.py

    This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold testing set.

  • model.py

    This file contains the code for the WCSGNet model.

  • datasets_wcsn.py

    This file defines a custom PyTorch Geometric dataset class MyDataset, which is designed to handle graph data. It reads the WCSN data directly; no additional processing of the WCSN is required.

  • datasets_LT.py

    This file defines a custom PyTorch Geometric dataset class MyDataset2, which is designed to handle graph data. Its main function is to apply a logarithmic transformation to the WCSN weights.

  • datasets_BT.py

    This file defines a custom PyTorch Geometric dataset class MyDataset2, which is designed to handle graph data. Its main function is to apply a binary transformation that assigns a weight of 1 to all edges.

  • wcsn_classify_train.py

    This step generates the 5-fold training set models using WCSN and saves them in result/models.

  • wcsn_classify_test.py

    This step generates the predicted results for the testing sets using WCSN and saves them in result/preds.

  • LT_wcsn_classify_train.py

    This step generates the 5-fold training set models using WCSN(logarithmic transformation) and saves them in result/models_LT.

  • LT_wcsn_classify_test.py

    This step generates the predicted results for the testing sets using WCSN(logarithmic transformation) and saves them in result/preds_LT.

  • BT_wcsn_classify_train.py

    This step generates the 5-fold training set models using WCSN(binary transformation) and saves them in result/models_BT.

  • BT_wcsn_classify_test.py

    This step generates the predicted results for the testing sets using WCSN(binary transformation) and saves them in result/preds_BT.

2.2 data

Stores the downloaded raw data.

  • scRNAseq_Benchmark_datasets

    The downloaded scRNA-seq datasets include: Muraro, Segerstolpe, Zheng 68k, Zhang_T, Kang, Baron, AMB, and TM.

2.3 dataset

Stores the preprocessed data, the five-fold splits of each dataset, the Entrez Gene IDs of genes, the selected highly variable genes, the generated high-confidence interaction subnetworks, the WCSN data, etc.

  • pre_data

    scRNAseq_datasets: Preprocessed scRNA-seq datasets generated using the .ipynb files located in the src/DataPreprocessing directory.

  • 5fold_data

    Stores the data generated during the processing of each scRNA-seq dataset. This includes the five-fold splits of the dataset, the filtered list of highly variable genes, the indices obtained from up-sampling the training set, and the WCSNs generated for each fold of the training and testing sets.

2.4 result

  • Figures

    Stores the result figures.

  • models

    This folder contains the trained models and the models obtained from each fold of the cross validations by src/wcsn_classify_train.py

  • models_LT

    This folder contains the trained models and the models obtained from each fold of the cross validations by src/LT_wcsn_classify_train.py

  • models_BT

    This folder contains the trained models and the models obtained from each fold of the cross validations by src/BT_wcsn_classify_train.py

  • preds

    This folder contains the predicted results generated by src/wcsn_classify_test.py

  • preds_LT

    This folder contains the predicted results generated by src/LT_wcsn_classify_test.py.

  • preds_BT

    This folder contains the predicted results generated by src/BT_wcsn_classify_test.py.

3. Workflow

3.1 Data Collection and Preprocessing

3.1.1 Download scRNA-seq dataset

Muraro, Segerstolpe, Zheng 68k, Baron, AMB, and TM: Available for direct download from Zenodo.

Zhang T: Accessible via GEO under accession number GSE108989.

Kang: Accessible via GEO under accession number GSE96583.

Save to the data/scRNAseq_Benchmark_datasets directory.

3.1.2 scRNA-seq preprocessing
  • Initial preprocessing

    The scRNA-seq datasets downloaded in Section 3.1.1 require initial preprocessing using the .ipynb files in the src/DataPreprocessing directory. The preprocessing steps include:

    1. Filtering out cell types with fewer than 10 cells and cells with unclear annotations.
    2. Filtering out genes expressed in fewer than 10 cells.

    After preprocessing, save the resulting data in the dataset/pre_data/scRNAseq_datasets directory.
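The two filtering rules above can be sketched in a few lines of plain Python. This is only an illustration, not the code from the notebooks: the function name filter_dataset, the cells-by-genes list layout, the set of labels treated as "unclear", and the order of the two steps are all assumptions.

```python
from collections import Counter

def filter_dataset(expr, labels, min_cells=10, unclear=("unclear", "unknown", "")):
    """Filter a cells x genes matrix (list of rows) and its cell-type labels.

    Hypothetical helper: drops cells with unclear labels and cell types with
    fewer than `min_cells` cells, then drops genes expressed in fewer than
    `min_cells` of the remaining cells.
    """
    # Step 1: count cells per type, ignoring unclear annotations.
    counts = Counter(lab for lab in labels if lab not in unclear)
    keep_cells = [i for i, lab in enumerate(labels)
                  if lab not in unclear and counts[lab] >= min_cells]
    # Step 2: keep genes with non-zero expression in at least `min_cells` kept cells.
    n_genes = len(expr[0]) if expr else 0
    keep_genes = [g for g in range(n_genes)
                  if sum(1 for i in keep_cells if expr[i][g] > 0) >= min_cells]
    filtered = [[expr[i][g] for g in keep_genes] for i in keep_cells]
    return filtered, [labels[i] for i in keep_cells], keep_genes
```

In practice the notebooks would operate on NumPy or pandas objects; the list-based version is used here only to keep the sketch dependency-free.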
  • Five-Fold Cross-Validation Splits

! python src/data_partitioning.py

Optional parameters

  • -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz', Specify the scRNA-seq dataset.
  • -outdir: Default='dataset/5fold_data/', Specify the output directory.
  • --n_splits: Default=5, Indicates Five-fold cross-validation.

This step generates a seq_dict.npz file for each dataset located in the dataset/5fold_data/ directory. These files are used to store the five-fold cross-validation splits for the corresponding dataset, ensuring consistent and reproducible training and evaluation.
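As a rough picture of what a five-fold split involves, the sketch below deals the cells of each type round-robin into n_splits folds. It assumes a stratified split with a fixed seed; the README does not state whether data_partitioning.py stratifies, and the function name and return format here are hypothetical (the real script writes seq_dict.npz).

```python
import random
from collections import defaultdict

def stratified_kfold(labels, n_splits=5, seed=0):
    """Return a list of (train_idx, test_idx) pairs, one per fold."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    by_type = defaultdict(list)
    for i, lab in enumerate(labels):
        by_type[lab].append(i)
    folds = [[] for _ in range(n_splits)]
    for lab in sorted(by_type):
        idx = by_type[lab]
        rng.shuffle(idx)
        # Deal the shuffled cells round-robin so each fold gets a similar share.
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    all_idx = set(range(len(labels)))
    return [(sorted(all_idx - set(f)), sorted(f)) for f in folds]
```

Each cell appears in exactly one test fold, and calling the function again with the same seed returns the same splits.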

  • Up-sampling

    ! python src/up_sample.py

    Optional parameters

    • -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz', Specify the scRNA-seq dataset.

    • -outdir: Default='dataset/5fold_data/', Specify the output directory.

    • --n_splits: Default=5, Indicates Five-fold cross-validation.

    This step performs up-sampling on cell types with fewer cells in the training set and generates the cell indices of the up-sampled training set. The result is saved as *_train_index_imputed.npy in the corresponding scRNA-seq dataset directory under dataset/5fold_data/.
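One simple way to up-sample rare cell types is to resample their training indices with replacement until each type reaches a minimum count. The sketch below illustrates that idea only; the threshold min_count, the function name, and the sampling scheme are assumptions and may differ from what up_sample.py does.

```python
import random
from collections import defaultdict

def upsample_indices(train_idx, labels, min_count=50, seed=0):
    """Return training indices in which every cell type has at least
    `min_count` entries (hypothetical threshold), adding duplicates of
    rare-type cells sampled with replacement."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for i in train_idx:
        by_type[labels[i]].append(i)
    out = list(train_idx)  # original cells are always kept
    for lab in sorted(by_type):
        idx = by_type[lab]
        deficit = min_count - len(idx)
        if deficit > 0:
            out.extend(rng.choices(idx, k=deficit))
    return out
```

Only indices are duplicated, so the expression matrix itself is left untouched, which matches the README's description of the output as a list of cell indices.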

  • Selection of highly variable genes (HVGs)

    ! python src/gene_filter.py

    Optional parameters

    • -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz', Specify the scRNA-seq dataset.

    • -outdir: Default='dataset/5fold_data/', Specify the output directory.

    • -hvgs: Default=2000, Specify the number of HVGs.

    This step generates a .npy file for each dataset, containing 2000 HVGs. The file stores the indices of the highly variable genes in the gene expression matrix.
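The README does not say which variability measure gene_filter.py uses. As an illustration, the sketch below ranks genes by dispersion (variance divided by mean) and returns the column indices of the top n_hvgs genes; the measure, the function name, and the list-based matrix layout are assumptions.

```python
from statistics import fmean, pvariance

def top_hvg_indices(expr, n_hvgs=2000):
    """Rank genes (columns of a cells x genes matrix) by dispersion and
    return the column indices of the top `n_hvgs`, in ascending index order."""
    n_genes = len(expr[0])
    scores = []
    for g in range(n_genes):
        col = [row[g] for row in expr]
        mu = fmean(col)
        # Dispersion = variance / mean; genes with zero mean get score 0.
        scores.append(pvariance(col, mu) / mu if mu > 0 else 0.0)
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return sorted(ranked[:n_hvgs])
```

Returning indices rather than gene names mirrors the description above: the saved file indexes into the gene expression matrix.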

3.2 WCSN Construction

3.2.1 WCSN construction for reference dataset

! python src/wcsn_constr_train.py

Optional parameters

  • -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
  • -outdir: Default='dataset/5fold_data/'. Specify the output directory.
  • -cuda: Default=True.
  • -hvgs: Default=2000, The number of HVGs.
  • -ca: Default=0.01, Significance level.
  • --n_splits: Default=5, Indicates Five-fold cross-validation.

This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold training set. The graph for each training set cell is saved as a .pt file in the processed folder of the corresponding fold (e.g., train_f1) within the WCSN_a0.01_hvgs2000 folder, which is located under the corresponding dataset folder in dataset/5fold_data/.
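The README does not define the edge statistic. For orientation only, the sketch below assumes the normalized statistic of the cell-specific network (CSN) approach: for cell k and genes x and y, count the cells that fall in a neighbourhood of cell k along x, along y, and along both, then test the result against a standard normal quantile at significance level alpha (the role that -ca appears to play). Keeping the statistic as the edge weight, the box fraction of 0.1, the nearest-neighbour definition of the box, and both function names are assumptions; wcsn_constr_train.py may differ.

```python
from math import sqrt
from statistics import NormalDist

def _box(values, k, m):
    """Indices of the m cells whose value is closest to cell k's (k included)."""
    order = sorted(range(len(values)), key=lambda i: (abs(values[i] - values[k]), i))
    return set(order[:m])

def csn_edge_weight(x, y, k, box=0.1, alpha=0.01):
    """Weight of the edge between genes x and y in cell k, or 0.0 if the
    dependence is not significant. x and y hold one value per cell."""
    n = len(x)
    m = max(1, round(box * n))           # number of cells in each box
    bx, by = _box(x, k, m), _box(y, k, m)
    n_x, n_y, n_xy = len(bx), len(by), len(bx & by)
    # Normalized statistic; approximately standard normal if x and y are independent.
    rho_hat = (sqrt(n - 1) * (n * n_xy - n_x * n_y)
               / sqrt(n_x * n_y * (n - n_x) * (n - n_y)))
    threshold = NormalDist().inv_cdf(1 - alpha)
    return rho_hat if rho_hat > threshold else 0.0
```

Repeating this for every pair of highly variable genes yields one weighted graph per cell, which is the kind of object the script saves as a .pt file.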

3.2.2 WCSN construction for query dataset

! python src/wcsn_constr_test.py

Optional parameters

  • -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
  • -outdir: Default='dataset/5fold_data/'. Specify the output directory.
  • -cuda: Default=True.
  • -hvgs: Default=2000, The number of HVGs.
  • -ca: Default=0.01, Significance level.
  • --n_splits: Default=5, Indicates Five-fold cross-validation.

This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold testing set. The graph for each testing set cell is saved as a .pt file in the processed folder of the corresponding fold (e.g., test_f1) within the WCSN_a0.01_hvgs2000 folder, which is located under the corresponding dataset folder in dataset/5fold_data/.
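The directory layout described in Sections 3.2.1 and 3.2.2 can be expressed as a small path helper. The folder pattern WCSN_a0.01_hvgs2000 and the train_f1 / test_f1 names come from the text above; naming the dataset folder after the input file stem (e.g., Muraro) and the function name itself are assumptions.

```python
from pathlib import PurePosixPath

def wcsn_fold_dir(expr, outdir="dataset/5fold_data/", ca=0.01, hvgs=2000,
                  split="train", fold=1):
    """Build the folder that would hold the per-cell .pt graphs of one fold."""
    dataset = PurePosixPath(expr).stem  # assumption: dataset folder = file stem
    return (PurePosixPath(outdir) / dataset / f"WCSN_a{ca}_hvgs{hvgs}"
            / f"{split}_f{fold}" / "processed")

print(wcsn_fold_dir("dataset/pre_data/scRNAseq_datasets/Muraro.npz"))
print(wcsn_fold_dir("dataset/pre_data/scRNAseq_datasets/Muraro.npz",
                    split="test", fold=3))
```

Changing -ca or -hvgs therefore writes to a different WCSN_* folder, so graphs built with different settings do not overwrite each other under this layout.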

3.3 Training and Prediction

3.3.1 Training and prediction using WCSN

Training

! python src/wcsn_classify_train.py

Optional parameters

  • -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.

  • -outdir: Default='result/models'. Specify the output directory.

  • -ca: Default=0.01, Significance level.

  • -hvgs: Default=2000, The number of HVGs.

  • -bs: Default=32, The batch size of this training.

This step generates the 5-fold training set models using WCSN and saves them in result/models.

Testing

! python src/wcsn_classify_test.py

Optional parameters

  • -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.

  • -outdir: Default='result/models'. Specify the output directory.

  • -ca: Default=0.01, Significance level.

  • -hvgs: Default=2000, The number of HVGs.

  • -bs: Default=32.

This step generates the predicted results for the testing sets and saves them in result/preds.

The results include:

*_Prediction.h5: Contains the true labels and predicted labels for the test set cells, the probability matrix for each predicted cell type, and the cell embeddings for each cell.

*_F1.csv: Includes the accuracy, Mean F1-Score, and the F1-Score for each cell type.
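For readers who want to recompute the values in *_F1.csv from the true and predicted labels, the sketch below shows the standard definitions. It assumes that "Mean F1-Score" is the unweighted average of the per-cell-type F1-scores over the types present in the true labels; the function name is hypothetical and the repository's own computation may differ in detail.

```python
def classification_report(y_true, y_pred):
    """Return (accuracy, mean_f1, per_type_f1) for two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    per_type = {}
    for t in sorted(set(y_true)):
        tp = sum(1 for a, b in pairs if a == t and b == t)
        fp = sum(1 for a, b in pairs if a != t and b == t)
        fn = sum(1 for a, b in pairs if a == t and b != t)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        per_type[t] = 2 * tp / denom if denom else 0.0
    accuracy = sum(1 for a, b in pairs if a == b) / len(pairs)
    mean_f1 = sum(per_type.values()) / len(per_type)
    return accuracy, mean_f1, per_type
```

Because the mean is unweighted, rare cell types count as much as abundant ones, which is why Mean F1-Score and accuracy can diverge on imbalanced datasets.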

3.3.2 Training and prediction using WCSN(logarithmic transformation)

Training

! python src/LT_wcsn_classify_train.py

The optional parameters are mostly the same as those in wcsn_classify_train.py.

This step generates the 5-fold training set models using WCSN(logarithmic transformation) and saves them in result/models_LT.

Testing

! python src/LT_wcsn_classify_test.py

The optional parameters are mostly the same as those in wcsn_classify_test.py.

This step generates the predicted results for the testing sets using WCSN(logarithmic transformation) and saves them in result/preds_LT. The results include *_Prediction.h5 and *_F1.csv.

3.3.3 Training and prediction using WCSN(binary transformation)

Training

! python src/BT_wcsn_classify_train.py

The optional parameters are mostly the same as those in wcsn_classify_train.py.

This step generates the 5-fold training set models using WCSN(binary transformation) and saves them in result/models_BT.

Testing

! python src/BT_wcsn_classify_test.py

The optional parameters are mostly the same as those in wcsn_classify_test.py.

This step generates the predicted results for the testing sets using WCSN(binary transformation) and saves them in result/preds_BT. The results include *_Prediction.h5 and *_F1.csv.
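The two edge-weight variants can be illustrated on a plain list of weights. The binary transformation follows the description of datasets_BT.py (every edge gets weight 1). The exact form of the logarithmic transformation is not given in this README, so log(1 + w) is used here purely as an assumption, and both function names are hypothetical.

```python
from math import log1p

def log_transform(weights):
    """Assumed log transform: log(1 + w) for each non-negative edge weight."""
    return [log1p(w) for w in weights]

def binary_transform(weights):
    """Binary transform: every existing edge gets weight 1."""
    return [1.0 for _ in weights]

edge_weights = [0.0, 2.5, 7.0]
print(log_transform(edge_weights))
print(binary_transform(edge_weights))
```

The log variant compresses large weights while preserving their order; the binary variant discards interaction strength and keeps only the network topology.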

4. Figures in this study

All plotting code is in src/Figures/.

  • Figure 2

    src/Figures/Figure2.ipynb

    Sankey diagram of the different datasets under WCSGNet's 5-fold cross-validation.

  • Figure 3(A) and (B)

    src/Figures/Figure3AB.ipynb

    Performance of WCSGNet with different edge weight representation methods, including the original method, logarithmic transformation, and binary transformation.

  • Figure 4(A-N)

    BaronHuman_analysis.ipynb

    src/Figures/R/Figure4A-M.R

    Top degree gene analysis of WCSN for different cell types on the Baron Human dataset

  • Figure 5(A-N)

    BaronHuman_analysis.ipynb

    src/Figures/R/Figure5A-M.R

    Top high-weight edges analysis of WCSNs for different cell types in the Baron Human dataset.

  • Figure 6(A-N)

    src/Figures/Figure6.py

    t-SNE visualization and feature analysis of the Baron Human dataset using WCSGNet.

  • Figure 7(A-H)

    src/Figures/AMB_analysis.ipynb

    src/Figures/R/Figure7A-D.R

    src/Figures/R/Figure7E-H.R

    Analysis of top degree genes and high-weight edges in WCSNs for different cell types on the AMB dataset.

  • Figure 8(A) and (B)

    src/Figures/Figure7AB.ipynb

    Mean F1-score and accuracy comparison of WCSGNet on Muraro and Baron Mouse datasets curated with varying numbers of HVGs.

5. Repeatability

The following factors may result in slight differences in the Mean F1-score and Accuracy for cell type classification when reproducing the results, compared to those reported in the paper.

  1. The DataLoader applies a shuffle operation on the training dataset during model training, leading to some randomness in the input sequence of the training data.
  2. The use of the Dropout mechanism in the model introduces variability in the trained models across different runs.
  3. Parameter initialization also produces some randomness.

However, these differences do not affect the conclusions of the paper.
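If run-to-run variation from the three sources above is a concern, fixing the random seeds before training usually reduces it. The helper below is a common pattern rather than something this README says the repository provides; the name seed_everything is hypothetical, and it will not necessarily reproduce the numbers reported in the paper, since some GPU operations remain non-deterministic.

```python
import random

def seed_everything(seed=0):
    """Seed the random number generators that affect training, where installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)           # parameter init, dropout, DataLoader shuffling
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    except ImportError:
        pass  # PyTorch not installed; nothing to seed
```

Calling seed_everything once at the top of a training script, before the model and DataLoader are created, is the usual placement.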
