This repository contains the source code of scMeformer for single-cell DNA methylation (DNAm) data imputation.
The scMeformer package requires only a standard computer with GPUs and enough RAM to support the in-memory operations.
This package is supported on Linux and has been tested on Rocky Linux 9.2.
scMeformer mainly depends on the following Python packages:
PyTorch
apex
numpy
scipy
scikit-learn
pandas
loompy
json (part of the Python standard library)
h5py
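Before running the pipeline, it can help to confirm that the packages listed above are importable. The snippet below is a minimal sketch, not part of scMeformer: the function name `find_missing` is hypothetical, and the import names are assumed from common usage (PyTorch is imported as `torch`, scikit-learn as `sklearn`).

```python
import importlib.util

# Import names for the packages listed above (assumed from common usage):
# PyTorch is imported as "torch" and scikit-learn as "sklearn".
REQUIRED_MODULES = ["torch", "apex", "numpy", "scipy", "sklearn",
                    "pandas", "loompy", "json", "h5py"]


def find_missing(modules):
    """Return the names in `modules` that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed dependencies are importable.")
```

The check only verifies that each module can be located; it does not verify versions or GPU support.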
Cluster all cells for a brain sample (Br1092)
$ python clustering.py Br1092
Construct training data and calculate cluster features for a brain sample (Br1092)
$ python run_feature.py Br1092
2.1. Train the DNAm prediction model using single-cell data. We trained a separate prediction model for each brain sample.
Train the DNAm prediction model for one brain sample (Br1092) using four GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch main.py transformer single_cell_regression \
--exp_name single_cell_regression \
--learning_rate 0.000176 \
--batch_size 128 \
--data_dir ./datasets/Schizo_Control/Br1092 \
--output_dir ./outputs/Schizo_Control/Br1092 \
--warmup_steps 10000 \
--gradient_accumulation_steps 1 \
--fp16 --local_rank 0 \
--nproc_per_node 4 \
--model_config_file ./config/config.json
"num_features" and "task_size" in "./config/schizo_control/Br1092/config.json" denote the number of clusters and the number of cells, respectively, in the brain sample (Br1092).
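Since these two values differ between samples, the config file has to be adjusted for each one. The following is a minimal sketch of doing that programmatically; the helper name `set_sample_sizes` is hypothetical, the counts used in the demonstration are placeholders rather than real values for Br1092, and any other keys in the config are assumed to be left as they are.

```python
import json
import os
import tempfile


def set_sample_sizes(config_path, num_clusters, num_cells):
    """Set "num_features" and "task_size" in a JSON config, keeping other keys."""
    with open(config_path) as handle:
        config = json.load(handle)
    config["num_features"] = num_clusters  # number of cell clusters
    config["task_size"] = num_cells        # number of cells in the sample
    with open(config_path, "w") as handle:
        json.dump(config, handle, indent=2)
    return config


if __name__ == "__main__":
    # Demonstrate on a throwaway file; the numbers are placeholders,
    # not the real counts for Br1092.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "config.json")
        with open(path, "w") as handle:
            json.dump({"some_other_key": 1}, handle)
        print(set_sample_sizes(path, num_clusters=10, num_cells=500))
```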
Predict DNAm levels of CpGs for all cells in a brain sample (Br1092) using the trained model
CUDA_VISIBLE_DEVICES=0 python3 main.py transformer single_cell_prediction \
--exp_name single_cell_prediction \
--batch_size 1024 \
--learning_rate 0.000176 \
--fp16 \
--warmup_steps 10000 \
--gradient_accumulation_steps 1 \
--data_dir ./datasets/genome_cpg \
--output_dir ./outputs/prediction/schizo_control/Br1092/chr1 \
--num_train_epochs 500 \
--model_config_file ./config/config.json \
--from_pretrained ./outputs/Schizo_Control/Br1092 \
--split chr1
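The command above predicts a single chromosome (chr1). To cover several chromosomes, the same command can be generated in a loop. The sketch below only assembles and prints the command lines, reusing the flags shown above; the helper `build_prediction_command` is hypothetical, and it assumes that other chromosomes are available under the same data directory and are selected through `--split` in the same way as chr1.

```python
import shlex


def build_prediction_command(sample, chromosome):
    """Return the argument list for predicting one chromosome of one sample."""
    return [
        "python3", "main.py", "transformer", "single_cell_prediction",
        "--exp_name", "single_cell_prediction",
        "--batch_size", "1024",
        "--learning_rate", "0.000176",
        "--fp16",
        "--warmup_steps", "10000",
        "--gradient_accumulation_steps", "1",
        "--data_dir", "./datasets/genome_cpg",
        "--output_dir",
        f"./outputs/prediction/schizo_control/{sample}/{chromosome}",
        "--num_train_epochs", "500",
        "--model_config_file", "./config/config.json",
        "--from_pretrained", f"./outputs/Schizo_Control/{sample}",
        "--split", chromosome,
    ]


if __name__ == "__main__":
    # Assumption: autosomes are named chr1 ... chr22, as chr1 is above.
    for number in range(1, 23):
        command = build_prediction_command("Br1092", f"chr{number}")
        print(" ".join(shlex.quote(arg) for arg in command))
```

Each printed line can be run with `CUDA_VISIBLE_DEVICES` set as in the command above, or passed to `subprocess.run` as a list.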
4.1.1 ./scMeformer/datasets/methylation_data/chr1.json contains 10,000 CpGs, and each CpG has five keys:
(1) chromosome
(2) position
(3) strand
(4) cells that are methylated at the CpG site
(5) cells that are unmethylated at the CpG site
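To illustrate how such a record could be used, the sketch below turns one CpG entry into a per-cell label vector. The key names (`methylated_cells`, `unmethylated_cells`, and so on) and the use of integer cell indices are assumptions made for illustration; the actual names in chr1.json may differ.

```python
def cell_labels(record, num_cells):
    """Build per-cell labels for one CpG.

    1 = methylated, 0 = unmethylated, None = cell not covered at this site.
    """
    labels = [None] * num_cells
    for cell in record["unmethylated_cells"]:
        labels[cell] = 0
    for cell in record["methylated_cells"]:
        labels[cell] = 1
    return labels


if __name__ == "__main__":
    # Hypothetical record with the five keys described above.
    example = {
        "chromosome": "chr1",
        "position": 10469,
        "strand": "+",
        "methylated_cells": [0, 3],
        "unmethylated_cells": [1],
    }
    print(cell_labels(example, num_cells=5))
```

Cells that appear in neither list are left as `None`, which corresponds to the missing values that imputation is meant to fill.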
4.1.2 ./scMeformer/datasets/feature_data/chr1.npy contains DNAm levels of 89 clusters for the above 10,000 CpGs
4.1.3 ./scMeformer/datasets/genome/chr1.npy contains one-hot encoded genome sequences covering the above 10,000 CpGs
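For readers unfamiliar with one-hot encoding of DNA, the sketch below shows the general idea. It is not taken from scMeformer: the channel order (A, C, G, T), the all-zero row for unknown bases, and the function name `one_hot_encode` are assumptions, and the layout stored in genome/chr1.npy may differ.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order


def one_hot_encode(sequence):
    """Encode a DNA string as a (length, 4) float32 array.

    Bases outside A/C/G/T (for example N) become all-zero rows.
    """
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(sequence), len(BASES)), dtype=np.float32)
    for row, base in enumerate(sequence.upper()):
        col = index.get(base)
        if col is not None:
            encoded[row, col] = 1.0
    return encoded


if __name__ == "__main__":
    print(one_hot_encode("ACGN"))
```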
4.1.4 ./scMeformer/datasets/position/chr1.npy maps each CpG, by its genomic position, to its index in ./scMeformer/datasets/feature_data/chr1.npy.
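A position-to-index lookup of this kind can be sketched as follows. This assumes, purely for illustration, that the position file holds a sorted one-dimensional array of CpG positions whose order matches the rows of the feature matrix; the actual layout of position/chr1.npy is not documented here and may differ. The function name `rows_for_positions` is hypothetical.

```python
import numpy as np


def rows_for_positions(sorted_positions, query_positions):
    """Return the feature-row index of each query position, or -1 if absent.

    Assumes `sorted_positions` is non-empty and sorted in ascending order.
    """
    sorted_positions = np.asarray(sorted_positions)
    query_positions = np.asarray(query_positions)
    # Binary search for the insertion point of each query.
    idx = np.searchsorted(sorted_positions, query_positions)
    # Clip so that queries beyond the last position can still be compared.
    idx = np.minimum(idx, len(sorted_positions) - 1)
    found = sorted_positions[idx] == query_positions
    return np.where(found, idx, -1)


if __name__ == "__main__":
    positions = np.array([100, 250, 900])   # toy positions, one per feature row
    features = np.arange(6).reshape(3, 2)   # toy (CpGs x clusters) matrix
    rows = rows_for_positions(positions, [250, 300, 900])
    print(rows)
    print(features[rows[rows >= 0]])        # rows for positions that were found
```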
4.2 Validation and test data include CpGs on chromosomes 21 and 22, respectively. Each includes the same four files, in the same format as the training data.
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch main.py transformer single_cell_regression \
--exp_name single_cell_regression \
--learning_rate 0.000176 \
--batch_size 128 \
--data_dir ./datasets/ \
--output_dir ./scMeformer/outputs/demo_model \
--warmup_steps 10000 \
--gradient_accumulation_steps 1 \
--fp16 --local_rank 0 \
--nproc_per_node 4 \
--model_config_file ./config/config.json