PlantBind: An attention-based multi-label neural network for predicting transcription factor binding sites in plants
Identification of transcription factor binding sites (TFBSs) is essential for the analysis of gene regulation. Here, we present PlantBind, a method for the integrated prediction and interpretation of TFBS events from DNA sequences and DNA shape profiles. Built upon an attention-based multi-label deep learning framework, PlantBind simultaneously predicts the binding sites of 315 TFs. As shown in Figure 1, the hybrid model consists of a data processing module, an embedding module, a feature extraction module, and a multi-label output module.
Fig. 1 The model workflow
The model provides researchers with tools to:
- Identify the TFBSs of transcription factors.
- Identify the DNA-binding motifs of transcription factors.
These are a work in progress, so forgive incompleteness for the moment. If there's a task that you're interested in that I haven't included, feel free to post it as an Issue at the top.
We recommend that you use conda to install all of the following software.
Software list:
- python v3.8.12
- pytorch v1.10.2
- numpy v1.16.4
- pandas v1.4.1
- scikit-learn (sklearn) v1.0.2
- scipy v1.5.3
- matplotlib v3.5.1
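After installing, it can help to confirm that the environment matches the list above. The following is a minimal sketch, not part of PlantBind: the function name check_versions is hypothetical, and the distribution names (for example "torch" for pytorch and "scikit-learn" for sklearn) are assumed to be the usual package names.

```python
from importlib import metadata

# Versions copied from the software list above. The distribution names
# are the usual package names and are an assumption of this sketch.
EXPECTED = {
    "torch": "1.10.2",
    "numpy": "1.16.4",
    "pandas": "1.4.1",
    "scikit-learn": "1.0.2",
    "scipy": "1.5.3",
    "matplotlib": "3.5.1",
}


def check_versions(expected=EXPECTED):
    """Return {package: (expected, installed or None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Local build tags such as "1.10.2+cu113" still count as a match.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in check_versions().items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

An empty result means every listed package is installed at the listed version.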
This part first describes the data used by the model, then the data formats for training and predicting, and finally how to create a dataset that meets the model's requirements for prediction.
All data is in the data directory:
- Ath-TF-peaks: the TFBS peak info of 315 Arabidopsis thaliana (Ath) TFs, and one neg.bed file
- Maize-TF-peaks: the TFBS peak info of 4 maize TFs, used for trans-species prediction
- model: the saved model file, which can be loaded to predict on new datasets
For training, the data consists of three files: (1) DNA sequence file; (2) DNA shape file; (3) data label file.
For predicting, the data consists of two files: (1) DNA sequence file; (2) DNA shape file.
Next, we will mainly introduce how to create the files mentioned above.
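One step that usually comes before building such files is turning peak intervals into fixed-length windows, since the commands in this README use --length 101. The sketch below only illustrates that idea under stated assumptions: that windows are centred on the peak midpoint, and that the peak files follow the standard BED layout (chrom, start, end). Neither detail is confirmed here, and the helper names center_window and windows_from_bed are hypothetical. Extracting the sequences and computing DNA shape values are not covered.

```python
def center_window(start, end, length=101):
    """Return (start, end) of a fixed-length window centred on a peak.

    Coordinates follow the BED convention: 0-based start, end exclusive.
    """
    mid = (start + end) // 2
    new_start = max(0, mid - length // 2)  # clamp at the chromosome start
    return new_start, new_start + length


def windows_from_bed(lines, length=101):
    """Yield (chrom, start, end) windows from BED-formatted text lines."""
    for line in lines:
        # Skip blank lines and the optional BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        yield (chrom, *center_window(int(start), int(end), length))


# Hypothetical peaks, for illustration only.
example = ["Chr1\t1000\t1200\tpeak_1", "Chr2\t10\t30\tpeak_2"]
print(list(windows_from_bed(example)))
```

This sketch does not clamp windows at the chromosome end, which would require chromosome lengths.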
- Training
Input: train_sequence.table, train_DNA_shape.npy, train_label.txt.
All data files need to be placed in the same folder before starting training, such as data_folder.
Output: trained_model_101_seqs.pkl
python -m torch.distributed.launch --nproc_per_node=8 \
main.py --inputs data_folder/ --length 101 \
--OHEM True --focal_loss True \
--batch_size 1024 --lr 0.01 \
--multi_GPU True
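The --focal_loss and --OHEM flags suggest focal loss and online hard example mining, two common remedies for class imbalance. How main.py implements them is not shown here, so the following is only a plain-Python sketch of the standard definitions: the focal loss of Lin et al. with its usual defaults (gamma = 2, alpha = 0.25), and a hard-example average that keeps the largest losses in a batch. The function names and the keep_ratio parameter are hypothetical.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p          # probability given to the true class
    a_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of already well-classified samples.
    return -a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


def ohem_mean(losses, keep_ratio=0.5):
    """Average only the hardest (largest-loss) fraction of a batch."""
    k = max(1, int(len(losses) * keep_ratio))
    hardest = sorted(losses, reverse=True)[:k]
    return sum(hardest) / k


# Toy batch: one TF label for four sequences.
probs = [0.9, 0.6, 0.2, 0.05]
labels = [1, 1, 0, 0]
losses = [focal_loss(p, y) for p, y in zip(probs, labels)]
print(ohem_mean(losses))
```

With gamma = 0 and alpha = 0.5 the focal loss reduces to half the ordinary binary cross-entropy, which is a quick way to sanity-check an implementation.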
- Testing
Input: test_sequence.table, test_DNA_shape.npy, test_label.txt.
All data files need to be placed in the same folder before starting testing, such as data_folder.
Model: trained_model_101_seqs.pkl
Output: metrics_b.json, metrics_m.json, Gmean_threshold_b.txt, Gmean_threshold_m.txt, precision_recall_curve_TFs_315.pdf, roc_prc_curve_TFs_315.pdf
python main.py --inputs data_folder/ --length 101 --mode test
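The Gmean_threshold_*.txt file names suggest that decision thresholds are chosen by maximising the geometric mean of sensitivity and specificity. That reading is an assumption, and the meaning of the _b and _m suffixes is not stated here. Under that assumption, threshold selection for a single TF can be sketched as follows; gmean_threshold is a hypothetical helper, not a function from this repository.

```python
import math


def gmean_threshold(scores, labels):
    """Return (threshold, g_mean) maximising sqrt(TPR * TNR).

    A sample is called positive when its score is >= threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    best_t, best_g = None, -1.0
    # Every distinct score is a candidate threshold.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        g = math.sqrt((tp / pos) * (tn / neg))
        if g > best_g:  # ties keep the lowest threshold
            best_t, best_g = t, g
    return best_t, best_g


# Toy scores for one TF, for illustration only.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(gmean_threshold(scores, labels))
```

In a multi-label setting the same search would be repeated per TF, giving one threshold for each of the 315 labels.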
- Predicting
Input: sequence.table, DNA_shape.npy
Output: accuracy
To run the analysis, use the inference.py script and modify the relevant parameters.
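In a multi-label model, each TF gets its own independent score, so one sequence can be predicted as bound by several TFs at once. The sketch below shows how per-TF scores could be turned into a list of predicted TFs. It assumes the model emits one logit per TF and that a threshold is available for each TF; whether inference.py works this way is not stated here, and call_bound_tfs and the TF names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def call_bound_tfs(logits, thresholds, tf_names):
    """Return names of TFs whose sigmoid score reaches its own threshold."""
    if not (len(logits) == len(thresholds) == len(tf_names)):
        raise ValueError("logits, thresholds and tf_names must align")
    return [
        name
        for z, t, name in zip(logits, thresholds, tf_names)
        if sigmoid(z) >= t
    ]


# Three placeholder TFs instead of the full 315, for illustration only.
tf_names = ["TF_A", "TF_B", "TF_C"]
print(call_bound_tfs([2.0, -1.0, 0.0], [0.5, 0.5, 0.6], tf_names))
```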