DEEPScreen: Virtual Screening with Deep Convolutional Neural Networks Using Compound Images (with Test Scripts)

M. Volkan Atalay

DEEPScreen is a large-scale drug-target interaction (DTI) prediction system for early-stage drug discovery that uses deep convolutional neural networks.

One of the main advantages of DEEPScreen is that it takes readily available 2-D structural representations of compounds as input instead of conventional descriptors. DEEPScreen learns complex features directly from these 2-D representations, which allows it to produce highly accurate predictions for virtual screening. DEEPScreen was developed using the PyTorch framework.

More information can be obtained from the DEEPScreen journal article.

What is new?

In the original DEEPScreen code, the model is trained and tested using a single input file: all training and test data are stored in the same file, so training has to be repeated every time the script is executed.

This new version allows tests (prediction/virtual screening) to be performed separately from training, using an already trained model.

Here, I explain the newly added functionality and functions.

General Information

DEEPScreen is a command-line prediction tool written in Python. The original repository came with a bundle of data and code, which I recently extended to include separate testing of a trained model. Here is the current directory structure:

  • bin: source code including original and new script files (main_test.py and test_DEEPScreen.py)
  • test_files: input test file(s) (newly added directory)
  • training_files: files used for training and testing
  • result_files: results of various tests/analyses
  • trained_models: already trained models

Training and Model Generation

Training is explained in the original repository under the title "How to train DEEPScreen models and get performance results".

Note that after training, the trained model is stored (serialized) in a file named

<targetid>_best_val-<targetid>-<hyperparameters separated by dashes>-<experiment_name>-state_dict.pth

under trained_models/<experiment_name>/

The following is an example call to the main_training.py script that trains a model with CHEMBL210 as the target protein.

python main_training.py --targetid CHEMBL210 --model CNNModel1 --fc1 256 --fc2 128 --lr 0.01 --bs 64 --dropout 0.25 --epoch 100 --en my_chembl210_training

This command generates a file (trained_models/my_chembl210_training/CHEMBL210_best_val-CHEMBL210-CNNModel1-256-128-0.01-64-0.25-100-my_chembl210_training-state_dict.pth) that contains a serialized PyTorch state dictionary. A state dictionary is a Python dictionary that holds the state of a PyTorch model: it maps each layer's name to its learnable parameters (weights and biases) and persistent buffers.
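
Because the file name follows a fixed pattern, the path to a trained model can be reconstructed from the training arguments. The following is a minimal sketch rather than code from the repository: the helper name `model_file_path` is hypothetical, and the order of the hyperparameters is inferred from the example command and file name above.

```python
from pathlib import PurePosixPath


def model_file_path(targetid, model, fc1, fc2, lr, bs, dropout, epoch, en,
                    root="trained_models"):
    """Build the expected state-dict path for a training run.

    Hypothetical helper: the hyperparameter order is inferred from the
    example file name in this README, not taken from the DEEPScreen code.
    """
    hyperparameters = "-".join(
        str(value) for value in (model, fc1, fc2, lr, bs, dropout, epoch)
    )
    file_name = f"{targetid}_best_val-{targetid}-{hyperparameters}-{en}-state_dict.pth"
    return str(PurePosixPath(root) / en / file_name)


path = model_file_path("CHEMBL210", "CNNModel1", 256, 128, 0.01, 64, 0.25, 100,
                       "my_chembl210_training")
print(path)
```

Very small floating-point values may be formatted differently by Python than they were typed on the command line, so passing the hyperparameters as strings is the safer choice when the exact file name matters.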

Restoring the Trained Model and Tests

How to test DEEPScreen model and get predictions

  1. Clone this Git repository
  2. Download the compressed file for the chemical representations of compounds in ChEMBLv32 from here
  3. Move the compressed file under test_files/ and unzip it
  4. Prepare a file containing ChEMBL identifiers of compounds to be tested as explained below
  5. Run the main_test.py script as shown below

By executing main_test.py, the model for a target protein is restored and it can be used to screen (test or make a prediction for) a compound or a list of compounds.

main_test.py calls the test_DEEPScreen function, which first parses the input test file and generates 2-D images of the compounds listed in it. The trained model is then restored, and the predictions for the test compounds are obtained.

The following is an example call to the main_test.py script that tests compounds against CHEMBL210, using the model generated by the example main_training.py call above.

python main_test.py --targetid CHEMBL210 --modelfile DEEPScreen/trained_models/my_chembl210_training/CHEMBL210_best_val-CHEMBL210-CNNModel1-256-128-0.01-64-0.25-100-my_chembl210_training-state_dict.pth --testfile CHEMBL210_compounds.tsv

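The parameters are as follows:

  • --targetid: ChEMBL identifier of the target protein to test against (the target the model was trained for)
  • --modelfile: path to the trained model file
  • --testfile: file listing the compounds/drugs to be tested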
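
The three options can be pictured as a small argparse interface. This is only a sketch under assumptions: how main_test.py actually declares its arguments (for example, whether they are required) is not shown in this README, and the shortened model path below is a placeholder.

```python
import argparse


def build_parser():
    """Sketch of a parser for the three documented options (assumed layout)."""
    parser = argparse.ArgumentParser(
        description="Screen compounds with a trained DEEPScreen model"
    )
    parser.add_argument("--targetid", required=True,
                        help="ChEMBL ID of the target protein, e.g. CHEMBL210")
    parser.add_argument("--modelfile", required=True,
                        help="path to the serialized state dictionary (.pth)")
    parser.add_argument("--testfile", required=True,
                        help="file under test_files/ listing the compounds")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args([
    "--targetid", "CHEMBL210",
    "--modelfile", "trained_models/my_chembl210_training/placeholder-state_dict.pth",
    "--testfile", "CHEMBL210_compounds.tsv",
])
print(args.targetid, args.testfile)
```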

The file containing the compounds/drugs to be tested should be placed under the test_files directory. Its format is as follows. One line starts with targetid_act, followed by a tab and a comma-separated list of the ChEMBL identifiers of the active compounds. The inactive compounds are given on a separate line that starts with targetid_inact, followed by a tab and a comma-separated list of their ChEMBL identifiers. If no activity information is known a priori, the compounds can be placed in either of the two lists; in that case, the performance evaluation scores should be ignored.

An example for CHEMBL210 is given below.

CHEMBL210_act	CHEMBL2111083,CHEMBL1084173,CHEMBL1095607,CHEMBL521589,CHEMBL3039518,CHEMBL1240967,CHEMBL1291,CHEMBL1290,CHEMBL471,CHEMBL27810,CHEMBL4297483,CHEMBL1095777,CHEMBL1002,CHEMBL926,CHEMBL605846,CHEMBL1363,CHEMBL1198857,CHEMBL649,CHEMBL1201295,CHEMBL714,CHEMBL1094785,CHEMBL776,CHEMBL160519,CHEMBL88055,CHEMBL1094966,CHEMBL2012520,CHEMBL546,CHEMBL1201237,CHEMBL1201273,CHEMBL83063,CHEMBL1760,CHEMBL49080,CHEMBL1197051,CHEMBL434394,CHEMBL768,CHEMBL27193,CHEMBL16476,CHEMBL1201213,CHEMBL500,CHEMBL32800,CHEMBL1263,CHEMBL499,CHEMBL1159717,CHEMBL321582,CHEMBL631,CHEMBL27,CHEMBL1940832,CHEMBL3039530,CHEMBL1256786
CHEMBL210_inact
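
To make the format concrete, the sketch below reads such a file into two lists. The function `parse_test_file` is a hypothetical helper written for this README; it follows the format described above but is not the parser used by test_DEEPScreen.

```python
def parse_test_file(text, targetid):
    """Return (actives, inactives) as lists of ChEMBL compound IDs.

    Hypothetical helper that follows the format described in this README;
    it is not the parser used by test_DEEPScreen.
    """
    compounds = {f"{targetid}_act": [], f"{targetid}_inact": []}
    for line in text.splitlines():
        if not line.strip():
            continue
        # The label and the ID list are separated by a tab; the list may be empty.
        label, _, id_list = line.partition("\t")
        label = label.strip()
        if label not in compounds:
            raise ValueError(f"unexpected line label: {label!r}")
        compounds[label] = [c.strip() for c in id_list.split(",") if c.strip()]
    return compounds[f"{targetid}_act"], compounds[f"{targetid}_inact"]


# A shortened version of the CHEMBL210 example, with an empty inactive list.
example = "CHEMBL210_act\tCHEMBL2111083,CHEMBL1084173,CHEMBL1095607\nCHEMBL210_inact\n"
actives, inactives = parse_test_file(example, "CHEMBL210")
print(actives)
print(inactives)
```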

The output is displayed on the screen. First, the values of the performance measures are displayed; then the compounds are listed with their actual labels (if available) and predicted activities (1 for active, 0 for inactive).
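
This README does not name the performance measures that are reported. As an illustration only, the sketch below shows how actual and predicted 0/1 labels combine into confusion counts and three common scores; the choice of accuracy, precision, and recall is an assumption, and `classification_scores` is a hypothetical helper.

```python
def classification_scores(actual, predicted):
    """Compute confusion counts and three common scores from 0/1 labels.

    Illustrative only: the measures printed by DEEPScreen may differ.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)

    def ratio(numerator, denominator):
        # Return None instead of dividing by zero (e.g. no predicted actives).
        return numerator / denominator if denominator else None

    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


actual = [1, 1, 1, 0, 0]       # known labels: 1 = active, 0 = inactive
predicted = [1, 0, 1, 0, 1]    # model output for the same compounds
scores = classification_scores(actual, predicted)
print(scores)
```

When the activity labels in the test file are placeholders, these scores carry no meaning and only the predicted column is of interest.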

Articles

If you use DEEPScreen, please consider citing:

Rifaioglu, A. S., Nalbat, E., Atalay, V., Martin, M. J., Cetin-Atalay, R., & Doğan, T. (2020). DEEPScreen: high performance drug–target interaction prediction with convolutional neural networks using 2-D structural compound representations. Chemical Science, 11(9), 2531-2557.

There is also an article on Medium entitled "A Deep Learning-based Tutorial for the Early Stages of Drug Discovery".

License

DEEPScreenWithTest Copyright (C) 2023 CanSyL and M Volkan Atalay

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.