This is my Kaggle notebook for the QSAR Androgen Receptor Data Set.
The QSAR Androgen Receptor Data Set contains 199 binder (positive) and 1488 non-binder (negative) molecules (data instances). The dataset is therefore imbalanced, with a negative-to-positive ratio of about 7.5. Here, I use shallow binary classifiers and evaluate their performance on two configurations of the dataset: (Configuration I) stratified 5-fold cross-validation, as mentioned in the original paper, and (Configuration II) a balanced dataset obtained by applying Butina clustering to the molecules.
This is a binary classification problem: we need to predict whether a given molecule (data instance) is a binder to the Androgen Receptor or not.
Each data instance corresponds to a molecule. A molecule is represented by a molecular fingerprint of 1024 bits, stored in the first 1024 columns of its row. The last column of each row holds the target label: binder/positive or non-binder/negative.
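The row layout described above can be illustrated with a minimal sketch. It assumes comma-separated rows and the label strings `positive`/`negative`; the helper names `parse_rows` and `encode_labels` are hypothetical, and the fingerprints are shortened to 8 bits for readability (the real ones have 1024).

```python
import csv
import io

N_BITS = 8  # assumption for readability; the real fingerprints have 1024 bits

# Two made-up rows: N_BITS fingerprint bits followed by the label column.
raw = "0,1,0,0,1,1,0,1,positive\n1,0,0,0,0,1,0,0,negative\n"


def parse_rows(text, n_bits):
    """Split each row into its fingerprint bits and its label (hypothetical helper)."""
    fingerprints, labels = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        if len(row) != n_bits + 1:
            raise ValueError(f"expected {n_bits + 1} columns, got {len(row)}")
        fingerprints.append([int(bit) for bit in row[:n_bits]])
        labels.append(row[-1].strip())
    return fingerprints, labels


def encode_labels(labels):
    """Encode the textual labels as a bit: negative -> 0, positive -> 1."""
    mapping = {"negative": 0, "positive": 1}
    return [mapping[label] for label in labels]


fingerprints, labels = parse_rows(raw, N_BITS)
y = encode_labels(labels)
```

Keeping the fingerprints and the encoded labels in separate lists matches the separation used in both configurations.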
Configuration I is the setup where the training, validation, and test splits are generated so that each split keeps the same percentages of the two classes (binders/positive and non-binders/negative) as the full dataset. Because the dataset is imbalanced to begin with, the same imbalance between the two classes is carried over into every split.
A. Data is split into a training dataset and an independent test dataset. The independent test dataset is used only for final testing, while 5-fold cross-validation is applied to the training dataset.
B. Since the attributes (fingerprints) are in the first 1024 columns of a row (data instance/molecule) and the last column indicates the label (positive or negative), these are separated and stored in different lists.
C. The labels indicated as positive or negative are then encoded by a bit (0 or 1).
D. A Support Vector Machine (SVM) is employed as a shallow classifier. The hyperparameter values of the SVM are tuned using 5-fold cross-validation and grid search over a selected set of hyperparameters.
E. The SVM-based classifier model is then evaluated in terms of a few metrics by using 5-fold cross-validation.
F. An SVM-based classifier model is subsequently obtained using the tuned hyperparameter values and the whole training dataset. The classifier model is used to get predictions for the independent test dataset. The performance of the classifier model is evaluated both on the training dataset and on the independent test dataset.
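Steps A to F can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the notebook's exact code: the data is synthetic (400 random fingerprints with about 12% positives), the hyperparameter grid is a hypothetical example, and the metrics and the 80/20 hold-out fraction are choices made here.

```python
import numpy as np
from sklearn.model_selection import (
    GridSearchCV,
    StratifiedKFold,
    cross_validate,
    train_test_split,
)
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption): 1024-bit fingerprints,
# 48 positives and 352 negatives to mimic the class imbalance.
rng = np.random.default_rng(0)
labels = np.array(["positive"] * 48 + ["negative"] * 352)
X = rng.integers(0, 2, size=(labels.size, 1024))
X[labels == "positive", :64] = 1  # make some bits informative

# Step C: encode the labels as a bit (negative -> 0, positive -> 1).
y = (labels == "positive").astype(int)

# Step A: stratified hold-out split; the test set is not touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step D: grid search with stratified 5-fold cross-validation.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"C": [1, 10], "kernel": ["linear", "rbf"]}  # hypothetical grid
search = GridSearchCV(SVC(), param_grid, cv=cv, scoring="f1")
search.fit(X_train, y_train)

# Step E: cross-validated metrics for the tuned hyperparameters.
cv_scores = cross_validate(
    SVC(**search.best_params_),
    X_train,
    y_train,
    cv=cv,
    scoring=["f1", "balanced_accuracy", "roc_auc"],
)

# Step F: GridSearchCV refits the best model on the whole training set by default.
model = search.best_estimator_
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
```

Because the classes are imbalanced, plain accuracy can look good for a classifier that always predicts the majority class, which is why the sketch also scores F1, balanced accuracy, and ROC AUC.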
Configuration II is the setup where the training, validation, and test splits are generated so that each split keeps the same number of data instances of the two classes (binders/positive and non-binders/negative). Because the dataset is balanced beforehand using Butina clustering, this balance between the two classes is preserved in the splits as well.
A. The RDKit package (rdkit) is used for the Butina clustering algorithm. rdkit and some other packages are imported.
B. The Butina clustering algorithm is implemented as a function that takes the molecules' fingerprints directly as input. The algorithm uses the Tanimoto similarity between pairs of fingerprints.
C. The binders/positive and non-binders/negative compounds are grouped separately.
D. Preprocessing is applied to the binder/positive and non-binder/negative data instances.
E. The Butina clustering algorithm is applied separately to the positive instance group and the negative instance group. The labels indicated as positive or negative are then encoded by a bit (0 or 1).
F. Cluster centers and their corresponding molecules' fingerprints are stored.
G. Data instances belonging to the positive set are split into a training dataset and a test dataset. The same procedure is applied to the data instances belonging to the negative set. From these, the final training dataset and final test dataset are formed.
H. Output labels (0 and 1) are associated.
I. The SVM-based classifier model is evaluated in terms of a few metrics by using 5-fold cross-validation.
J. The SVM-based classifier model is subsequently obtained using the whole training dataset. The classifier model is used to get predictions for the independent test dataset. The performance of the classifier model is evaluated both on the training dataset and on the independent test dataset.
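The clustering and per-class splitting steps above can be sketched in plain Python. This is a simplified, standalone re-implementation of Butina-style clustering for illustration, not the RDKit routine the notebook relies on. Fingerprints are represented as sets of on-bit indices, the distance cutoff of 0.6 is an arbitrary example value, and the helper names `tanimoto`, `butina_clusters`, and `split_class` are hypothetical.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def butina_clusters(fps, cutoff):
    """Group fingerprints whose Tanimoto distance to a cluster center is <= cutoff.

    Returns a list of clusters; each cluster is a list of indices into fps
    with the cluster center first.
    """
    n = len(fps)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if 1.0 - tanimoto(fps[i], fps[j]) <= cutoff:
                neighbors[i].add(j)
                neighbors[j].add(i)
    # Visit molecules with the most neighbors first (ties broken by index).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned, clusters = set(), []
    for center in order:
        if center in assigned:
            continue
        members = [center] + sorted(neighbors[center] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


def split_class(items, test_fraction, seed):
    """Shuffle one class and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Steps E-F on toy data: cluster, then keep the cluster centers.
fps = [{0, 1, 2, 3}, {0, 1, 2, 4}, {0, 1, 2, 3, 4}, {10, 11, 12}, {10, 11, 13}, {20}]
clusters = butina_clusters(fps, cutoff=0.6)
centers = [cluster[0] for cluster in clusters]

# Steps G-H on toy data: split each class separately, then attach labels.
positives = [{0, 1}, {0, 2}, {1, 2}, {0, 1, 2}]
negatives = [{5, 6}, {5, 7}, {6, 7}, {5, 6, 7}]
pos_train, pos_test = split_class(positives, test_fraction=0.25, seed=0)
neg_train, neg_test = split_class(negatives, test_fraction=0.25, seed=0)
X_train, y_train = pos_train + neg_train, [1] * len(pos_train) + [0] * len(neg_train)
X_test, y_test = pos_test + neg_test, [1] * len(pos_test) + [0] * len(neg_test)
```

Splitting each class on its own and then concatenating keeps the class counts equal in both the training and the test set whenever the two classes start with the same number of instances.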