This study is inspired by the DeepChem tutorial "Transfer Learning with ChemBERTa Transformers" [1], in which a ChemBERTa model pre-trained on 77M SMILES from PubChem [2] was fine-tuned on a molecular toxicity prediction task.
In this project, a newer version of the model, ChemBERTa-2 [3-5], is fine-tuned to predict inhibition of HIV replication (Fig. 1) using the MoleculeNet HIV dataset [6]. Specifically, the influence of the pre-training objective on downstream performance after fine-tuning is investigated. The model pre-trained with masked-language modeling (MLM) achieved better performance (AUROC 0.793) than the model pre-trained with multi-task regression (MTR) (AUROC 0.733) (Fig. 2). The shift in the distributions of molecular embeddings before and after fine-tuning (Fig. 3) highlights the models' improved capacity to distinguish compounds that are active against HIV replication from inactive ones.
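The fine-tuning setup can be sketched as follows, assuming the HuggingFace Transformers Trainer API and DeepChem's MoleculeNet loader; the hyperparameters (epochs, batch size, learning rate, maximum token length) are illustrative and not necessarily those used in this study.

```python
# Minimal fine-tuning sketch: ChemBERTa-2 on the MoleculeNet HIV dataset.
# Swap MODEL_NAME for "DeepChem/ChemBERTa-77M-MTR" to compare pre-training objectives.
import numpy as np
import torch
import deepchem as dc
from sklearn.metrics import roc_auc_score
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "DeepChem/ChemBERTa-77M-MLM"

# MoleculeNet HIV dataset; .ids holds the SMILES strings, .y the binary labels
tasks, (train, valid, test), _ = dc.molnet.load_hiv(splitter="scaffold")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

class SmilesDataset(torch.utils.data.Dataset):
    """Wraps a DeepChem split as a tokenized HuggingFace-compatible dataset."""
    def __init__(self, dc_split):
        self.enc = tokenizer(list(dc_split.ids), truncation=True,
                             padding="max_length", max_length=256)
        self.labels = dc_split.y[:, 0].astype(int)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

def compute_metrics(eval_pred):
    # AUROC on the probability of the positive (active) class
    logits, labels = eval_pred
    probs = torch.softmax(torch.tensor(logits), dim=-1)[:, 1].numpy()
    return {"auroc": roc_auc_score(labels, probs)}

args = TrainingArguments(output_dir="chemberta_hiv",  # illustrative output path
                         num_train_epochs=3,
                         per_device_train_batch_size=32,
                         learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=SmilesDataset(train),
                  compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate(SmilesDataset(valid)))  # validation AUROC
print(trainer.evaluate(SmilesDataset(test)))   # test-set AUROC
```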
Fig. 1) Study design
Fig. 2) Models' performance
Fig. 3) Latent representations of the test-set molecule embeddings for the MLM and MTR models prior to and after fine-tuning
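A sketch of the embedding analysis behind Fig. 3 is given below. It assumes mean-pooled last-hidden-state embeddings and a PCA projection; the actual pooling strategy and projection method used for the figure are not specified here, and the fine-tuned checkpoint path is hypothetical.

```python
# Extract molecule embeddings before and after fine-tuning and project them to 2-D.
import numpy as np
import torch
import deepchem as dc
from sklearn.decomposition import PCA
from transformers import AutoTokenizer, AutoModel

def embed(model_name_or_path, smiles, batch_size=64):
    """Return one embedding per SMILES: mean of token embeddings, masked by attention."""
    tok = AutoTokenizer.from_pretrained(model_name_or_path)
    encoder = AutoModel.from_pretrained(model_name_or_path).eval()
    vecs = []
    with torch.no_grad():
        for i in range(0, len(smiles), batch_size):
            batch = tok(smiles[i:i + batch_size], truncation=True,
                        padding=True, return_tensors="pt")
            hidden = encoder(**batch).last_hidden_state          # (B, T, H)
            mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
            vecs.append(((hidden * mask).sum(1) / mask.sum(1)).numpy())
    return np.vstack(vecs)

# Test split of the MoleculeNet HIV dataset (SMILES in .ids, labels in .y)
_, (_, _, test), _ = dc.molnet.load_hiv(splitter="scaffold")
smiles, labels = list(test.ids), test.y[:, 0]

before = embed("DeepChem/ChemBERTa-77M-MLM", smiles)     # pre-trained encoder
after = embed("chemberta_hiv/checkpoint-best", smiles)   # hypothetical fine-tuned checkpoint path
coords = PCA(n_components=2).fit_transform(np.vstack([before, after]))
# Scatter-plot the first and second halves of `coords`, coloured by `labels`,
# to compare the embedding distributions prior to and after fine-tuning (cf. Fig. 3).
```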
References:
[1] DeepChem, Transfer Learning with ChemBERTa Transformers (tutorial)
[2] S. Chithrananda et al., ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction
[3] W. Ahmad et al., ChemBERTa-2: Towards Chemical Foundation Models
[4] HuggingFace, DeepChem: ChemBERTa-77M-MLM
[5] HuggingFace, DeepChem: ChemBERTa-77M-MTR
[6] MoleculeNet Dataset