
LLM_Fine_Tuning_Molecular_Properties

Fine-tuning of ChemBERTa-2 for HIV replication inhibition prediction.


This study is inspired by the DeepChem tutorial "Transfer Learning with ChemBERTa Transformers" [1], in which a ChemBERTa model pre-trained on 77M SMILES from PubChem [2] was fine-tuned on a molecular toxicity prediction task.

In this project, another version of the model, ChemBERTa-2 [3-5], is fine-tuned for HIV replication inhibition prediction (Fig. 1) using the MoleculeNet HIV dataset [6]. Specifically, the influence of the pre-training method on downstream performance after fine-tuning is investigated. The model pre-trained with masked language modeling (MLM) achieved better performance (AUROC 0.793) than the model pre-trained with multi-task regression (MTR) (AUROC 0.733) (Fig. 2). The shift in the distributions of molecular embeddings before and after fine-tuning (Fig. 3) highlights the improved capacity of the fine-tuned models to distinguish between HIV-active and HIV-inactive molecules.
For a detailed write-up, see: S. Nowakowska, ChemBERTa-2: Fine-Tuning for Molecule's HIV Replication Inhibition Prediction
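
As a rough illustration of the workflow, the sketch below fine-tunes one of the ChemBERTa-2 checkpoints on the MoleculeNet HIV data with the Hugging Face Trainer and reports AUROC on a held-out split. It is a minimal sketch rather than the exact training setup of this project: the CSV column names ("smiles", "HIV_active"), the random stratified split, and all hyperparameters are assumptions, whereas the study itself relies on the MoleculeNet loaders and splits.

```python
# Minimal fine-tuning sketch (assumed: HIV.csv with "smiles" and "HIV_active"
# columns, random split, illustrative hyperparameters).
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

CHECKPOINT = "DeepChem/ChemBERTa-77M-MLM"   # swap for "DeepChem/ChemBERTa-77M-MTR"

df = pd.read_csv("HIV.csv")                 # MoleculeNet HIV dataset
train_df, test_df = train_test_split(df, test_size=0.2,
                                     stratify=df["HIV_active"], random_state=0)

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

class SmilesDataset(torch.utils.data.Dataset):
    """Tokenizes SMILES strings on the fly for binary classification."""
    def __init__(self, frame):
        self.smiles = frame["smiles"].tolist()
        self.labels = frame["HIV_active"].astype(int).tolist()
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        enc = tokenizer(self.smiles[idx], truncation=True, max_length=256,
                        padding="max_length", return_tensors="pt")
        item = {k: v.squeeze(0) for k, v in enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

def compute_metrics(eval_pred):
    # AUROC on the probability of the positive (HIV-active) class
    logits, labels = eval_pred
    probs = torch.softmax(torch.tensor(logits), dim=-1)[:, 1].numpy()
    return {"auroc": roc_auc_score(labels, probs)}

args = TrainingArguments(output_dir="chemberta-hiv", num_train_epochs=3,
                         per_device_train_batch_size=32, learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=SmilesDataset(train_df),
                  eval_dataset=SmilesDataset(test_df),
                  compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())   # reports eval_auroc on the held-out split
```

Running the same script with the MTR checkpoint gives the second model in the MLM vs. MTR comparison; only the `CHECKPOINT` string changes.
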


Fig. 1) Study design


Fig. 2) Models' performance


Fig. 3) Latent representations of the embeddings of the test-set molecules for both the MLM and MTR models, before and after fine-tuning
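
A comparison like Fig. 3 can be produced by embedding the test-set SMILES with the pre-trained and the fine-tuned checkpoints and projecting the vectors to 2-D. The sketch below is one possible way to do this, not necessarily the procedure used here: it assumes mean pooling over token hidden states and a PCA projection, and the fine-tuned checkpoint path is a placeholder.

```python
# Embedding-extraction sketch (assumed: mean pooling over tokens, PCA projection).
import numpy as np
import torch
from sklearn.decomposition import PCA   # used in the commented projection step below
from transformers import AutoTokenizer, AutoModel

def embed_smiles(checkpoint, smiles_list, batch_size=64):
    """Return one pooled embedding vector per SMILES string."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint).eval()
    vectors = []
    with torch.no_grad():
        for i in range(0, len(smiles_list), batch_size):
            batch = smiles_list[i:i + batch_size]
            enc = tokenizer(batch, padding=True, truncation=True,
                            max_length=256, return_tensors="pt")
            out = model(**enc).last_hidden_state           # (B, T, H)
            mask = enc["attention_mask"].unsqueeze(-1)     # (B, T, 1)
            pooled = (out * mask).sum(1) / mask.sum(1)     # mean over real tokens
            vectors.append(pooled.cpu().numpy())
    return np.concatenate(vectors)

# Example usage (checkpoint path of the fine-tuned model is hypothetical):
# emb_pretrained = embed_smiles("DeepChem/ChemBERTa-77M-MLM", test_smiles)
# emb_finetuned  = embed_smiles("chemberta-hiv/checkpoint-XXXX", test_smiles)
# coords = PCA(n_components=2).fit_transform(emb_finetuned)
```
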

References:
[1] DeepChem, Transfer Learning with ChemBERTa Transformers tutorial
[2] S. Chithrananda et al., ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction
[3] W. Ahmad et al., ChemBERTa-2: Towards Chemical Foundation Models
[4] HuggingFace, DeepChem: ChemBERTa-77M-MLM
[5] HuggingFace, DeepChem: ChemBERTa-77M-MTR
[6] MoleculeNet Dataset