A machine learning approach to predict the recombinant expressibility of cysteine-dense peptides in mammalian cells based on their primary sequence, compatible with multiple types of deep learning protein representations.
Paper: "CysPresso: a classification model utilizing deep learning protein representations to predict recombinant expression of cysteine-dense peptides", https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05327-8
The raw data (protein representations) used in this work can be obtained here: https://huggingface.co/datasets/TonyKYLim/CysPresso/tree/main
You can try out any sequence using BringYourOwnSequences-CysPressoESM.ipynb.
The notebook for our method can be opened in Colab at https://githubtocolab.com/Zebreu/cyspresso/blob/main/CysPresso.ipynb, where you can explore both the dataset and the methodology used in this study. To run the notebook, upload the CDPs.csv file provided in this repo, along with any embeddings you generated, to the Colab workspace.
BringYourOwnSequences-CysPressoESM.ipynb uses the ESM2 650M model for a quick, interactive demonstration.
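You can use ColabFold (https://github.com/sokrypton/ColabFold) to generate embeddings, which are saved as npy files in the Colab workspace. In the "Run Prediction" code cell, call run with the following arguments (note that save_single_representations and save_pair_representations are both set to True):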
run(
    queries=queries,
    result_dir=result_dir,
    use_templates=use_templates,
    custom_template_path=custom_template_path,
    use_amber=use_amber,
    msa_mode=msa_mode,
    model_type=model_type,
    num_models=5,
    num_recycles=num_recycles,
    model_order=[1, 2, 3, 4, 5],
    is_complex=is_complex,
    data_dir=Path("."),
    keep_existing_results=False,
    recompile_padding=1.0,
    rank_by="auto",
    pair_mode=pair_mode,
    stop_at_score=float(100),
    prediction_callback=prediction_callback,
    save_single_representations=True,
    save_pair_representations=True,
    dpi=dpi
)
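Once ColabFold has written the representations to disk, they need to be turned into fixed-length feature vectors before a classifier can use them. The sketch below is one possible way to do that and is not taken from the CysPresso notebook: it assumes each npy file holds a per-residue array of shape (length, dim), it assumes the single-representation files can be matched with the glob pattern `*single_repr*.npy` (the exact file names depend on your ColabFold version), and it uses mean pooling over residues as an illustrative choice. The function names `load_pooled_embedding` and `build_feature_matrix` are hypothetical.

```python
from pathlib import Path

import numpy as np


def load_pooled_embedding(npy_path):
    """Load a per-residue embedding of shape (length, dim) and average over residues.

    Mean pooling is an assumption made for this sketch, not necessarily
    the pooling used in the CysPresso study.
    """
    per_residue = np.load(npy_path)
    if per_residue.ndim != 2:
        raise ValueError(
            f"expected a 2D (length, dim) array in {npy_path}, got shape {per_residue.shape}"
        )
    return per_residue.mean(axis=0)


def build_feature_matrix(result_dir, pattern="*single_repr*.npy"):
    """Collect pooled embeddings for every matching file in result_dir.

    The glob pattern is an assumption; adjust it to the file names that
    your ColabFold run actually produced.
    """
    paths = sorted(Path(result_dir).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {result_dir}")
    features = np.stack([load_pooled_embedding(p) for p in paths])
    return [p.name for p in paths], features
```

Calling `build_feature_matrix(result_dir)` returns the matched file names and an array with one row per peptide, which can then be paired with the labels in CDPs.csv.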
It is also possible to use OpenFold with our modified notebook at https://githubtocolab.com/Zebreu/cyspresso/blob/main/openfoldrepresentations.ipynb.
Cells after the prediction cell show where to find the embeddings, e.g. all_results['results']['single'].
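To reuse the OpenFold embeddings in the same way as the ColabFold npy files, the single representation can be written to disk. The following is a minimal sketch, assuming `all_results['results']['single']` can be converted to a NumPy array with `np.asarray` (a tensor living on an accelerator may first need to be moved to the CPU); the helper name `save_single_representation` and the output file name are hypothetical.

```python
from pathlib import Path

import numpy as np


def save_single_representation(all_results, out_path):
    """Save the single representation from an OpenFold results dict as an npy file.

    Only the key path all_results['results']['single'] comes from the notebook;
    everything else here is an illustrative assumption.
    """
    single = np.asarray(all_results["results"]["single"])
    out_path = Path(out_path)
    np.save(out_path, single)
    return out_path, single.shape
```

For example, `save_single_representation(all_results, "my_peptide_single.npy")` would write the array next to the notebook and return the path together with the array shape.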
OpenFold might generate predictions more slowly than ColabFold.
The cysteine-dense peptide database was modified from Correnti, CE et al. Nat Struct Mol Biol. 2018.