---
title: "Homology Modeling"
date: "`r Sys.Date()`"
toc: true 
toc_float: true
format:
  html:
    theme: simplex
    toc: true
    toc-location: right
    toc-depth: 4
    number-sections: true
    code-overflow: wrap
    link-external-icon: true
    link-external-newwindow: true

bibliography: references.bib
editor_options: 
  markdown: 
    wrap: 80
---

# Modeling protein structures: Mind the gap

The number of protein sequences in databases has grown exponentially over the
last decades, particularly after the revolution in high-throughput sequencing
methods.

[![Number of entries in UniProt/TrEMBL (release
2022_2).](pics/trembl_stats.png "Number of entries in UniProt/TrEMBL"){#fig-seqs
.figure}](https://www.ebi.ac.uk/uniprot/TrEMBLstats)

However, experimental determination of 3D protein structures is often difficult,
time-consuming, and subject to limitations, such as experimental error, data
interpretation, and the modeling of new data on previously released structures.
Thus, despite substantial efforts that started at the beginning of the 21st
century to implement high-throughput structural biology methods [see for
instance @manjasetty2008], the number of available protein structures is more
than 1,000 times smaller than the number of sequences (>226M sequences in
UniProt and ∽195k structures in the RCSB Protein Data Bank in August 2022). This
difference is called the *protein structure gap* and it is constantly widening
[@muhammed2019].

[![Number of macromolecular structures in RCSB PDB database (accessed 6th July
2022).](pics/RCSB_stats.png "Number of macromolecular structures in RCSB PDB database."){#fig-structures
.figure}](https://www.rcsb.org/stats/growth/growth-released-structures)

Thus, an accurate prediction of the 3D structure of any given protein is needed
to make up for the lack of experimental data.

## Are protein structures predictable at all?

Amino acid properties determine the phi and psi angles that eventually shape the
higher structural levels. Protein folding might be more complex though, as it
should be coupled to protein synthesis.

One might think that the complexity and diversity of protein structures in
nature must be huge. Indeed, upon determining the first three-dimensional
structure of a globular protein (myoglobin, 1958), John Kendrew and his
coworkers seemed very disappointed [@kendrew1958]:

*Perhaps the most remarkable features of the molecule are its complexity and its
lack of symmetry. The arrangement seems to be almost totally lacking in the kind
of regularities which one instinctively anticipates, and it is more complicated
than has been anticipated by any theory of protein structure.*

Not much later, in 1968, Cyrus Levinthal (1922--1990) published the so-called
*Levinthal's paradox*: proteins fold in nano/milliseconds, yet even for small
peptides it would take an enormous amount of time to test the astronomical
number of possible conformations. Take a small protein of 100 aa; it has 99
peptide bonds and 198 different phi and psi angles. Assuming only 3 alternative
conformations for each angle, this yields 3^198^ (= 2.95 x 10^94^) possible
conformations. If we design a highly efficient algorithm that tests 1
conformation per nanosecond:

|  2.95 x 10^85^ secs = **9 x 10^68^ billion years**

Considering that the age of the universe is 13.8 billion years, predicting
protein structures does not seem an easy task.
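The back-of-the-envelope calculation above can be reproduced with a few lines of
Python. This is only a sketch of the arithmetic under the same assumptions as
the text (3 states per angle, 1 conformation tested per nanosecond); the
function name `levinthal_estimate` and the use of a 365.25-day year are choices
made for this illustration.

```python
# Levinthal-style estimate: how long would an exhaustive search take?
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_BILLION_YEARS = 13.8


def levinthal_estimate(n_residues, states_per_angle=3, rate_per_second=1e9):
    """Return (conformations, seconds, billion_years) for a brute-force search."""
    n_angles = 2 * (n_residues - 1)          # one phi and one psi per peptide bond
    conformations = states_per_angle ** n_angles
    seconds = conformations / rate_per_second
    billion_years = seconds / SECONDS_PER_YEAR / 1e9
    return conformations, seconds, billion_years


conformations, seconds, billion_years = levinthal_estimate(100)
print(f"Conformations:     {conformations:.2e}")
print(f"Search time (s):   {seconds:.2e}")
print(f"Search time (Gyr): {billion_years:.2e}")
print(f"Ages of universe:  {billion_years / AGE_OF_UNIVERSE_BILLION_YEARS:.2e}")
```

Changing `n_residues` or `states_per_angle` shows how quickly the search space
explodes, even for very short peptides.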

In this context, a very simple experiment performed more than 50 years ago shed
some light on the protein folding mechanism. [Christian
Anfinsen](https://en.wikipedia.org/wiki/Christian_B._Anfinsen) was able to
completely denature (unfold) ribonuclease A by adding reducing agents and urea
under heat treatment, and then switch back to normal conditions that allowed the
protein to refold into a fully functional state. This experiment indicates that
the amino acid sequence dictates the final structure. Notwithstanding some
relevant exceptions, this has been largely confirmed.

![The Anfinsen Dogma: Amino acid sequence dictates the final structure. From
@anfinsen1973.](pics/anfinsen.jpg "Anfinsen"){#fig-anfinsen .figure}

One can imagine that the *in vivo* native structure of a protein resembles the
lowest free energy conformation, i.e., the global energy minimum. That is the
basis of the funnel model of protein folding, which assumes that the number of
possible conformations is reduced each time a local energy minimum is reached,
thereby defining a path for the folding process.

[![Schematic diagram of a protein folding energy landscape according to the
funnel model. Denatured molecules at the top of the funnel might fold to the
native state by a myriad of different routes, some of which involve transient
intermediates (local energy minima) whereas others involve significant kinetic
traps (misfolded states). From
.](pics/funnel_1.jpeg "The funnel of protein folding"){#fig-funnel
.figure}](https://www.sciencedirect.com/science/article/pii/S0968000400017072?casa_token=BR_maX_GNYYAAAAA:IZrQqI9jIbiv-ZeA6aOMHQxsVcp-wgy0XPFO3DRcFiAi7TSrX-3cc7Jb6dhTHdXSQseEhF3l#BIB49)

::: callout-important
### Important

In conclusion, prediction of protein structures is possible, as protein folding
relies only on the protein sequence, but an exhaustive search of all possible
conformations would require virtually infinite time and computational resources.
:::

Homology modeling is one of the most convenient *tricks* to bypass that
limitation. Basically, the strategy is adding all the possible extra information
to the amino acid properties, namely the evolutionary conservation of sequences
and structures.

Very often, before generating models you already have some information about
your protein. For instance, if it is an enzyme you may have spotted the
catalytic residues or the substrate-interaction region. It is also advisable to
check the literature, particularly any companion paper of the related PDB
structure(s) that you may find and use as template(s) for modeling.

# Homology modeling: Conservation is the key

![Protein modeling in a nutshell is about going from the sequence to the
structure. But, how do we do
it?](pics/modeling.png "Protein modeling"){#fig-nutshell .figure}

Traditionally and conceptually, the prediction of protein structures can be
addressed from two different perspectives: comparative modeling and *ab initio*
prediction. In the comparative modeling approach, one can predict the 3D
structure of a protein sequence only if homologs are found in the database of
proteins with known structures. Obviously, the identification of such homologs
is the key here. Until recently, most progress in protein modeling was driven by
improvements in the methods used to identify distant sequence similarities that
would reflect similar protein folds.

[![Two different approaches for structure
prediction.](pics/predway.jpeg "Two different approaches for structure prediction."){#fig-modelingways
.figure}](http://isw3.naist.jp/IS/Bio-Info-Unit/gogroup/study/study-en.html)

The accuracy of homology modeling is limited by the availability of similar
structures, whereas *ab initio* predictions are limited by the mathematical
models and computational resources, and are often useful only for small
peptides.

To maintain structure and function, certain amino acids in the protein sequence
are under stronger selective pressure, evolving either more slowly than expected
or within specific constraints, such as chemical similarity (i.e., conservative
substitutions). Thus, homology modeling approaches assume that a similar protein
sequence implies a similar 3D structure and function. At the same time, although
the number of published sequences and structures has continually increased, the
number of unique folds has remained almost steady since 2008.

![Growth of the protein structure database since its inception in 1974. From
[Bonvin lab
site](https://www.bonvinlab.org/education/molmod_online/modelling/).](pics/rcsb-statistics.png "Protein structures and unique folds"){#fig-bovin
.figure}

That implies that the space of protein sequences is much larger than the space
of structures. This has been exploited by some structural databases, like
[Scop2](https://scop.mrc-lmb.cam.ac.uk/) or [CATH](https://www.cathdb.info/),
that use hierarchical classifications of structures into very few categories of
different structures.

![[CATH](https://www.cathdb.info/) main categories exemplify the limited
structure space, with 86k vs 500k domains in 2006 and 2022, respectively, but
40 and 41 architectures in the same years.](pics/cath.png "CATH"){#fig-cath
.figure width="632"}

In conclusion, protein structures are more conserved than sequences, which
allows for model construction by comparison of related proteins. Biological
sequences evolve through mutation and selection, and the selective pressure is
different for each residue position in a protein according to its structural and
functional relevance. Sequence alignments attempt to tell us the evolutionary
story of proteins.

# Homology modeling in four steps

![Workflow of template-based protein modeling. From Expasy [Protein Structure,
Comparative Protein Modelling, and Visualisation online
course](https://swissmodel.expasy.org/static/course/files/PartII_homology_modelling.pdf).](pics/homology_simple_workflow.png "Homology Modeling workflow"){#fig-modelingsteps
.figure width="397"}

A basic workflow of homology modeling requires three steps: (1) identification
of the most suitable template, (2) alignment of the query sequence and the
template, and (3) construction of the model. These steps can be addressed with
diverse methodological alternatives and can provide different outputs that must
be assessed (step 4) in order to decide the best solution for each step. For a
more detailed review on homology modeling, I suggest reading @haddad2020.

In this course, we are focusing on end-to-end modeling with
[SWISS-MODEL](https://swissmodel.expasy.org/) [@waterhouse2018], a fully
automated modeling server that allows you to build models without requiring a
strong bioinformatics background or coding skills. The early versions of
SWISS-MODEL only allowed modeling sequences with close homologs in databases
but, as discussed below, the implementation of advances in template recognition
and model building has increased the modeling proficiency and accuracy,
particularly during the last decade.

SWISS-MODEL supports different inputs. Usually, you provide a single query
sequence for template searching, but you can bypass this step and directly
provide the template you want to use, or even a template and a custom alignment.
We will start with the first alternative.

#### [In-class Homology Modeling exercise: Quick Modeling with SWISS-MODEL.]{style="color:green"}

As an exercise, we are modeling a human DNA repair protein, NEIL2, and a viral
DNA polymerase (HAdV-2 pol). We will intersperse discussion about modeling with
the exercise steps.

::: callout-note
#### Rather than provide just a recipe to make a model in two clicks, the goal is to understand the process behind it, while doing exercises with selected examples
:::

## Step 1+2. Template search and alignment. {#template}

### Where can we search?

Template searching consists of finding a protein with a known structure whose
sequence is similar to that of our protein. As we mentioned above, the [RCSB
Protein Data Bank](https://www.rcsb.org/) (PDB) is the largest database of
protein structures, so we can search for templates by comparing the sequence of
our protein with the sequences of all the proteins in PDB. However, PDB was
built to contain all the macromolecular structures, not to search for modeling
templates. Similar to other end-to-end software, SWISS-MODEL has its own curated
database, called **SMTL** (SWISS-MODEL template library). It is based on PDB,
updated weekly, and also annotated and indexed to speed up searches. As of 6
July 2022, SMTL contains 133,049 unique protein sequences that map to 332,864
[biological units](https://proteopedia.org/wiki/index.php/Biological_Unit).

Now, you first need your sequences. A protein sequence database like
[UniProt](https://www.uniprot.org/) or [NCBI
Protein](https://www.ncbi.nlm.nih.gov/protein/) is a good place for this task.

If you want to take a shortcut, here are the
[HAdV-2pol](https://rest.uniprot.org/uniprotkb/P03261.fasta) and
[NEIL2](https://rest.uniprot.org/uniprotkb/Q969S2.fasta) sequences.
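If you prefer to retrieve the sequences programmatically, the following is a
minimal sketch using only the Python standard library. The helper names
`parse_fasta` and `fetch_fasta` are made up for this example, and the toy record
is an invented sequence, not the real NEIL2 entry.

```python
from urllib.request import urlopen


def parse_fasta(text):
    """Return (header, sequence) for the first record of a FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not look like a FASTA record")
    header = lines[0][1:]
    seq_lines = []
    for ln in lines[1:]:
        if ln.startswith(">"):      # stop at the next record, if any
            break
        seq_lines.append(ln)
    return header, "".join(seq_lines)


def fetch_fasta(url):
    """Download a FASTA file as text (requires an internet connection)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


# Toy record (invented sequence) so the example runs offline.
toy_record = ">toy|TEST|made-up record\nMKTAYIAKQR\nQISFVKSHFS\n"
header, sequence = parse_fasta(toy_record)
print(header, len(sequence))
```

With a network connection,
`parse_fasta(fetch_fasta("https://rest.uniprot.org/uniprotkb/Q969S2.fasta"))`
should return the NEIL2 header and sequence, ready to paste into SWISS-MODEL.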

### How can we search accurately and fast?

![Query-template alignment is the base for homology modeling
[@kelley2009]](pics/template_query.png "Query-template alignment is the base for a new model"){#fig-query
.figure width="631"}

Finding templates requires comparing sequences, so an accurate and powerful
alignment method is essential. Comparing one protein sequence with a whole
database is time-consuming, as most comparisons involve totally unrelated
proteins, which is a waste of resources. Two basic improvements increased the
template search capacity: (1) the introduction of secondary structure (SS), by
comparing SS predictions of the query protein with the secondary structures of
the proteins in the database, and (2) the use of profiles to make the comparison
easier.

[Profiles](https://www.ebi.ac.uk/training/online/courses/protein-classification-intro-ebi-resources/what-are-protein-signatures/signature-types/what-are-profiles/)
are a mathematical way to summarize a multiple sequence alignment in which the
frequency of each amino acid at each position is quantified. A particular type
of profile, the [*Hidden Markov
Models*](https://www.ebi.ac.uk/training/online/courses/protein-classification-intro-ebi-resources/what-are-protein-signatures/signature-types/what-are-hmms/)
(HMMs), is very well suited to searching databases for similar sequences. HMMs
include amino acid insertions and deletions, meaning that they can model entire
alignments, including divergent regions. This allows the identification of
highly conserved positions that define not only the protein function but also
the fold, for instance glycine residues at the end of each beta-strand or a
pattern of polar residues that favors alpha-helices.

Comparing the query sequence with a sequence database beforehand allows us to
include evolutionary information about it. As a result, we moved from a
requirement of \>30% identity to obtain good models before the implementation of
profiles, to good models even with ⁓20% identity or below. Moreover, the
generation of profiles also facilitates clustering of the search database,
reducing search time. These capabilities led to the implementation of so-called
**fold recognition** in homology modeling.
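To make the idea of a profile more concrete, here is a toy sketch that turns a
small multiple sequence alignment into per-position amino acid frequencies. It
only counts residues; real profiles and HMMs (such as those used by HHblits)
additionally use sequence weighting, pseudocounts and explicit
insertion/deletion states. The alignment and the function names are invented for
this illustration.

```python
from collections import Counter


def build_profile(alignment):
    """Per-column amino acid frequencies for equal-length aligned sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa != "-"]   # gaps are ignored
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


def consensus(profile):
    """Most frequent residue at each position ('-' for all-gap columns)."""
    return "".join(max(col, key=col.get) if col else "-" for col in profile)


# Invented toy alignment: position 1 is fully conserved, position 2 is not.
toy_msa = ["GKSTL", "GRSTL", "GKATI", "GK-TL"]
profile = build_profile(toy_msa)
for position, column in enumerate(profile, start=1):
    print(position, column)
print("Consensus:", consensus(profile))
```

Fully conserved columns (frequency 1.0 for a single residue) are the kind of
positions that tend to be informative about function and fold.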

![From sequence vs. sequence search to profile-profile comparison
[@kelley2009].](pics/profiles.jpg "From sequence vs. sequence search to profile-profile comparison"){#fig-profiles
.figure width="669"}

Template searches in SWISS-MODEL have evolved through the years and this step is
currently performed with HHblits [@remmert2011], a profile-profile method. We
could also search for templates with BLAST or with other profile-based methods,
like
[Psi-BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&PROGRAM=blastp&BLAST_PROGRAMS=psiBlast),
[HHPred](https://toolkit.tuebingen.mpg.de/tools/hhpred), or
[JackHMMER](https://www.ebi.ac.uk/Tools/hmmer/search/jackhmmer). Templates are
then ranked by two numeric parameters ranging between 0 and 1: **GMQE** (global
model quality estimate) and **QSQE** (quaternary structure quality estimate).
Briefly, GMQE uses probability functions to assess several properties of the
target-template alignment (sequence identity, sequence similarity, HHblits
score, the agreement between the predicted secondary structure of the target and
that of the template, and the agreement between the predicted solvent
accessibility of the target and that of the template; all normalized by
alignment length) to predict the expected quality of the resulting model. QSQE
assesses the probability of the oligomeric state of the model.

Note: If you click on "Build Model", it will directly use the top-ranked
template, so you'll miss some fun, but you can go back for that later on if you
change your mind.

[![SWISS-MODEL modeling start. Just paste your sequences and click on
Search.](pics/seach.png "SWISS-MODEL modeling"){#fig-swissmodel
.figure}](https://swissmodel.expasy.org/interactive)

#### [Could you foresee which of our queries will give rise to a better model? Why?]{style="color:green"}

## Step 3. Model Building.

By default, SWISS-MODEL will provide 50 ranked possible templates. The output
also contains information about the method and resolution of the templates, the
% of identity (and alignment coverage) with the query sequence, and the GMQE and
QSQE.

The top template is selected by default and it will likely give the best model,
but it is also interesting to try some alternative templates depending on the
downstream application of the model (see [below](#corollary)). For instance, you
may prefer a template bound to a different substrate or cofactor that has a key
role in the protein function, or one with different coverage or % identity.

Once the template(s) are selected, model coordinates are constructed based on
the alignment of the query and template sequences using the ProMod3 module
[@studer2021]. SWISS-MODEL uses fragment assembly, which is also the basis of
*Fold-recognition* or *Threading* methods (see Threading
[section](advanced.html#sec-threading)). Other programs, like
[Modeller](https://salilab.org/modeller/), are based on the satisfaction of
general spatial restraints [@janson2019]. Modeller is a command-line tool that
allows full customization of the modeling, which requires more knowledge about
the process but can be very useful for some types of proteins. However, it has
been implemented in some online servers
([ModWeb](https://modbase.compbio.ucsf.edu/modweb/)) and user-friendly
applications, including ChimeraX and PyMOL (PyMod plugin, @janson2021). Modeller
can also be called from the HHPred output (if you included PDB as a searchable
database), which is very convenient to model remote homologs using several
templates in a few minutes.

Fragment assembly uses the template core backbone atoms to build a core
structure of the model, leaving non-conserved regions (mostly loops) for later.
**Loop modeling** includes the use of a homolog-derived subset of a dedicated
loop database, [Monte Carlo](https://en.wikipedia.org/wiki/Monte_Carlo_method)
sampling as a fallback, and even *ab initio* building of missing loops.

![Backbone and loop modeling. From [Expasy Protein Structure, Comparative
Protein Modelling, and
Visualisation](https://swissmodel.expasy.org/static/course/files/PartII_homology_modelling.pdf).](pics/loops.png "Backbone and loop modeling"){#fig-backbone
.figure}

Then, the **side chains** of non-conserved amino acids are positioned. The goal
is to find the most likely side chain conformation, using template structure
information, rotamer libraries (from a curated set of known protein structures)
and energetic and packing criteria. If many side chains have to be placed in the
structure, this leads to a "chicken and egg problem", as positioning one rotamer
affects the others. Identifying possible hydrogen bonds between side chains, and
between side chains and the backbone, helps to reduce the optimization
calculations. At the end of the day, the more residues are correctly positioned,
the better the model.

![Side chain modeling. From [CMBI Seminars on
Bioinformatics](https://swift.cmbi.umcn.nl/teach/B1SEM/B1SEM_8.html).](pics/Rotamers.png "Side chain modeling"){#fig-rotamers
.figure}

Finally, a short energy minimization is carried out to reduce unfavorable
contacts and bonds by adjusting the angle geometries and relaxing close
contacts. This energy minimization step, or refinement, can be useful to achieve
better models but only when the fold is already accurate.

## Step 4. Result assessing.

The computer always gives you a model, but that doesn't mean that the model
makes sense. How can we know if we can rely on it? Output models are colored on
a temperature color scale, from navy blue (good quality) to red (bad quality).
That helps us to understand our model at first glance. Also, this is an
interactive site and you can zoom in and out of the model. Many other features
are available to work on your model. For instance, you can compare multiple
models or change the display options. You can also download all the files and
reports with the "Project Data" button.

[![NEIL2 model created with SWISS-MODEL (July
2022)](pics/neil2.png "NEIL2 model"){#fig-neil
.figure}](https://swissmodel.expasy.org/interactive/TWq8LD/models/)

There is also a "Structure Assessment" option. This provides a detailed report
of the structural problems of your model. You can see [Ramachandran
plots](intro.html#sec-rama "Ramachandran plots in Wikipedia") that highlight in
red the amino acid residues with abnormal phi/psi angles in the model and a
detailed list of other problems.

Once the model is built, the GMQE is updated and the QMEAN Z-score and
QMEANDisCo are also reported [@studer2020]. The QMEAN Z-score, or normalized
QMEAN score, indicates how the model compares to experimental structures of
similar size. A QMEAN Z-score around 0 indicates good agreement, while scores
below -4.0 indicate models of low quality. Besides the number, a plot shows the
QMEAN score of our model (red star) among the QMEAN scores of experimentally
determined structures, plotted against their size. Overall, the Z-score
expresses how many standard deviations the score of our model lies from the mean
score of those experimental structures.
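As a reminder of what a Z-score is, the snippet below computes one for a
hypothetical model score against a made-up set of reference scores. It is only a
sketch of the general formula, z = (x - mean) / SD; the numbers are invented, and
the way SWISS-MODEL selects its reference structures and estimates the standard
deviation is not reproduced here.

```python
from statistics import mean, stdev


def z_score(value, reference):
    """Number of standard deviations between `value` and the reference mean."""
    return (value - mean(reference)) / stdev(reference)


# Invented scores standing in for experimental structures of similar size.
reference_scores = [0.70, 0.74, 0.78, 0.82, 0.86]
model_score = 0.62            # hypothetical score of our model

z = z_score(model_score, reference_scores)
print(f"Z-score: {z:.2f}")    # negative: the model scores below the reference mean
```

A value close to 0 would mean that the model scores like a typical experimental
structure, whereas strongly negative values flag a model that deviates from what
is expected.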

![Per-residue QMEANDisCo scores are mapped as red-to-green colour gradient on a
model of lbp-8 in Caenorhabditis elegans (UniProtKB: O02324, PDB: 6C1Z).
Distance constraints have been constructed from an ensemble of experimentally
determined protein structures that are homologous to lbp-8. The inset depicts
two example constraints between residues marked with colour-coded spheres in the
model. From @studer2020.](images/paste-C2BED0C6.png){#fig-disco .figure
width="572"}

QMEANDisCo was implemented in SWISS-MODEL in 2020 and it is a powerful, single
parameter that combines statistical potentials and agreement terms with distance
constraints (DisCo) to provide a consensus score. DisCo evaluates the
consistency of pairwise CA-CA distances in a model with constraints extracted
from homologous structures. All scores are combined using a neural network
trained to predict per-residue scores. We can check a global score, but also a
local score for each residue, which helps us to understand which regions of the
model are more likely to be accurately modeled (i.e. they are more reliable).

[![HAdV-2 DNA polymerase model obtained with SWISS-MODEL (July
2022).](pics/hadv-pol.png "HAdV-2 DNA polymerase"){#fig-pol
.figure}](https://swissmodel.expasy.org/interactive/GZ8GmU/models/)

QMEANDisCo can also be used to analyze models obtained with any other method in
order to make them comparable; you just need a `.pdb` file. There are other
independent tools commonly used to assess protein models, like
[VoroMQA](https://bioinformatics.lt/wtsam/voromqa) [@olechnovi2017] or
[ModFOLD](https://www.reading.ac.uk/bioinf/ModFOLD/ModFOLD8_form.html)
[@mcguffin2021]. VoroMQA is a very quick method that combines the idea of
statistical potentials (i.e. a knowledge-based score function) with the use of
interatomic contact areas to provide a score in the range of \[0,1\]. When
applied to the PDB database, most of the high-quality experimentally-based
structures have a VoroMQA score \>0.4. Thus, if the score is greater than 0.4,
the model is likely good, and models with a score \<0.3 are likely bad ones.
Models with a score of 0.3-0.4 are uncertain and should not be classified with
VoroMQA. On the other hand, ModFOLD is a meta-tool that provides a very detailed
report (and parseable files) with local and global scores, but it can take hours
or days to obtain the result, so I tend to use it only with selected models.
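If you score many models, the VoroMQA rule of thumb described above can be
written as a small helper. This is just a sketch of the thresholds mentioned in
the text (0.3 and 0.4); the function name and the example model names and scores
are invented.

```python
def classify_voromqa(score):
    """Apply the 0.3/0.4 rule of thumb to a VoroMQA global score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("VoroMQA scores are expected in the range [0, 1]")
    if score > 0.4:
        return "likely good"
    if score < 0.3:
        return "likely bad"
    return "uncertain"


# Hypothetical scores for three models of the same protein.
for name, score in [("model_A", 0.52), ("model_B", 0.35), ("model_C", 0.21)]:
    print(name, score, classify_voromqa(score))
```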

Another key parameter that you should know if you want to compare protein
structures is the **alpha carbon RMSD** (see [Structure alignment
section](ddbb.html#sec-alignment)). Any protein structural alignment will give
you this parameter as an estimate of the difference between the structures. You
can align structures with many online servers, like
[RCSB](https://www.rcsb.org/alignment) or
[FATCAT2](http://fatcat.godziklab.org/), or using molecular visualization apps,
like [ChimeraX](https://www.cgl.ucsf.edu/chimerax/) or
[PyMOL](https://pymol.org/2/) (see [Protein Structure Display](pdb.html#sec-apps)
section).
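For reference, the RMSD itself is a simple formula: the square root of the mean
squared distance between equivalent atoms. The sketch below assumes that the two
sets of CA coordinates are already superposed and paired one-to-one, which is
the part that the servers and apps above do for you (typically with an optimal
superposition step that is not shown here). The coordinates are invented.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD (in the input units, e.g. Å) between two paired lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Invented CA positions for a three-residue stretch in two structures.
structure_1 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
structure_2 = [(0.0, 0.0, 0.0), (3.8, 1.0, 0.0), (7.6, 2.0, 0.0)]

print(f"CA RMSD: {ca_rmsd(structure_1, structure_2):.2f} Å")
```

Because the deviations are squared before averaging, a few badly placed residues
(for example in a flexible loop) can dominate the value.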

#### [Which model is better? Which regions are more difficult to model? Why?]{style="color:green"}

## Corollary: What can I do with my model and what can I not? {#corollary}

[![Accuracy and application of protein structure models (in 2001). From
.](pics/sali.jpeg "Accuracy and application of protein structure models."){#fig-baker
.figure}](https://www.science.org/doi/10.1126/science.1065659)

With great power comes great responsibility. Using models calls for caution and
for experimental validation. Knowing the limitations of our model is required
for a realistic use of it, and those limits are defined by the model quality.

The accuracy of a comparative model is related to the percentage sequence
identity on which it is based. High-accuracy comparative models can have about
1-2 Å root mean square (RMS) error for the main-chain atoms, which is comparable
to the accuracy of a nuclear magnetic resonance (NMR) structure or an X-ray
structure. These models can be used for functional studies and the prediction of
protein partners, including drugs or other proteins working in the same process.
Also, for some detailed studies, it may be convenient to refine your model by
[*Molecular Dynamics*](https://en.wikipedia.org/wiki/Molecular_dynamics) and
related methods towards a native-like structure. I suggest checking the review
by @adiyaman2019 on this topic.

In contrast, low-accuracy comparative models are based on less than 20-30%
sequence identity, which hinders the modeling capacity and accuracy. Some of
these models can be used for protein engineering purposes or to predict the
function of orphan sequences based on the protein fold (using
[Dali](http://ekhidna2.biocenter.helsinki.fi/dali/) or
[Foldseek](https://search.foldseek.com/search)).

As mentioned above, it is also advisable to check the template structures and
read the papers describing them in order to squeeze all the information from
your model.

[**What do you think you could use our models of NEIL2 and HAdV-2pol
for?**]{style="color:green"}

# [Homology Modeling Practice](https://www.evernote.com/shard/s62/sh/1d73369b-368c-612f-a046-83f521f497e5/c64575610629aa27013352f3ebde4422){style="color:green"}

This is an [Evernote](https://evernote.com/intl/es) note that you can consult
online and also copy into your Evernote account if you wish.