A Short Background on BLAST

Pattern Matching Algorithms

Pattern matching algorithms form the basis for modern BLAST. The algorithms that led up to BLAST, in rough order of development, are:

  • Needleman-Wunsch - global alignment; slow and CPU intensive, but still used for NCBI's Global Alignment tool; because it aligns sequences end to end, it will miss domain or motif alignments
  • Smith-Waterman - local alignment; slow and CPU intensive, because it evaluates every possible local alignment by dynamic programming instead of seeding the search with fixed-length "words"
  • FASTA - local alignment; uses a set "word" size to search lookup tables, so it's fast
  • BLAST - local alignment; also uses "words" to search and find matches, then extends each match forward and backward in the target sequences; this is what we still use today
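
The word-based strategy that FASTA and BLAST share can be sketched in a few lines of Python. This is a simplified illustration, not NCBI's implementation: it seeds only on exact word matches (real BLAST also scores similar "neighborhood" words and allows gaps), and the function names, the +1/-1 scoring, and the `x_drop` cutoff are assumptions chosen for the example.

```python
from collections import defaultdict


def build_word_index(target, word_size):
    """Lookup table: every word of length word_size -> its start positions in target."""
    index = defaultdict(list)
    for pos in range(len(target) - word_size + 1):
        index[target[pos:pos + word_size]].append(pos)
    return index


def extend_seed(query, target, q_pos, t_pos, word_size,
                match=1, mismatch=-1, x_drop=2):
    """Extend an exact word hit without gaps, forward then backward.

    Extension in a direction stops once the running score has fallen
    x_drop below the best score seen so far (a hypothetical cutoff).
    Returns (best_score, query_start, query_end), query_end exclusive.
    """
    best = score = word_size * match
    q_start, q_end = q_pos, q_pos + word_size

    # Forward extension.
    i, j = q_pos + word_size, t_pos + word_size
    while i < len(query) and j < len(target) and best - score < x_drop:
        score += match if query[i] == target[j] else mismatch
        i, j = i + 1, j + 1
        if score > best:
            best, q_end = score, i

    # Backward extension, restarting from the best forward score.
    score = best
    i, j = q_pos - 1, t_pos - 1
    while i >= 0 and j >= 0 and best - score < x_drop:
        score += match if query[i] == target[j] else mismatch
        if score > best:
            best, q_start = score, i
        i, j = i - 1, j - 1

    return best, q_start, q_end


def best_ungapped_hit(query, target, word_size=4):
    """Seed on every shared word, extend each seed, keep the top-scoring segment."""
    index = build_word_index(target, word_size)
    best_hit = None
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for t_pos in index.get(word, []):
            score, q_start, q_end = extend_seed(query, target, q_pos, t_pos, word_size)
            if best_hit is None or score > best_hit[0]:
                best_hit = (score, query[q_start:q_end])
    return best_hit


print(best_ungapped_hit("CCACGTACGTGG", "TTTACGTACGTAAA"))  # (8, 'ACGTACGT')
```

Only positions that share a word are ever examined, which is why this approach is so much faster than filling in a full dynamic programming table for every query and target pair.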

Amino Acid Substitution Matrices (protein only)

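  • PAM - Point Accepted Mutation (also expanded as Percent Accepted Mutation)
    • The PAM number sets how much evolutionary change the matrix models; PAM1 corresponds to about 1 accepted mutation per 100 residues, and higher-numbered matrices accept more mutations/differences between query and hits
    • Higher PAM number = less stringent search, more allowable sequence differences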
  • BLOSUM - Blocks Substitution Matrix
    • Uses blocks of alignment in similar proteins for each position in a sequence, taken from the BLOCKS database of ungapped alignments among highly conserved regions
    • The BLOSUM number is the percent identity threshold used to group similar sequences when the matrix was built, so a higher BLOSUM number = a MORE stringent search (higher percentage of sequence identities)
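
As a rough illustration of how a BLOSUM-style score comes out of a block, the sketch below counts residue pairs in each column of a toy ungapped block and converts observed versus expected pair frequencies into rounded log-odds scores in half-bit units. The block contents and the function name are made up for the example, and the clustering step that gives each BLOSUM matrix its number (grouping sequences above a chosen percent identity so they count as one) is deliberately left out.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_from_block(block):
    """Derive BLOSUM-style scores from one ungapped block of aligned sequences.

    block: list of equal-length strings (a toy stand-in for a BLOCKS entry).
    Returns {(a, b): score} with each residue pair stored in sorted order.
    """
    # Count every residue pair observed within each alignment column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total_pairs = sum(pair_counts.values())
    observed = {pair: n / total_pairs for pair, n in pair_counts.items()}

    # Background frequency of each residue implied by the observed pairs.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    # Log-odds of observed vs. expected pair frequency, in half-bits.
    scores = {}
    for (a, b), freq in observed.items():
        expected = background[a] ** 2 if a == b else 2 * background[a] * background[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


toy_block = ["AL", "AL", "AI", "AL"]  # hypothetical two-column block
print(log_odds_from_block(toy_block))
```

Pairs that share a column more often than their background frequencies predict get positive scores. Pairs that never share a column are simply absent from this toy result; a real matrix is built from many blocks, where pairs seen less often than expected receive negative scores.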

BLOSUM62 performed the best of all the matrices tested for most protein BLAST searches, so it is used as the default setting.

  • If you want to look for less identical matches in highly conserved regions or folds, you can consider using a lower-numbered BLOSUM matrix (BLOSUM45 or 50), or a higher-numbered PAM matrix (such as PAM250). Note that PAM70 is more stringent than BLOSUM62, not less, and is better suited to short queries.
  • Compare the results from each matrix you try to determine which one finds more of the matches you are interested in. Use BLAST statistics (the Bit Score) to compare results between different searches.

The statistics of BLAST

For a complete description of the statistics of BLAST at the NCBI, review their documentation on the statistics of sequence similarity searching. Here are some primary numbers to consider in your results:

  • E value ("Expect value" or "Expectation value")
    • The number of alignments with a score equivalent to or better than the BLAST-calculated raw score S that are expected to occur in a database search by chance. It is an expected count rather than a probability, so it can be greater than 1. The lower the E value, the more significant the score.
  • S (raw score)
    • The score of an alignment, S, calculated as the sum of substitution and gap scores.
    • Substitution scores are given by a look-up table (like PAM, BLOSUM). Gap costs are typically built from G, the gap opening penalty, and L, the gap extension penalty: a gap of length n costs G+Ln.
  • S' (Bit Score)
    • The bit score S' is derived from the raw score S by taking into account the statistical properties of the scoring system used. Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.
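
The three numbers above can be tied together in a short Python sketch. It assumes the standard Karlin-Altschul relations described in NCBI's statistics documentation, S' = (λS - ln K) / ln 2 and E = m·n·2^(-S'), where m and n are the effective query and database lengths. The function names, the tiny score table, and the sequence lengths are made up for the example, and λ = 0.267 with K = 0.041 are the values commonly reported for gapped BLOSUM62 searches with gap costs 11/1; treat them as assumptions and use the Lambda and K values printed in your own BLAST report.

```python
import math

# A few substitution scores copied from BLOSUM62 for illustration only;
# a real search uses the full 20x20 matrix. Keys are in sorted order.
SUBSTITUTION = {
    ("A", "A"): 4, ("K", "K"): 5, ("L", "L"): 4, ("W", "W"): 11,
    ("I", "L"): 2, ("K", "R"): 2,
}


def pair_score(a, b):
    """Look up a substitution score; the table is symmetric."""
    return SUBSTITUTION[tuple(sorted((a, b)))]


def raw_score(aligned_query, aligned_subject, gap_open, gap_extend):
    """Raw score S: substitution scores minus G + L*n for each gap of length n."""
    score = 0
    run_len, run_side = 0, None
    for a, b in zip(aligned_query, aligned_subject):
        side = "query" if a == "-" else "subject" if b == "-" else None
        if side != run_side and run_len:
            # A gap run just ended: charge one opening plus n extensions.
            score -= gap_open + gap_extend * run_len
            run_len = 0
        run_side = side
        if side is None:
            score += pair_score(a, b)
        else:
            run_len += 1
    if run_len:  # alignment ended inside a gap
        score -= gap_open + gap_extend * run_len
    return score


def bit_score(raw, lam, K):
    """Normalize a raw score with the scoring system's lambda and K."""
    return (lam * raw - math.log(K)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least this well."""
    return query_len * db_len * 2 ** (-bits)


S = raw_score("WKLA-LK", "WRLAAIK", gap_open=11, gap_extend=1)
bits = bit_score(S, lam=0.267, K=0.041)
E = expect_value(bits, query_len=250, db_len=5_000_000)  # hypothetical lengths
print(S, round(bits, 1), E)
```

Here the raw score is 11 + 2 + 4 + 4 + 2 + 5 for the six aligned pairs, minus 11 + 1 for the single one-residue gap, giving S = 16. Such a short alignment has a low bit score and a very large E value against a database of this size, which is what "not significant" looks like in practice.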

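When you get a table of BLAST results, you will see a column for the "Max score" and "Total score", along with columns for the E value, percent identity, and coverage. On the NCBI web results page, these score columns report bit scores: Max score is the best-scoring single alignment to that hit, and Total score is the sum over all aligned segments for that hit. When you look at each individual query-to-hit alignment, you will see the bit score for that alignment followed by the raw score S in parentheses.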