Heuristic sequence alignment

Heuristic algorithms are much faster

They can compare against whole databases

Worse at optimal alignment and very distant relationships

Good and finding close relationships quickly.

Blast

Basic Local Alignment Search Tool (BLAST) is used to query a database of target sequences. It will return target sequences that are similar to the query sequence.

Blast Algorithm

Words -> neighbourhoods -> seeds -> HSP -> extensions

  1. Low complexity regions and repeats are removed from the query sequence
  2. The query sequence is broken up into words of k length. k=3 for protein, k=11 for DNA.
  3. Generate neighbourhood words. Each word is extended to a set of neighbourhood words with corresponding scores. Highly similar neighbourhood words will get higher scores. Smaller k will increase sensitivity but will be slower. The neighbourhood set only contains words with scores greater than t.
  4. Word hits. Scan target sequences for matching hits in the neighbourhood words. Visualize this on the grid.
  5. High scoring pairs (HSP) are determined from word hits that are closer than distance d
  6. HSP are extended a base or residue at a time, the score is adjusted for mismatch and match. Stop after a drawdown from peak score, and trim the extension back the highest score.

BLAST Algorithm [wikipedia.org].

Scoring

Raw score S

$$ S = \sum_{a,b} s_{a,b} - {gap\_penalties}$$

where sa,b are taking from a substitution matrix

The raw score has significance measure, it is just a number.

It is not comparable between runs with other parameters (e.g. matrix, target database, input sequence)

Gap Penalties

For the affine case, the gap penalty could be scored using the length of the gap.

$$ {gap\_penalty} = -b -ax $$

Where b is the penalty for opening the gap

and a is the penalty for a gap of length x

The values chosen depend on the matrix used, and whether you want the alignment to have more gaps or more residues aligned.

Bit Score S'

Raw score gives us the best extreme value.

$$ S' = \frac{\lambda S - ln K}{ln 2} $$

This allows your raw score, S, to be compared between other scores from alignments with different matrices and databases.

λ is calculated from the matrix.

K is calculated from the target database.

The bit score is a normalised raw score.

The P-value is the probability of obtaining a score equal to or better than S

$$ P_{value} = 2^{-S'} $$

The E-value can be derived from the P-value with multiple testing.

$$ E = mnP_{value} $$

where m is the length of the query sequence

and n is the length of the database

nm is the search space.

Smaller E-values, less than -0.01 imply homology. An E > 1 is expected by chance.

BLAT

BLAST like alignment tool (BLAT)

BLAT is less sensitive

Returns a single alignment instead of multiple local alignments

Used to search target genomes, genomes are indexed. This is different to BLAST that will index the input sequence.

BLAT vs BLAST [wikipedia.org].

References

Understanding Bioinformatics chapter 4

Understanding Bioinformatics chapter 5

SAG Lecture notes 3