Sequence Alignment

Sequence alignment can be used on a sequence of unknown origin. You can discover:

  • Whether the sequence is part of existing genes.
  • Whether it is similar to existing genes or families. If the sequence is similar, you may infer similar structure or function.
  • Phylogeny, to trace evolution within a family of proteins or species.

Challenges include sequences of differing length and multiple copies of a gene in the genome (such as pseudogenes).

Alignment is the task of locating equivalent regions between sequences so as to maximise similarity.

Mutation usually makes genes different between species, even if the protein is the same.

Identity alignment

Gaps are caused by insertions and deletions (indels).

Homology refers to similarity of sequence or structure due to a common ancestor.

Paralogues are genes that share a common ancestor and are in the same genome.

Orthologues are genes that share a common ancestor and are in different species' genomes.

Homology implies a common function or structure. Conserved amino acids are important to structure and/or function. However, function can be modified with little change in sequence.

Structure is more conserved than sequence. Low sequence similarity does not mean different structure or function.

Convergent evolution does not usually produce highly similar sequences.

Global Alignment

Considers the full length of both sequences.

Needleman-Wunsch algorithm.
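As a rough illustration of global alignment, the following is a minimal sketch of the dynamic-programming recurrence, returning only the optimal score. The function name, the match/mismatch scores and the linear gap cost are assumptions chosen for the example, not values from these notes.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b (illustrative scoring values)."""
    rows, cols = len(a) + 1, len(b) + 1
    # F[i][j] holds the best score for aligning a[:i] with b[:j].
    F = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                F[i - 1][j] + gap,    # gap in b
                F[i][j - 1] + gap,    # gap in a
            )
    # The full length of both sequences is used, so read the last cell.
    return F[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

Because both sequences must be used in full, the score is read from the bottom-right cell of the table.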

Local Alignment

There can be multiple local alignments.

Considers subsequences from both query and target sequence that have positive scores.

Smith-Waterman algorithm.
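A minimal sketch of local alignment scoring follows. It differs from the global case in two ways: a cell is never allowed to drop below zero, and the result is the highest cell anywhere in the table. The function name and the scoring values are assumptions made for this example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (illustrative scoring values)."""
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the score table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                # start a fresh local alignment here
                prev[j - 1] + s,  # extend along the diagonal
                prev[j] + gap,    # gap in b
                curr[j - 1] + gap # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("XXACGTXX", "YYACGTYY"))
```

Only the subsequence with a positive score contributes; the mismatching flanks are ignored.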

Scoring

A scoring system is required as there can be multiple solutions. There will be optimal alignment(s) with the highest score, then suboptimal alignments with lower scores.

Percentage identity

Percent identity is simple to compute, but its significance depends on the length of the sequences.

DNA has an alphabet of 4 bases compared to 20 amino acids for protein, so chance matches are more likely in DNA. For DNA scoring, consider the redundant mapping of codons to amino acids.

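Above 30% identity is generally taken as significant, and the sequences are considered homologous.

20%-30% is the twilight zone: the sequences may be homologous.

Less than 20% is the midnight zone: nothing can be concluded from identity alone.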
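The thresholds above can be sketched as a small helper. This assumes the two sequences are already aligned (same length, with "-" for gaps) and that identity is counted over the full alignment length; other denominators are also in use, so treat this convention and the function names as assumptions.

```python
def percent_identity(aligned_x, aligned_y, gap="-"):
    """Percentage of alignment columns holding the same non-gap residue."""
    if len(aligned_x) != len(aligned_y):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_x:
        return 0.0
    matches = sum(
        1 for a, b in zip(aligned_x, aligned_y) if a == b and a != gap
    )
    return 100.0 * matches / len(aligned_x)


def identity_zone(pid):
    """Map a percent identity onto the rough zones described in the notes."""
    if pid >= 30:
        return "likely homologous"
    if pid >= 20:
        return "twilight zone"
    return "midnight zone"


pid = percent_identity("AC-T", "ACGT")
print(pid, identity_zone(pid))
```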

Dot-plot

An X-Y grid with one sequence on each axis and a dot marked wherever the residues are identical. Regions of matching residues appear as diagonal lines.

Noise can occur, and should be filtered by scoring over a moving window of configurable length.
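A minimal sketch of a dot-plot with a moving-window filter is shown below. A dot is placed at the start of a window only when enough positions inside the window match. The function names and the `min_matches` parameter are assumptions for this example.

```python
def dot_plot(x, y, window=1, min_matches=None):
    """Return a grid where grid[i][j] marks a matching window starting at x[i], y[j]."""
    if min_matches is None:
        min_matches = window  # by default require every position to match
    grid = [[False] * len(y) for _ in range(len(x))]
    for i in range(len(x) - window + 1):
        for j in range(len(y) - window + 1):
            matches = sum(x[i + k] == y[j + k] for k in range(window))
            if matches >= min_matches:
                grid[i][j] = True
    return grid


def render(grid):
    """Draw the grid as text, one row per residue of the first sequence."""
    return "\n".join("".join("*" if c else "." for c in row) for row in grid)


print(render(dot_plot("GATTACA", "GATTACA", window=1)))
print()
print(render(dot_plot("GATTACA", "GATTACA", window=3)))
```

With a window of 1 the isolated off-diagonal matches show up as noise; with a window of 3 only the main diagonal survives.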

A dot-plot can also compare a sequence with itself, e.g. in BRCA2 you can see repeats in the sequence as short diagonal lines.

Low-complexity regions will also appear as dark squares.

Exons can be seen when comparing cDNA (from RNA) with genomic DNA.

Substitution Matrices

A substitution matrix returns a score given alignment of two residues. The score between two residues is considered independently (previous / next residues are ignored).

Even if the residues are identical, the scores can be different (e.g. cysteine forms disulphide bonds and is very conserved, while alanine is less so).

Physicochemically similar pairs of residues score better than dissimilar pairs, e.g. D (aspartic acid) and E (glutamic acid) are similar.

The matrix chosen depends on whether closely related sequences or distant relationships are expected.
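The properties described above can be sketched with a small lookup table. The scores below are illustrative values for four residues only, chosen to show the pattern (identical pairs scoring differently, similar pairs scoring positively); they should not be relied on as a published matrix, and the names are assumptions for this example.

```python
# Illustrative scores only, keyed by the alphabetically sorted residue pair.
TOY_SCORES = {
    ("A", "A"): 4, ("A", "C"): 0, ("A", "D"): -2, ("A", "E"): -1,
    ("C", "C"): 9, ("C", "D"): -3, ("C", "E"): -4,
    ("D", "D"): 6, ("D", "E"): 2,
    ("E", "E"): 5,
}


def pair_score(a, b):
    """Score for aligning residues a and b; the matrix is symmetric."""
    return TOY_SCORES[tuple(sorted((a, b)))]


def ungapped_score(x, y):
    """Sum the per-position scores; neighbouring residues are ignored."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(x, y))


print(pair_score("C", "C"), pair_score("A", "A"), pair_score("D", "E"))
print(ungapped_score("ACDE", "ACED"))
```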

The matrix could score identity and highly conserved substitutions favourably, or be more favourable to moderately conserved substitutions.

PAM substitution matrix

Developed by Margaret Dayhoff, this matrix is derived from observed residue substitutions (accepted point mutations) in closely related proteins. From these, the probability of mutation for each residue over an evolutionary time period was calculated, then converted to logs so that scores can be summed.

The PAMn matrix corresponds to n accepted mutation events per 100 residues. PAM250 is best for detecting distant relationships, while PAM120 corresponds to a smaller number of mutations. n can be above 100 because some residues may be mutated multiple times.
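To see why n can exceed 100 without every residue having changed, the sketch below raises a toy one-step mutation probability matrix to the 250th power. The two-letter alphabet and the 1% mutation rate are hypothetical values for illustration, not Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy 1-PAM step: each residue has a 1% chance of mutating to the other one.
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]

pam250_toy = mat_pow(PAM1_TOY, 250)
# Probability that a residue is unchanged after 250 steps.
print(pam250_toy[0][0])
```

In this toy model the probability of a residue being unchanged stays above one half even after 250 steps, because repeated and reverse mutations hit the same positions.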

It performs better than relying on physicochemical properties of residues.

BLOSUM matrix

Derived from local multiple alignments rather than global alignments, using proteins from SWISS-PROT. Sequences are clustered at n percent identity for BLOSUM-n.

The scores are not comparable between BLOSUM tables.

Considerations in choosing a matrix

Short sequences should use matrices for short evolutionary time frames.

Anticipate the evolutionary distance before choosing a matrix.

Gaps

Gaps are penalised in the alignment.

Structural analysis has shown that gaps are rarely introduced within structural elements (alpha helices, beta strands); they occur mostly in the loops between them.

When a gap is added, it is usually more than one residue long.

Opening a gap incurs a penalty.

In the affine case, extending a gap has a different (usually smaller) penalty than opening it.

In the linear case, extending a gap has the same penalty as opening it.
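The two gap models can be compared with a short sketch. The penalty values are arbitrary, and the convention that the opening penalty covers the first gapped residue is an assumption; some tools charge open plus extend for the first residue instead.

```python
def linear_gap_penalty(length, per_residue=2):
    """Every gapped residue costs the same amount."""
    return per_residue * length


def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Opening costs gap_open; each further residue costs gap_extend."""
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


for n in (1, 3, 10):
    print(n, linear_gap_penalty(n), affine_gap_penalty(n))
```

With these values one gap of three residues costs 12 under the affine model, while three separate single-residue gaps cost 30, which favours fewer, longer gaps.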

Considerations for choosing a gap penalty

For a strict match, use a high penalty

For distant relations, use a low penalty.

It is possible to define a gap penalty that depends on the residue (e.g. tryptophan).

Relative likelihood

A score can be calculated from the relative likelihood that the sequences are related, as opposed to unrelated.

If we assume the two aligned sequences, x and y, are unrelated, then we can use the random model, R.

$$ P(x,y | R) = \prod_{i} p_{x_i} \prod_{i} p_{y_i} $$

This is the probability of seeing these sequences being aligned randomly. Each residue in x does not influence the probability of the residue in the corresponding position in y.

Alternatively, if they are related, we use another model, M, known as the match model.

In this case, $q_{ab}$ is the joint probability that residues a and b have each been derived from the same residue in a common ancestor.

$$ P(x,y | M) = \prod_{i} q_{x_i y_i} $$

The odds ratio is the ratio of the two likelihoods, or the odds of the alignment coming from the related model.

$$ \frac{\prod_{i} q_{x_i y_i}} {\prod_{i} p_{x_i} \prod_{i} p_{y_i}} = \prod_{i} \frac {q_{x_i y_i}} { p_{x_i} p_{y_i}} $$

To use additive scoring, it is converted into the 'log odds ratio', where

$$ s(a,b) = \log{\frac{q_{ab}}{p_a p_b}} $$

Now the score can be calculated as

$$ S = \sum_{i} s(x_i, y_i) $$

Here s(a,b) is the entry of the substitution matrix for residues a and b.
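The derivation above can be checked numerically with a toy example. The two-letter alphabet, the background frequencies p and the pair probabilities q are hypothetical values, and base-2 logs are an assumption (the formula above does not fix a base).

```python
import math

# Hypothetical background frequencies p_a for a two-letter alphabet.
p = {"A": 0.5, "B": 0.5}
# Hypothetical joint probabilities q_ab under the match model (sum to 1).
q = {("A", "A"): 0.4, ("A", "B"): 0.1,
     ("B", "A"): 0.1, ("B", "B"): 0.4}


def s(a, b):
    """Log-odds score for aligning residues a and b, in bits."""
    return math.log2(q[(a, b)] / (p[a] * p[b]))


def alignment_score(x, y):
    """Sum of log-odds scores over an ungapped alignment of x and y."""
    return sum(s(a, b) for a, b in zip(x, y))


def odds_ratio(x, y):
    """Product form of the odds ratio, for comparison with the summed score."""
    ratio = 1.0
    for a, b in zip(x, y):
        ratio *= q[(a, b)] / (p[a] * p[b])
    return ratio


x, y = "AAB", "ABB"
print(alignment_score(x, y), math.log2(odds_ratio(x, y)))
```

The summed score agrees with the log of the product of odds ratios, which is the point of moving to log odds: matches score positively and mismatches negatively, and the per-position terms can simply be added.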

References

Understanding Bioinformatics, chapters 4 and 5

Biological Sequence Analysis (Durbin, Eddy, Krogh, Mitchison), chapter 2