Multiple Sequence Alignment (MSA)

Aligns multiple sequences instead of just two as in pairwise alignment.

Application

  1. Find conserved residues in a family of proteins. This will indicate function.
  2. Find conserved regions. e.g. exons in mammalian genomes.
  3. Evolution information. Find distant members, annotate based on a alignments, add to existing families.

Advantages over pair wise alignment

Profiles are a statistic model that contain information about conserved residues at each position in the alignment.

  1. They are better and finding distant homologues than pairwise alignment.

Building

Options

  1. evolutionary relationship. This is the end goal.
  2. structural similarity.
  3. functional similarity
  4. sequence similarity. This is the easiest to implement with prior knowledge, and once closely aligned and inferred as homologous, evolution, structure and function can also be inferred.

Exhaustive search would be

$$ O(2^NL^N) $$

where N is the number of sequences to align

and L is the length of each sequence

Progressive alignment

This is gradual sequence addition to the alignment, it builds up the alignment one sequence at a time

The most similar sequences are aligned first, so a reliable intermediate alignment can be created.

Various algorithms either add sequence to existing alignment, or two alignments may be merged.

Score the subalignments

A guide tree is used, this represents the growing multiple alignments

Problems with progressive alignment

Greediness. Initial alignments will ignore the tree, so if earlier sequences are incorrectly aligned, the error will remain incorporated into the alignment.

Guide tree

A solution to greediness is the decide which sequences should be aligned first. A guide tree can be created using pair-wise distances and a clustering method like neighbour-joining or UPGMA.

It determines the order that sequences should be aligned.

ClustalW

  1. Create a distance matrix and guide tree using neighbour-joining.
  2. Score using sum-of-pairs

Sum of pairs score

Sum-of-pairs score is used to score the alignment of a sequence to a matrix.

Score each residue, but also multiply by the weights of the two sequences.

e.g. adding sequence C to previously aligned sequences A and B

$$ \frac {S(A,C)W(A)W(C) + S(B,C)W(B)W(C)} {2}$$

Weights are used because not all sequences are equally independent and useful. e.g. if two sequences are 99% identical sequence they should be given lower weights, alternatively if two sequences differ and are acceptable, they should be given higher weights.

Gaps in ClustalW

Opening a new gap incurs a penalty, reusing an existing gap is better.

Gap penalties are also modified based on

  1. increased for hydrophobic residue regions (structural), decreased for hydrophilic (loops)
  2. increased for closely related sequences
  3. increased for positions close to existing gaps, decreased on an existing gap

T-coffee

Considers info from all sequences to avoid the greedy issue.

Extra info such as structure helps in alignment.

PSSM

Position specific score matrix

Functional sequences are part of a family. To find new members, pairwise alignment could be used against one member, or each member.

The main disadvantage is that pairwise alignment tends to miss distant relatives.

Instead, use the whole family of sequences to search for new members.

A family may have conserved regions or residues. When adding a new member, align with these conserved regions.

Profile is a statistical model that includes conservations at key positions, and character distance between these positions. It is better at finding evolutionary diverse homologues than BLAST.

A standard profile will return probability of residue seen at a position.

Issue: a score of 100% is returned from a profile made with 2 or 2000 sequences with C in position 32.

Issue: Gaps need to be scored.

Issue: Pseudocounts need to be added (simple way is +1 count for each residue/base, even if it appears in the data.)

These are the same as substitution matrix calculations, but position is also added.

Sequence Logos

Visualises the residue conservation/preference at given positions in the alignment.

x-axis is the position

y-axis is amount of info (entropy units)

Positions with highly conserved residues have high information content.

When entropy is highest, information is lowest.

$$ H_u = - \sum_{a} f_{u,a} \log_2 f_{u,a} $$

where Hu is the entropy at position u.

a is a type of residue

When all residues are present in equal proportions, f = 1/20,

then

$$ H_u = \log_2 20 $$

Information is

$$ I_u = \log_2 20 - H_u $$

where I is the information

HMM

Hidden markov models

There are states. e.g. exon, intro, intergenic.

There are transitions between states. e.g. exon -> exon, exon -> intron, exon -> intergenic

Each transition has a probability. These should add up to 1 when leaving a state, and also when entering a state.

Each state has an emission (e.g. A, C, G or T), and each type of emission has a probability, and must sum to 1.

There is no one-to-one matching of state -> emissions

Just by looking at the missions, are not sure about the states.

Path refers to the sequence of states

Training set

This model needs a training set, i.e. a set of sequences with known paths at emissions. In the above example this would be an annotated piece of DNA, with labelled exons, introns and intergenic regions.

Transition probability between from state t to state u, can be estimated by

$$ t(u,v) = \frac {m_{t,u}} {\sum_{w} m_{t,w}} $$

where mi,j is the number of transitions between state i and j in the training data.

Emissions probability can be estimated in a similar way.

If there is no data for particular emission, then the HMM cannot align sequences with residues in that position. Pseudocounts are needed.

Insert state emission is not done in the same way.. it is from overall residue composition in the training set.

HMM for multiple sequence alignment

There are separate states for each position matched, and also states for inserts and deletions at each position.

Viterbi algorithm

Given a sequence of emissions, the Viterbi algorithm finds the most likely path of states taken.

Draw a grid with states on y-axis, and positions on x-axis.

Start filling in probabilities from the left, each cell must use probability from a previous state. Evaluate the most probable previous state by evaluating it from all previous paths and taking the max probability.

Pfam

The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).

HMMER

The matrix stores probabilities in negative log natural. The lowest scores mean highest probability.

Each position has three rows of scores.

  1. match emission
  2. insert emissions
  3. transition probability scores

You can see things like the most probable residues at each position

Most probably residues for an insertion.

Understanding Bioinformatics pp186

References

Understanding Bioinformatics

Understanding Bioinformatics page 201

SAG Lecture notes 3