Pattern recognition

Pattern recognition aims to recognise short common motifs believed to exist within a set of sequences.

Assume that:

  1. The pattern or motif is not yet known
  2. The pattern is expected to be short
  3. Degenerate
  4. within long sequences (e.g. TF binding sites may appear up to 1M bases away from target gene).

Enumeration, deterministic optimisation and probabilistic optimisation

Enumeration

Assume n is the length of the motif.

Create all subsequences of length n from the target sequences, and count all the different n-mers, and identify which n-mers are overrepresented.

Statistical power is an issue (needs correction -> lower power).

Enumeration may miss out on degenerate motifs, but this may be somewhat fixed by clustering or combining similar n-mers.

Gibbs Sampler

Assume n is the length of the motif.

  1. Randomly select n-mers from each target sequence. This is our 'guess' set of the motif locations.
  2. From the guess set, remove n-mer that originated from sequence 1.
  3. Position Specific Scoring Matrix (PSSM) on remaining set of n-mers
  4. Calculate likelihoods of all possible n-mers from sequence 1 against the PSSM.
  5. Pick a n-mer from sequence 1 based on likelihood score. We are not guaranteed to pick the best scoring one, just more likely to pick it.
  6. Add this new n-mer from sequence 1 into the guess set.
  7. Repeat for sequence 2 etc..
  8. Eventually it will converge.

This is stochastic monte carlo algorithm. It is not guaranteed to converge to the same set each time. Run it multiple times so it has more chances to stumble upon the correct ones.

The PSSM weights help it to converge.

AlignACE is a example program based on this algo.