Functional prediction of DNA

Sequence signals

In DNA motifs, Y is a pyrimidine (C or T) and R is a purine (A or G). W is a weak base (A or T); these pair with two hydrogen bonds rather than three.

Genes can be found by looking for their promoter elements near the transcription start site. Examples include

  1. TATA box
  2. Initiator element (Inr), YYANWYY
  3. Downstream promoter element (DPE)
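As a rough illustration of searching for a degenerate motif such as the Inr consensus, the sketch below expands IUPAC codes into a regular expression. The function names `motif_to_regex` and `find_motif` are made up for this example, and only a subset of the IUPAC codes is included.

```python
import re

# Subset of IUPAC nucleotide codes, mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",    # purine
    "Y": "[CT]",    # pyrimidine
    "W": "[AT]",    # weak (two hydrogen bonds)
    "S": "[CG]",    # strong (three hydrogen bonds)
    "N": "[ACGT]",  # any base
}

def motif_to_regex(motif):
    """Translate an IUPAC motif such as YYANWYY into a compiled regex."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))

def find_motif(sequence, motif):
    """Return 0-based start positions of every match, including overlapping ones."""
    # A lookahead lets the regex engine report overlapping matches.
    pattern = re.compile("(?=(" + motif_to_regex(motif).pattern + "))")
    return [m.start() for m in pattern.finditer(sequence.upper())]

print(find_motif("GGCTATTCCGG", "YYANWYY"))
```

A short motif like this matches very often by chance, so hits are only meaningful in combination with other signals.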

GC content in genes is often higher than in the rest of the genome.
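A minimal sketch of how GC content could be scanned along a sequence. The window and step sizes are arbitrary assumptions, not values from these notes.

```python
def gc_content(sequence):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

def gc_windows(sequence, window=100, step=50):
    """Yield (start, gc_fraction) for each full-length window along the sequence."""
    for start in range(0, len(sequence) - window + 1, step):
        yield start, gc_content(sequence[start:start + window])

# Toy example with a tiny window so the result is easy to check by eye.
print(list(gc_windows("AAAAGGGG", window=4, step=4)))
```

Windows whose GC fraction is well above the genome-wide average would be candidate gene-rich regions.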

Introns begin with a donor site, GU (GT in the DNA sequence), and end with an acceptor site, AG, which follows a pyrimidine-rich stretch (A-py{3}-AG).

The 3' end of the gene carries the polyadenylation signal, AATAAA; the poly(A) tail is added a short distance downstream of it.

Open reading frames (ORFs)

The genetic code (from DNA to residue) is redundant. Certain codons are more frequent. The third base in a codon often varies for the same amino acid.

Codons are three bases long, so each strand has three possible reading frames.

DNA can be read on either strand, giving six reading frames in total.

When exons are concatenated, the reading frame continues across the junction. An intron can therefore sit inside a reading frame, and can even split a codon.

It is possible to look for long ORFs based on the start codon (ATG) and the stop codons (TAA, TAG, TGA). Since 3 of the 64 codons are stops, random sequence hits a stop about once every 21 codons, so a long ORF is unlikely to occur by chance and is probably a gene.

Very short genes will not be detected in this way.
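The ORF scan described above can be sketched as follows. This is a simplified illustration: the `min_codons` threshold is an assumed parameter, only the first ATG in each open stretch is used, and minus-strand coordinates are reported relative to the reverse complement rather than mapped back to the original strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(sequence):
    """Return the reverse complement of an uppercase DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]

def orfs_in_strand(sequence, min_codons):
    """Yield (start, end) of ATG...stop ORFs in the three frames of one strand."""
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                # Length is counted in codons and includes the stop codon.
                if (i + 3 - start) // 3 >= min_codons:
                    yield start, i + 3
                start = None                   # close it and keep scanning

def find_orfs(sequence, min_codons=100):
    """Scan all six frames; returns (strand, start, end) tuples."""
    sequence = sequence.upper()
    forward = [("+", s, e) for s, e in orfs_in_strand(sequence, min_codons)]
    reverse = [("-", s, e)
               for s, e in orfs_in_strand(reverse_complement(sequence), min_codons)]
    return forward + reverse

print(find_orfs("CCATGAAATTTTAGCC", min_codons=3))
```

Raising `min_codons` trades sensitivity for specificity: fewer chance ORFs are reported, but short genes are missed.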

Codon frequency differs

  1. Some codons are rarer than others
  2. Codon frequency differs between organisms
  3. The third codon base is often repeated
  4. The third codon base is biased

We can consider these biases, but on their own they are usually not strong enough to identify genes.
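As a small sketch of how such biases might be measured, the code below counts in-frame codons and the base composition at the third codon position. The input is assumed to be a coding sequence already in frame; the function names are illustrative.

```python
from collections import Counter

def codon_counts(cds):
    """Count in-frame codons of a coding sequence, ignoring a trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

def third_base_frequencies(cds):
    """Relative frequency of each base at the third codon position."""
    counts = Counter()
    for codon, n in codon_counts(cds).items():
        counts[codon[2]] += n
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()} if total else {}

print(codon_counts("ATGGCCGCCTAA"))
print(third_base_frequencies("ATGGCCGCCTAA"))
```

Comparing these frequencies between a candidate ORF and known genes of the same organism gives a rough measure of how gene-like the candidate is.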

A hexamer-based model considers biases in overall base and residue composition, as well as in residue pairs and codon pairs.

Accuracy of Prediction

Consider predicting which of all the open reading frames are real exons.

Each prediction is a true positive (TP), false positive (FP), true negative (TN) or false negative (FN).

$$ Sensitivity = \frac {TP} {TP + FN} $$

This is how good the prediction is at identifying actual positive results.

$$ Specificity = \frac {TN} {TN + FP} $$

This is how good the prediction is at identifying negative results.
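A worked sketch of the two formulas, assuming predictions and annotations are compared per base as booleans (True means exon). The per-base comparison is an assumption of this example; accuracy can also be measured per exon.

```python
def confusion_counts(predicted, actual):
    """Return (tp, fp, tn, fn) for two equal-length sequences of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """TP / (TP + FN): the fraction of real positives that were found."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """TN / (TN + FP): the fraction of real negatives that were rejected."""
    return tn / (tn + fp)

predicted = [True, True, False, False, True]
actual = [True, False, False, True, True]
tp, fp, tn, fn = confusion_counts(predicted, actual)
print(sensitivity(tp, fn), specificity(tn, fp))
```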

Hidden Markov models (HMMs) can be used for gene prediction.

GENSCAN models introns, exons and intergenic regions, together with transcription, translation and splicing signals.

There are three copies of the exon, intron, donor and acceptor states, one for each phase of the reading frame.
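To illustrate how an HMM assigns a label to every base, here is a toy two-state Viterbi decoder. It is not GENSCAN's model: the states ("N" for non-coding, "C" for coding) and all probabilities are invented for the example, with the coding state simply preferring G and C.

```python
import math

# Toy parameters, assumed for illustration only.
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable state path for a non-empty sequence, computed in log space."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    back = []
    for symbol in observations[1:]:
        prev = scores[-1]
        column, pointers = {}, {}
        for s in states:
            # Best predecessor state for arriving in state s at this position.
            best = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            column[s] = (prev[best] + math.log(trans_p[best][s])
                         + math.log(emit_p[s][symbol]))
            pointers[s] = best
        scores.append(column)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print("".join(viterbi("AATTGGCC", STATES, START, TRANS, EMIT)))
```

A real gene finder uses many more states (including the phase-specific ones) and parameters estimated from annotated genes.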


Sequence Similarity

Expressed sequence tags (ESTs) represent expressed genes that have been transcribed and spliced.

Aligning these to the genome can identify exonic regions.

Comparing the genomes of two closely related species can show regions of sequence similarity. If a DNA region is found to be conserved, then it is probably under negative selection and used by the organism. If it is already annotated and known to be expressed, then it probably corresponds to coding exons.

Comparing the translated residues (in all six frames) against a protein database or a closely related species can also identify probable coding regions.
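A sketch of the six-frame translation step that precedes such a protein comparison. It assumes the standard genetic code; the frame labels ("+1" to "-3") are a naming choice made for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(sequence):
    """Translate complete codons; codons containing unknown bases become 'X'."""
    return "".join(CODON_TABLE.get(sequence[i:i + 3], "X")
                   for i in range(0, len(sequence) - 2, 3))

def six_frame_translations(sequence):
    """Return a dict mapping frame labels to translated residues ('*' marks a stop)."""
    sequence = sequence.upper()
    reverse = sequence.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(sequence[offset:])
        frames["-%d" % (offset + 1)] = translate(reverse[offset:])
    return frames

print(six_frame_translations("ATGGCCTAA"))
```

Each of the six translations would then be searched against the protein database with an alignment tool.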

Micro RNA prediction

MicroRNAs (miRNAs) are short RNA molecules about 22 nucleotides long.

They have roles in regulating gene expression.

miRNAs are part of the RNA-induced silencing complex (RISC), which either represses translation or promotes degradation of the target mRNA.

miRNAs can come from a separate gene, or from introns of other genes.

Their precursors fold into hairpins, so the search looks for hairpin structures, scored by base pairing and folding energy.
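The hairpin search can be sketched very crudely by counting base pairs between the two arms of a fixed-size window. This is a stand-in for a proper folding-energy calculation, which dedicated RNA folding software would perform. The stem length, loop length and pair threshold below are arbitrary assumptions.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def stem_pairs(rna, stem=8, loop=4):
    """Count paired positions when the first `stem` bases fold back onto the
    `stem` bases that follow the loop."""
    left = rna[:stem]
    right = rna[stem + loop:stem + loop + stem][::-1]
    return sum((a, b) in PAIRS for a, b in zip(left, right))

def hairpin_candidates(sequence, stem=8, loop=4, min_pairs=7):
    """Return start positions of windows whose arms pair at >= min_pairs positions."""
    rna = sequence.upper().replace("T", "U")
    size = 2 * stem + loop
    return [i for i in range(len(rna) - size + 1)
            if stem_pairs(rna[i:i + size], stem, loop) >= min_pairs]

print(hairpin_candidates("GCATGCATAAAAATGCATGC"))
```

Because so many windows pair reasonably well by chance, a scan like this over-predicts heavily.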

Such a search predicted 11M candidate hairpins in the human genome, and many of these are not even expressed.

To reduce the number of results, filter by:

  1. Homology and conservation with known RNAs
  2. Ignoring exons and repeats (but you might miss some)
  3. Clustering: miRNA hairpins are found in similar locations
  4. Searching for a corresponding target

Regulatory regions

Cluster TF binding sites

This works because binding sites are clustered. Searching for just one TF binding site motif gives too many hits.
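One simple way to turn individual motif hits into clusters is sketched below. The gap and minimum-site thresholds are assumed values, and the hit positions are taken as given (for example from a motif scan).

```python
def cluster_hits(positions, max_gap=50, min_sites=3):
    """Group hit positions into clusters whose neighbouring hits are at most
    max_gap bases apart, keeping clusters with at least min_sites hits."""
    clusters, current = [], []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                clusters.append(current)
            current = []            # gap too large: start a new cluster
        current.append(pos)
    if len(current) >= min_sites:   # do not forget the final cluster
        clusters.append(current)
    return clusters

print(cluster_hits([10, 30, 55, 400, 900, 920, 930, 2000]))
```

Isolated hits (here at 400 and 2000) are discarded, which removes most chance matches.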

Phylogenetic footprinting

This is used for discovering transcription factor binding sites.

Align genomes from different species. The conserved non-coding regions are candidates for regulatory regions. This works on the assumption that the short TF binding sites are conserved between species, and the important non-coding DNA sequences will display slower evolution rates compared with the surrounding bases.
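A minimal sketch of the footprinting idea, assuming the sequences have already been aligned to equal length with "-" for gaps. It reports runs of columns where every species has the same base. Requiring perfect identity and the `min_length` cutoff are simplifications chosen for this example.

```python
def column_conserved(column):
    """True if every sequence has the same non-gap base in this alignment column."""
    return "-" not in column and len(set(column)) == 1

def conserved_blocks(alignment, min_length=6):
    """Return half-open (start, end) column ranges where all sequences agree."""
    blocks, start = [], None
    length = len(alignment[0])
    # Iterate one step past the end so a block touching the last column is closed.
    for i in range(length + 1):
        conserved = i < length and column_conserved([seq[i] for seq in alignment])
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_length:
                blocks.append((start, i))
            start = None
    return blocks

alignment = [
    "ACGTTGACGTAA",
    "TCGATGACGTCA",
    "GCG-TGACGTGA",
]
print(conserved_blocks(alignment))
```

Blocks that fall in non-coding sequence would be the candidate regulatory sites.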

Issues include

  1. Divergence of regulatory mechanisms between species
  2. Very short sites may lead to false positives
  3. Species-specific knowledge is needed
  4. More data is required (more genomes)