Transformational Grammars

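Transformational grammars are sometimes called generative grammars, because they generate sequences: rewriting rules are applied repeatedly to a start symbol until a finished string is produced.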

A grammar contains symbols of two types: terminals and nonterminals.

Terminals do not expand further; they appear in the observed string. In our case they are typically residues or nucleotides.

Nonterminals are abstract symbols that can be transformed (rewritten) into other strings of symbols.

The grammar is defined by production rules.

$$ S \rightarrow aS $$

$$ S \rightarrow bS $$

$$ S \rightarrow \epsilon $$

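Here ε denotes the empty string, so the last rule ends the derivation. The rules can be written on one line, with alternatives separated by |.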

$$ S \rightarrow aS | bS | \epsilon $$

A sequence matches the grammar if it could be generated by it. The grammar above generates every string of a's and b's, including the empty string.
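As a rough illustration of how a derivation proceeds, the sketch below rewrites the leftmost nonterminal until only terminals remain. The names `RULES`, `generate` and `matches` are hypothetical, and the convention that uppercase letters are nonterminals and the empty string stands for ε is an assumption made for this example.

```python
import random

# Production rules for S -> aS | bS | epsilon.
# Assumption: uppercase letters are nonterminals, lowercase letters are
# terminals, and the empty string "" stands for epsilon.
RULES = {"S": ["aS", "bS", ""]}


def generate(rules, start="S", rng=None):
    """Derive one string by repeatedly rewriting the leftmost nonterminal."""
    rng = rng or random.Random()
    current = start
    while any(symbol in rules for symbol in current):
        # Locate the leftmost nonterminal and replace it with a random option.
        i = next(k for k, symbol in enumerate(current) if symbol in rules)
        current = current[:i] + rng.choice(rules[current[i]]) + current[i + 1:]
    return current


def matches(sequence):
    """For this particular grammar, any string over {a, b} can be generated."""
    return all(ch in "ab" for ch in sequence)


if __name__ == "__main__":
    print(repr(generate(RULES, rng=random.Random(1))))
```

Each rewriting step picks the terminating rule with probability 1/3, so the loop ends quickly in practice. Every string it returns satisfies `matches`.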

Regular grammar

A regular grammar is built only from production rules of the form W -> aW or W -> a, where W is a nonterminal and a is a terminal.

For example, the motif pattern

[RK]G[LIVA]

can be written as the regular grammar

$$ S \rightarrow rW | kW $$

$$ W \rightarrow gX $$

$$ X \rightarrow l | i | v | a $$
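The motif grammar can be checked against a sequence by reading one residue at a time and tracking which nonterminals are still possible, much as a finite automaton would. This is a minimal sketch: the `GRAMMAR` encoding (pairs of terminal and next nonterminal, with `None` marking the end of the derivation) and the function name `matches` are assumptions made for illustration.

```python
# Each production is (terminal, next_nonterminal).
# Assumption: None as the next nonterminal means the derivation ends
# after emitting the terminal (rules of the form W -> a).
GRAMMAR = {
    "S": [("r", "W"), ("k", "W")],
    "W": [("g", "X")],
    "X": [("l", None), ("i", None), ("v", None), ("a", None)],
}


def matches(sequence, grammar=GRAMMAR, start="S"):
    """Return True if the whole sequence can be generated from `start`."""
    states = {start}
    for ch in sequence.lower():
        # Follow every rule that emits this residue from a live nonterminal.
        states = {
            nxt
            for state in states
            if state is not None
            for terminal, nxt in grammar[state]
            if terminal == ch
        }
        if not states:
            return False
    # The derivation must have terminated exactly at the end of the sequence.
    return None in states
```

With this encoding, `matches("RGL")` and `matches("KGA")` are true, while `matches("RGLA")` and `matches("AGL")` are false.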

Context-free grammar

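In a context-free grammar, the left side of every production rule must be exactly one nonterminal, while the right side may be any string of terminals and nonterminals. Rules of the form W -> aSb are therefore allowed, so it is possible to generate palindrome-like strings in which positions are paired in a nested fashion.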

For example, an RNA stem loop with three base pairs closed by a GAAA or GCAA loop can be generated by

$$ S \rightarrow aW1u | cW1g | gW1c | uW1a $$

$$ W1 \rightarrow aW2u | cW2g | gW2c | uW2a $$

$$ W2 \rightarrow aW3u | cW3g | gW3c | uW3a $$

$$ W3 \rightarrow gaaa | gcaa $$
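Because each pairing rule emits one base on the left and its partner on the right, a recogniser for this grammar can work from the outside in. The sketch below is one way to write that recursion. The dictionary layout and the function name `derives` are hypothetical, while the rules themselves are those given above.

```python
# Watson-Crick pairs used by the pairing rules above.
PAIRS = [("a", "u"), ("c", "g"), ("g", "c"), ("u", "a")]

# Pairing rules are tuples (left terminal, inner nonterminal, right terminal).
# Loop rules are plain strings of terminals.
GRAMMAR = {
    "S": [(left, "W1", right) for left, right in PAIRS],
    "W1": [(left, "W2", right) for left, right in PAIRS],
    "W2": [(left, "W3", right) for left, right in PAIRS],
    "W3": ["gaaa", "gcaa"],
}


def derives(symbol, seq):
    """Return True if nonterminal `symbol` can generate exactly `seq`."""
    for rule in GRAMMAR[symbol]:
        if isinstance(rule, str):
            # Loop rule: the remaining sequence must equal the loop.
            if seq == rule:
                return True
        else:
            left, inner, right = rule
            # Pairing rule: strip the outer pair and recurse on the inside.
            if (len(seq) >= 2 and seq[0] == left and seq[-1] == right
                    and derives(inner, seq[1:-1])):
                return True
    return False
```

For instance, `derives("S", "gcagaaaugc")` is true: the outer pairs g-c, c-g and a-u enclose the loop gaaa. Changing the last base to g breaks the outermost pair, so that sequence is rejected.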

Stochastic context-free grammars (SCFG)

A deterministic pattern or grammar either matches a sequence or does not. When new family members are found, more exceptions must be added to the pattern, so it grows larger and loses specificity. It may begin to match unrelated sequences, especially if the family is diverse.

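This leads to the idea of probabilistic (or stochastic) grammars. Each production option is associated with a probability, and the probabilities of the options for one nonterminal sum to 1, so a sequence receives a probability rather than a yes/no match.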
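To make the idea concrete, the sketch below attaches probabilities to a small stem-loop grammar (three base pairs around a gaaa or gcaa loop) and multiplies them along the derivation of a sequence. The probability values are invented purely for illustration and are not estimated from any data. The names `PAIRS`, `GRAMMAR` and `probability` are likewise hypothetical.

```python
# Illustrative (made-up) probabilities for each pairing option; they sum to 1.
PAIRS = {("a", "u"): 0.2, ("c", "g"): 0.3, ("g", "c"): 0.3, ("u", "a"): 0.2}

# Each entry is (rule, probability). Pairing rules are tuples
# (left terminal, inner nonterminal, right terminal); loop rules are strings.
GRAMMAR = {
    "S": [((l, "W1", r), p) for (l, r), p in PAIRS.items()],
    "W1": [((l, "W2", r), p) for (l, r), p in PAIRS.items()],
    "W2": [((l, "W3", r), p) for (l, r), p in PAIRS.items()],
    "W3": [("gaaa", 0.6), ("gcaa", 0.4)],  # made-up loop probabilities
}


def probability(symbol, seq):
    """Probability that nonterminal `symbol` generates exactly `seq`."""
    total = 0.0
    for rule, p in GRAMMAR[symbol]:
        if isinstance(rule, str):
            if seq == rule:
                total += p
        else:
            left, inner, right = rule
            if len(seq) >= 2 and seq[0] == left and seq[-1] == right:
                # Multiply this rule's probability by that of the inner part.
                total += p * probability(inner, seq[1:-1])
    return total
```

Under these assumed values, `probability("S", "gcagaaaugc")` is the product 0.3 × 0.3 × 0.2 × 0.6, and any sequence the grammar cannot generate gets probability 0. Summed over every sequence in the language, the probabilities add up to 1.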

RNA families can be modelled using covariance models, which are SCFG-based profiles. These are better suited to RNA than HMMs because they can represent secondary structure constraints, such as the correlation between base-paired positions.

Rfam

The Rfam database is a collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models (CMs).

A covariance model is built from an MSA that has been manually curated.

Sequence databases are then searched using the covariance model, and the additional members found are added to the Rfam family.

References

Biological sequence analysis - REDK