Secondary Protein Structure Prediction

Secondary structure prediction assigns each amino acid in the primary sequence to an α-helix, β-sheet, β-turn or loop.

Developing a method

  1. Train the method
  2. Test the method and obtain results
  3. Measure accuracy of the results
  4. Compare to existing methods for accuracy

Measuring accuracy

Q3

The Q3 score is the percentage of residues that a method has correctly assigned as α-helix, β-strand or loop.

Q3 does not distinguish between kinds of error: a prediction whose α-helices are too long or too short may score the same as one that incorrectly identifies β-strands in place of α-helices.
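A minimal sketch of the Q3 calculation, assuming predictions and observed structures are given as equal-length strings over the letters H (helix), E (strand) and C (loop). The letter coding, the function name q3_score and the toy strings are assumptions made for illustration. The two predictions make different kinds of error but get the same score.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Toy data (made up for illustration): H = helix, E = strand, C = loop.
observed = "CCHHHHHHHHCC"
too_short = "CCCCHHHHHHCC"   # helix found, but two residues too short
wrong_type = "CCEEHHHHHHCC"  # start of the helix called a strand instead

print(q3_score(too_short, observed))
print(q3_score(wrong_type, observed))
```

Both calls count ten matching residues out of twelve, so Q3 alone cannot tell these two predictions apart.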

SOV

The segment overlap (SOV) score is based on the fractional overlap of predicted and observed segments, rather than on individual residues.
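The idea of scoring by segment can be sketched as below. This is a simplified fractional-overlap measure, not the full published SOV definition: it leaves out the tolerance term that the published score adds to the overlap. The H/E/C letter coding and the function names are assumptions made for illustration.

```python
from itertools import groupby


def segments(states, state):
    """Return inclusive (start, end) index pairs for each run of `state`."""
    found, pos = [], 0
    for key, run in groupby(states):
        length = len(list(run))
        if key == state:
            found.append((pos, pos + length - 1))
        pos += length
    return found


def segment_overlap(predicted, observed, state):
    """Length-weighted mean of overlap/extent over segment pairs, as a percentage."""
    total, norm = 0.0, 0
    predicted_segments = segments(predicted, state)
    for o_start, o_end in segments(observed, state):
        length = o_end - o_start + 1
        partners = [(s, e) for s, e in predicted_segments
                    if s <= o_end and e >= o_start]
        if not partners:
            norm += length  # a completely missed segment still counts against the score
        for p_start, p_end in partners:
            overlap = min(o_end, p_end) - max(o_start, p_start) + 1
            extent = max(o_end, p_end) - min(o_start, p_start) + 1
            total += overlap / extent * length
            norm += length
    if norm == 0:
        raise ValueError("observed structure contains no segment of this state")
    return 100.0 * total / norm


print(segment_overlap("CCCCHHHHHHCC", "CCHHHHHHHHCC", "H"))
```

Here the predicted helix covers six of the eight positions spanned by the two segments together, so the whole segment is scored as a unit rather than residue by residue.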

Statistical methods

Statistical analyses use large data sets of solved structures. They correlate features of the known structures and sequences to create statistical rules.

Chou and Fasman

Based on data, each amino acid is assigned scores or categories for:

  1. α-helix designation: one of F (strong former), f (weak former), B (strong breaker), b (weak breaker) or I (indifferent)
  2. α-helix propensity: a number; higher means more likely to form
  3. β-strand designation: as above
  4. β-strand propensity: as above

Self-information is the conformational preference a residue has in isolation.

Short segments are tested for their potential to form α-helices or β-strands: a segment qualifies where its average propensity is above a certain threshold. Segments can be cut short by strong breakers.
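The threshold test on short segments can be sketched as a sliding-window average. The propensity values below are illustrative placeholders, not the published Chou and Fasman table, and the window length of 6 and threshold of 1.05 are likewise assumptions chosen for the example. Cutting segments at strong breakers is not modelled.

```python
# Illustrative helix propensities only; NOT the published Chou-Fasman values.
HELIX_PROPENSITY = {"A": 1.4, "E": 1.5, "L": 1.2, "K": 1.1,
                    "S": 0.8, "G": 0.6, "P": 0.6}


def helix_windows(sequence, window=6, threshold=1.05):
    """Return start indices of windows whose mean helix propensity exceeds the threshold."""
    hits = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        mean = sum(HELIX_PROPENSITY[aa] for aa in segment) / window
        if mean > threshold:
            hits.append(start)
    return hits


# Toy sequence: six glycines followed by six alanines.
print(helix_windows("GGGGGGAAAAAA"))
```

Only the windows dominated by the high-propensity residue pass the threshold; the same loop with a β-strand table would give the strand candidates.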

GOR

GOR takes into account the residue's self-information, as well as information about the residues up to 8 positions on either side of it.

For example, directional information: a helix-breaking proline at position i+2 makes it less likely that a helix will form at i.

Pair information takes into account the residues at positions i and j together.
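A minimal sketch of how self-information and directional information could be combined: for each candidate state, sum a contribution from every residue within 8 positions of i and pick the state with the highest total. The lookup table TOY_INFO holds made-up numbers for illustration, not real GOR parameters, and the function name and H/E/C coding are assumptions. Pair information is not modelled here.

```python
WINDOW = 8  # residues considered on each side of position i

# Hypothetical table: TOY_INFO[state][offset][amino_acid] -> contribution.
# Offset 0 is self-information; other offsets are directional information.
TOY_INFO = {
    "H": {0: {"A": 1.0}, 2: {"P": -2.0}},  # a proline at i+2 disfavours helix at i
    "E": {},
    "C": {0: {"A": 0.1, "P": 0.1}},
}


def predict_state(sequence, i, info, states="HEC"):
    """Return the state with the highest summed contribution at position i."""
    scores = {}
    for state in states:
        total = 0.0
        for offset in range(-WINDOW, WINDOW + 1):
            j = i + offset
            if 0 <= j < len(sequence):  # ignore offsets that fall off the sequence ends
                total += info[state].get(offset, {}).get(sequence[j], 0.0)
        scores[state] = total
    return max(scores, key=scores.get)


print(predict_state("AAAAA", 2, TOY_INFO))
print(predict_state("AAAAP", 2, TOY_INFO))
```

With the toy table, replacing the residue at i+2 with proline is enough to push the helix score below the loop score at position i.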

Knowledge Based methods

Uses statistical methods, but also incorporates knowledge about physicochemical properties such as size and shape.

Nearest neighbour

This assumes that similar short stretches of sequence may have the same secondary structure, even if the proteins are non-homologous.

For known structures, a sliding window (17 residues) is moved along the sequence, and the structural type of the central residue is recorded.
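The sliding-window idea can be sketched as follows. The example uses a window of 5 instead of 17 so the toy data stay short. The sequence and structure strings are made up, and scoring similarity by counting identical positions is an assumption made for simplicity; real methods may score window similarity differently.

```python
from collections import Counter


def build_library(sequence, structure, window=17):
    """Record each full window together with the state of its central residue."""
    half = window // 2
    return [(sequence[i - half:i + half + 1], structure[i])
            for i in range(half, len(sequence) - half)]


def similarity(a, b):
    """Number of positions at which two windows hold the same residue."""
    return sum(x == y for x, y in zip(a, b))


def predict_centre(query_window, library, k=3):
    """Vote among the k library windows most similar to the query."""
    ranked = sorted(library,
                    key=lambda entry: similarity(query_window, entry[0]),
                    reverse=True)
    votes = Counter(state for _, state in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy "known structure" (made up): H = helix, E = strand, C = loop.
library = build_library("AEAAKAGSGVTVKV", "HHHHHHCCCEEEEE", window=5)

print(predict_centre("AEAAR", library))
print(predict_centre("VTVRV", library))
```

In practice the library would be built from many solved structures, so a query window can find neighbours in unrelated proteins.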

Nearest Neighbor Methods at bioinfo.mbb.yale.edu

PREDATOR uses this.

Sequence Alignment

Zpred uses a multiple alignment of homologous sequences to find solved proteins and their secondary structures. Where the aligned sequences differ, it scores each position with a conservation number, calculated from how similar the residues' physicochemical properties are.

Machine-learning

The data are fed into a neural network.

PsiPred

Multiple sequence alignments are fed into a neural net.

Consensus methods

Compares predictions from multiple methods.
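One simple way to compare predictions is a per-residue majority vote, sketched below. This is an illustrative assumption about how a consensus could be formed, not a description of any particular published consensus method. Ties go to the method listed first.

```python
from collections import Counter


def consensus(predictions):
    """Combine equal-length prediction strings by per-residue majority vote."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        # most_common keeps first-seen order among equal counts,
        # so a tie is resolved in favour of the earliest-listed method.
        result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)


# Toy predictions from three hypothetical methods.
print(consensus(["CCHHHHCC", "CHHHHHCC", "CCHHEECC"]))
```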

References

Lecture 4 of Structural Bioinformatics

Understanding Bioinformatics, Chapter 11