Tertiary Protein Structure Prediction

A protein's three-dimensional structure helps determine its function.

Residues far apart in the sequence may be close together in the folded protein, and tertiary structure is used to determine binding sites.

Structures are solved experimentally by X-ray crystallography or NMR, but both are expensive and not always feasible. For crystallography the protein may not crystallize, and for NMR it may not be soluble.

Ab Initio

The ab initio approach predicts structure from sequence alone. It assumes that a protein folds into its native state at the lowest-energy conformation. To find this state, possible conformations are evaluated using their physicochemical and thermodynamic properties.

There are a vast number of possible conformations, and it is very computationally expensive to evaluate them all.
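To get a feel for the scale, the count can be estimated with a back-of-the-envelope calculation. The sketch below assumes three backbone states per residue and 10^12 conformations evaluated per second. Both numbers are illustrative assumptions, not values from these notes.

```python
SECONDS_PER_YEAR = 3600 * 24 * 365


def conformation_count(n_residues, states_per_residue=3):
    """Conformations if every residue independently adopts one of a few states.

    states_per_residue=3 is an illustrative assumption.
    """
    return states_per_residue ** n_residues


def exhaustive_search_years(n_residues, states_per_residue=3, evals_per_second=1e12):
    """Years needed to score every conformation once at a hypothetical rate."""
    seconds = conformation_count(n_residues, states_per_residue) / evals_per_second
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, conformation_count(n), exhaustive_search_years(n))
```

Even under these generous assumptions the count grows exponentially with chain length, which is why exhaustive enumeration is not an option for real proteins.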

There will be local energy minima, so it is not as easy as incrementally modifying the conformation towards the correct state. Picture a landscape with many peaks and valleys (wells): a search that only moves downhill can get stuck in a valley that is not the deepest one.
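The local minimum problem can be illustrated on a toy one-dimensional landscape. The energy function below is made up for the example (it is not a protein force field), and simple gradient descent with a few fixed starting points stands in for a real conformational search.

```python
def energy(x):
    """Toy 1-D energy landscape with one local and one global minimum."""
    return x ** 4 - 3 * x ** 2 + x


def gradient(x):
    """Analytical derivative of the toy energy function."""
    return 4 * x ** 3 - 6 * x + 1


def descend(x, step=0.01, n_steps=2000):
    """Plain gradient descent: always move a little downhill from x."""
    for _ in range(n_steps):
        x -= step * gradient(x)
    return x


def multi_start(starts):
    """Run descent from several start points and keep the lowest-energy result."""
    return min((descend(s) for s in starts), key=energy)


x_stuck = descend(2.0)    # starts on the right, ends in the shallower well
x_deep = descend(-2.0)    # starts on the left, ends in the deeper well
x_best = multi_start([-2.0, -1.0, 0.0, 1.0, 2.0])
```

Starting from the right-hand side, the search settles in the shallower well and never sees the deeper one. Restarting from several points is one simple workaround; it is shown here only to make the problem concrete.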

An energy function must be defined. It may include terms for phi and psi angles, bond lengths, etc. These all contribute to the force field (energy function) value of a particular conformation and must be calculated. For example, PROCHECK can score the stereochemical features of a conformation.
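As a minimal sketch of what such terms look like, the code below computes a dihedral angle (the quantity behind phi and psi) from four atom positions, plus a harmonic bond-length term and a periodic torsion term. The functional forms are common textbook choices; the force constants and default parameters are placeholders, not values from any real force field.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points (e.g. backbone atoms)."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def bond_energy(length, ideal_length, k=100.0):
    """Harmonic penalty for deviating from an ideal bond length (k is a placeholder)."""
    return k * (length - ideal_length) ** 2


def torsion_energy(angle_deg, k=1.0, periodicity=3, phase_deg=0.0):
    """Periodic torsion term k * (1 + cos(n * angle - phase)); parameters are placeholders."""
    return k * (1 + math.cos(math.radians(periodicity * angle_deg - phase_deg)))
```

A full energy function would sum many such terms over every bond, angle, torsion and non-bonded atom pair in the conformation.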

Different portions of the protein may be in high-energy (unfavourable) or low-energy (favourable) conformations.

Solvent needs to be taken into account or approximated, e.g. to reproduce the hydrophobic core of a protein.

Molecular dynamics predicts the movement of atoms over time steps as short as 10^-15 seconds (1 femtosecond). It is very computationally expensive.
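The core loop of a molecular dynamics run can be sketched with a single particle on a spring. The velocity Verlet integrator used here is a common choice but is not named in these notes, and the units are arbitrary; a real simulation applies the same kind of update to every atom using forces from the full force field.

```python
def spring_force(x, k=1.0):
    """Restoring force of a harmonic spring centred at x = 0."""
    return -k * x


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equations for one particle in one dimension.

    Returns the list of positions and the final velocity.
    """
    trajectory = [x]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append(x)
    return trajectory, v


# One full oscillation period is about 2*pi time units for k = mass = 1.
positions, final_velocity = velocity_verlet(1.0, 0.0, spring_force, 1.0, 0.01, 628)
```

The expense comes from the number of steps: with a step of about 1 femtosecond, simulating even a microsecond takes on the order of 10^9 force evaluations over all atoms.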

Evolutionary Coupling

Residue changes during evolution may be correlated.

If two residues tend to change together, it is probably because they are next to each other when the protein is folded.

This information is added to the force field / scoring system.
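One simple way to measure correlated changes is the mutual information between two columns of a multiple sequence alignment. The sketch below uses a tiny invented alignment; mutual information is only one possible measure and is used here as an illustration, since it does not separate direct from indirect correlations.

```python
from collections import Counter
from math import log2


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    col_i = Counter(seq[i] for seq in msa)
    col_j = Counter(seq[j] for seq in msa)
    pairs = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


def ranked_pairs(msa):
    """All column pairs sorted from most to least correlated."""
    length = len(msa[0])
    scored = [((i, j), mutual_information(msa, i, j))
              for i in range(length) for j in range(i + 1, length)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented toy alignment: columns 0 and 2 always change together.
toy_msa = ["AKEL", "AKEV", "DKRL", "DKRV"]
top_pair, top_score = ranked_pairs(toy_msa)[0]
```

High-scoring column pairs are candidate contacts, which could then enter the scoring function as a bonus for conformations that bring those residues close together.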

Threading or Fold Recognition

Threading tests all known protein folds for the best match. Only a limited number of protein folds exist (around 2000). Non-homologous proteins can have the same fold. Structural relationships between proteins can exist even when primary sequence similarity is low, due either to common ancestry or to the physical constraints on fold adoption.

Fit the target sequence into all known folds, and choose the one with the lowest energy and favourable stereochemistry.

Threading can be visualised as pulling a string of amino acids through the known fold. Once structurally aligned, a score is calculated for that fold. After threading into all known folds, the fold with the best score is chosen.
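The idea of scoring one sequence against every fold can be sketched as follows. Everything here is hypothetical: each fold is reduced to a string of buried (B) or exposed (E) positions, the score simply rewards hydrophobic residues in buried positions, and the sequence is placed without gaps. In this toy a higher score is better, whereas real methods typically look for the lowest energy.

```python
HYDROPHOBIC = set("AVILMFWC")  # simplified residue classification


def position_score(residue, environment):
    """+1 if the residue suits the environment, -1 otherwise."""
    is_hydrophobic = residue in HYDROPHOBIC
    if environment == "B":  # buried position
        return 1 if is_hydrophobic else -1
    return -1 if is_hydrophobic else 1  # exposed position


def thread_score(sequence, fold_environments):
    """Score a sequence placed onto a fold, position by position (no gaps)."""
    if len(sequence) != len(fold_environments):
        raise ValueError("this sketch only handles ungapped, equal-length threading")
    return sum(position_score(r, e) for r, e in zip(sequence, fold_environments))


def best_fold(sequence, folds):
    """Thread the sequence through every fold and return the best name and all scores."""
    scores = {name: thread_score(sequence, env) for name, env in folds.items()}
    return max(scores, key=scores.get), scores


# Hypothetical fold library and target sequence.
fold_library = {"fold_a": "BBEEBB", "fold_b": "EEBBEE"}
chosen, all_scores = best_fold("VLKDIF", fold_library)
```

A real threading program also has to align the sequence to each fold with gaps and uses far richer scoring terms, but the outer loop (score every fold, keep the best) has this shape.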

CATH and SCOP are databases of non-redundant protein folds. These can be used for matching your sequence against.

Scoring

The FUGUE method depends on: main chain conformation, solvent accessibility, hydrogen bonding, and amino acid substitution.

Insertions or deletions inside secondary structure elements are penalised more heavily than indels outside them.
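A structure-dependent gap penalty can be sketched as a lookup on the template's secondary structure at the gap position. The penalty values and the H/E/C state string below are illustrative assumptions, not FUGUE's actual parameters.

```python
def gap_penalty(ss_state, inside=8.0, outside=2.0):
    """Penalty for one gap position: higher inside a helix (H) or strand (E)."""
    return inside if ss_state in ("H", "E") else outside


def total_gap_cost(template_ss, gap_positions):
    """Sum the penalties for gaps placed at the given template positions."""
    return sum(gap_penalty(template_ss[p]) for p in gap_positions)


# Hypothetical template: coil, helix, coil, strand, coil.
template_ss = "CCHHHHCCEEEC"
cost_in_loop = total_gap_cost(template_ss, [6, 7])
cost_in_helix = total_gap_cost(template_ss, [3, 4])
```

With penalties like these an alignment algorithm is steered towards placing indels in loops, where they are less likely to disrupt the fold.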

The GenTHREADER method depends on a solvation potential and an energy based on the distance between atoms.

Check that homologous proteins are also deemed to have the same fold.

Check consensus between programs.

Check fold based on known protein function.

Homology Modelling

Also known as comparative or knowledge-based modelling.

This requires a homologous protein with known structure. It is the most reliable technique for structure prediction.

Structure is more conserved than sequence.

Higher sequence identity results in a better model. At 25% identity it is not worth doing unless you have other knowledge; at 90% the model can be just as good as crystallography.
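Sequence identity can be computed directly from a pairwise alignment. The sketch below counts identical residues over columns where neither sequence has a gap; other denominators are also in use, so this definition is an assumption, and the two aligned strings are invented.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Invented target/template alignment with one gap and one mismatch.
identity = percent_identity("ACDE-FG", "ACDQKFG")
```

The result can be compared against the rough guide above when deciding whether a candidate template is worth modelling from.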

Assumptions

Conserved regions will have identical coordinates. In reality they are similar but not identical.

Indels will fall in the loop regions, because the core is conserved.

Pick template

You can pick the single best homologue if the proteins are closely related.

Alternatively, average over a few homologous structures.

Or combine fragments from a few structures, although the joins are error prone.

Sequence alignment

Even one misaligned amino acid can cause a poor model.

If the alignment seems wrong you can realign. Check the model after it has been created. Try aligning more sequences to the template. Use knowledge of the active site.

Model core

The core is structurally conserved and modelled first. Check indels and whether moving these would be favourable.

Side chains are then modelled.

Model loop

Loops are the most difficult to model and not as conserved as the core. Loops are often functionally important and are also the most mobile part of the structure.

Candidate loops are searched for in a fragment database, then anchored onto the previously modelled core.

MODELLER can be used for this.

Energy minimisation

Loops are remodelled to minimise energy and pick the best conformation for stereochemistry.

Check it

Use PROCHECK to check stereochemistry.

Use ANOLEA to check how closely the model's atomic environments resemble those of solved structures in a knowledge base.

Errors in Homology Model

Loops modelled wrongly

Template selection

Sequence misalignment

Side chain packing can be variable

Checks

Stereochemistry check by PROCHECK

Energy potentials by ProSA, zDOPE

References

Understanding Bioinformatics chapter 13