Phylogeny is the history of descent from common ancestry, or evolutionary relationships.

Trees may be rooted or unrooted.

Common ancestry is represented.

Branch lengths should be ignored.

Common ancestry is represented.

Branch lengths represent a quantitative measure of evolutionary change

An additive tree where the distance from all leaves to parents are the same.

Distances between sequences can be used to help build a tree.

A naive method for defining distance, p, as the percentage of different residues or nucleotides in the alignment.

The issue with this that it is linear. Instead, the distance should become very large as p tends towards the fraction of expected different residues in a random alignment.

Multiple substitutions at the same site are hidden, so a linear measure will underestimate the true evolutionary distance.

This is used on DNA only

$$ d = -\frac{3}{4}ln(1-\frac{4}{3}p) $$

as p approaches 0.75, d approaches infinity.

Unweighted Pair Group Method using Arithmetic Averages

- For each sequence, create a cluster containing only that sequence
- Calculate distances between all pairs of clusters and put into a distance matrix. The distance between
a pair of clusters C
_{i}and C_{j}is the mean of sequence pairwise distances where one sequence is taken from C_{i}and the other from C_{j}. - Determine which two clusters have the minimum cluster distance. Create a new cluster, C
_{k}which is the union of C_{i}and C_{j}. - Add a new node to the tree, C
_{k}with children C_{i}and C_{j}. The node should have an equal distance, d/2, all the way to the leaf nodes. - In the distance matrix, remove C
_{i}and C_{j}, add in C_{k}, and repeat.

The final result will be a tree with equal distances to the root node. UPGMA assumes that divergence occurs at the same rate, and the evolutionary distance is the same.

It produces a rooted tree.

It assumes the molecular clock is at a constant rate, and the rate of mutations is the constant. This is not the case in reality.

The Ultrametric condition for correct UPGMA tree reconstruction is that the 3 distances between 3 leaf pairs are all equal, or two are equal and one is smaller.

UPGMA [http://www.sequentix.de]

For every set of four leave nodes, we have three ways of forming 2 pairs. Add the distance between each pair and we have 3 distances. Two of them must be equal and larger than the third distance.

This is caused by the bridge joining them together.

This is another way to produce a tree.

The tree is unrooted.

The branch lengths will represent evolutionary distance. The tree is additive.

Minimum evolution theory so that the total branch lengths, S, is at a minimum.

Starts with a star (hub and spoke), and spokes are gradually pinched together.

- get original distance matrix, for all pairs of d
- create Q matrix for corrected distances
- find lowest Q score, and join corresponding nodes a and b with a new node, c.
- The branch length between a and c is given by a new branch function, C.
- The branch length between b and c can be worked out from the previous step.
- Repeat, with the new branch node replacing the two leaf nodes.

Q, corrected distance function, in given by:

$$ Q(i,j) = (n-2)d(i,j) - U_i - U_j $$

where

$$ U_i = \sum_{k=1}^n d(i,k) $$

C, the distance to new branch node k, for nodes i and j is

$$ C(i,k) = \frac {1}{2} (d(i,j) + \frac{U_i - U_j}{N-2} )$$

Neighbor joining algorithm [wikipedia.org]

Another method to build trees is parsimony. The idea is that simple explanations are better than complex ones. It minimises substitutions or evolution.

Starts with a multiple sequence alignment. A tree can be built for each column (residue).

Traverse from leaves to root

For each parent, if intersection is empty then use the union and increment mismatch counter. Otherwise use the intersection.

Traverse from root to leaves, and pick nucleotide with least change.

To test out a feature such as a subtree.

Take the multiple sequence alignment and take a random subset of columns (e.g 20,35,2,33,22,50,51)

Create a tree from that, and repeat 1000s of times.

The frequency of the same branch subtree is a measure of significance.