Phylogeny is the history of descent from common ancestry, or evolutionary relationships.

Tree Representations

Trees may be rooted or unrooted.


Common ancestry is represented.

Branch lengths should be ignored.

Phylogram or Additive tree

Common ancestry is represented.

Branch lengths represent a quantitative measure of evolutionary change

Ultrametric tree

An additive tree where the distance from all leaves to parents are the same.

Pair-wise distances

Distances between sequences can be used to help build a tree.

A naive method for defining distance, p, as the percentage of different residues or nucleotides in the alignment.

The issue with this that it is linear. Instead, the distance should become very large as p tends towards the fraction of expected different residues in a random alignment.

Multiple substitutions at the same site are hidden, so a linear measure will underestimate the true evolutionary distance.

Jukes-Cantor distance

This is used on DNA only

$$ d = -\frac{3}{4}ln(1-\frac{4}{3}p) $$

as p approaches 0.75, d approaches infinity.


Unweighted Pair Group Method using Arithmetic Averages

UPGMA algorithm

  1. For each sequence, create a cluster containing only that sequence
  2. Calculate distances between all pairs of clusters and put into a distance matrix. The distance between a pair of clusters Ci and Cj is the mean of sequence pairwise distances where one sequence is taken from Ci and the other from Cj.
  3. Determine which two clusters have the minimum cluster distance. Create a new cluster, Ck which is the union of Ci and Cj.
  4. Add a new node to the tree, Ck with children Ci and Cj. The node should have an equal distance, d/2, all the way to the leaf nodes.
  5. In the distance matrix, remove Ci and Cj, add in Ck, and repeat.

The final result will be a tree with equal distances to the root node. UPGMA assumes that divergence occurs at the same rate, and the evolutionary distance is the same.

It produces a rooted tree.

It assumes the molecular clock is at a constant rate, and the rate of mutations is the constant. This is not the case in reality.

The Ultrametric condition for correct UPGMA tree reconstruction is that the 3 distances between 3 leaf pairs are all equal, or two are equal and one is smaller.


Additivity test

For every set of four leave nodes, we have three ways of forming 2 pairs. Add the distance between each pair and we have 3 distances. Two of them must be equal and larger than the third distance.

This is caused by the bridge joining them together.

Neighbour joining

This is another way to produce a tree.

The tree is unrooted.

The branch lengths will represent evolutionary distance. The tree is additive.

Minimum evolution theory so that the total branch lengths, S, is at a minimum.

Starts with a star (hub and spoke), and spokes are gradually pinched together.

Neighbour joining algorithm

  1. get original distance matrix, for all pairs of d
  2. create Q matrix for corrected distances
  3. find lowest Q score, and join corresponding nodes a and b with a new node, c.
  4. The branch length between a and c is given by a new branch function, C.
  5. The branch length between b and c can be worked out from the previous step.
  6. Repeat, with the new branch node replacing the two leaf nodes.

Q, corrected distance function, in given by:

$$ Q(i,j) = (n-2)d(i,j) - U_i - U_j $$


$$ U_i = \sum_{k=1}^n d(i,k) $$

C, the distance to new branch node k, for nodes i and j is

$$ C(i,k) = \frac {1}{2} (d(i,j) + \frac{U_i - U_j}{N-2} )$$

Neighbor joining algorithm []

Fitch's Parsimony

Another method to build trees is parsimony. The idea is that simple explanations are better than complex ones. It minimises substitutions or evolution.

Starts with a multiple sequence alignment. A tree can be built for each column (residue).

Parsimony algorithm

Traverse from leaves to root

For each parent, if intersection is empty then use the union and increment mismatch counter. Otherwise use the intersection.

Traverse from root to leaves, and pick nucleotide with least change.


To test out a feature such as a subtree.

Take the multiple sequence alignment and take a random subset of columns (e.g 20,35,2,33,22,50,51)

Create a tree from that, and repeat 1000s of times.

The frequency of the same branch subtree is a measure of significance.