Microarray

Microarrays are a transcriptomics technique. They measure the quantity of mRNA present in a sample.

A microarray is a chip that contains millions of specific probes on a planar surface.

Probes are complementary to the nucleotide sequences of interest.

There are multiple copies of the same probe on the chip.

A dual-colour array is hybridised with two samples whose strands have been labelled with different dyes.

Microarray steps

  • acquire sample
  • purify to extract mRNA or DNA
  • reverse transcriptase creates cDNA
  • coupling of dyes or labels, e.g. IVT (in vitro transcription) labelling
  • hybridisation to the probes
  • washing the array to remove unbound sample
  • laser scan
  • normalise and analyse data

Data analysis workflow

Raw data

The amount of hybridisation is quantified.

Raw data is read from the pixel intensities of the scanned image.

Quality control

Normalisation

Normalisation removes technical variation so that arrays can be compared. This variation is due to sample preparation, reagents and the arrays themselves.
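
The notes do not name a normalisation method. As an illustration only, the sketch below uses quantile normalisation, one common choice, which forces every array to share the same intensity distribution. The function name and the intensities are hypothetical.

```python
def quantile_normalise(matrix):
    """Quantile-normalise a genes x arrays matrix (a list of rows).

    The k-th smallest value on every array is replaced by the mean of
    the k-th smallest values across all arrays. Ties are broken by row
    order, which is a simplification.
    """
    n_genes = len(matrix)
    n_arrays = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_arrays)]

    # Mean of the k-th smallest value across arrays, for each rank k.
    sorted_columns = [sorted(col) for col in columns]
    rank_means = [
        sum(col[k] for col in sorted_columns) / n_arrays
        for k in range(n_genes)
    ]

    normalised = [[0.0] * n_arrays for _ in range(n_genes)]
    for j, col in enumerate(columns):
        # Gene indices ordered from lowest to highest intensity on array j.
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, gene in enumerate(order):
            normalised[gene][j] = rank_means[rank]
    return normalised


# Hypothetical intensities: 4 genes (rows) measured on 3 arrays (columns).
raw = [
    [5.0, 4.5, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 3.5, 6.0],
    [4.0, 2.0, 8.0],
]
normalised = quantile_normalise(raw)
```

After this step each column holds the same set of values, so differences between arrays in overall brightness no longer dominate the comparison.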

Expression matrix

Heatmap

A heatmap is a colour-coded representation of the expression matrix. Samples (vertical columns) with similar patterns are placed next to each other, and so are features (rows) with similar patterns across samples.

Values are colour coded for different scores.

Each row is a gene expression profile across all samples / experimental conditions.
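
A minimal sketch of the layout described here: rows are genes, columns are samples, and each value is turned into a colour. Scaling each row to z-scores before colouring is an assumption (a common choice, not something these notes specify), and the gene names, sample names and colour cut-offs are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical expression matrix: rows are genes, columns are samples.
samples = ["control_1", "control_2", "treated_1", "treated_2"]
expression = {
    "gene_A": [2.0, 2.2, 6.1, 5.9],
    "gene_B": [7.0, 6.8, 3.1, 2.9],
}


def row_z_scores(row):
    """Centre and scale one gene's profile across the samples."""
    m, s = mean(row), stdev(row)
    if s == 0:
        return [0.0 for _ in row]
    return [(x - m) / s for x in row]


def colour_for(z):
    """Map a score to a colour bin; the cut-offs are arbitrary."""
    if z <= -0.5:
        return "blue"
    if z >= 0.5:
        return "red"
    return "white"


# One row of colours per gene, one colour per sample.
heatmap = {
    gene: [colour_for(z) for z in row_z_scores(row)]
    for gene, row in expression.items()
}
```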

Differential Expression

Statistical tools such as R can be used. Data sets can be very large, so the processing is done computationally. The aims are to:

  • quantitatively assess the results
  • discover features
  • identify sources of bias

Workflow for differential gene expression

  1. build a hypothesis around the biological problem
  2. perform a statistical test of the hypothesis that is appropriate for the experiment's design
  3. calculate the p-value
  4. adjust for multiple testing
  5. choose a significance level
  6. check for biological relevance

A t-test can compare two groups of samples and look for significant differences. However, running this simple test on every gene gives many false positives, because so many genes are tested at once.
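
A rough sketch of why the per-gene test misleads. The first function computes a t-statistic for one gene. The unequal-variance (Welch) form is used here as an assumption, since the notes do not say which variant is meant. The second function does the arithmetic: if no gene is truly changed, testing 20,000 genes (a hypothetical count) at a 0.05 threshold is still expected to flag about 1,000 of them by chance.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """t-statistic for one gene, allowing unequal variances."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


def expected_false_positives(n_genes, alpha):
    """Genes expected to pass the threshold when none is truly changed."""
    return n_genes * alpha


# Hypothetical log-intensities for one gene in two conditions.
t_stat = welch_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])

# Hypothetical array size and threshold.
by_chance = expected_false_positives(20000, 0.05)
```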

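Multiple testing correction must be done. The false discovery rate (FDR) is the expected proportion of false positives among the genes called significant.

The family-wise error rate (FWER) is the probability of making even one false discovery. Controlling it is stricter, as it aims to filter out all false discoveries.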
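
The notes do not name specific correction procedures. As an assumption, the sketch below uses two standard ones: Bonferroni, which controls the FWER, and Benjamini-Hochberg, which controls the FDR. The p-values are hypothetical.

```python
def bonferroni(p_values):
    """FWER control: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """FDR control: Benjamini-Hochberg adjusted p-values."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # never decrease as the raw p-values increase.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.005]
fwer_adjusted = bonferroni(p_values)
fdr_adjusted = benjamini_hochberg(p_values)
```

With these values the Bonferroni-adjusted p-values are never smaller than the Benjamini-Hochberg ones, which reflects how much stricter FWER control is.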

A moderated t-statistic from the limma R package can be used instead. It borrows information across all genes when estimating the standard error of each gene.
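
To illustrate the idea only, here is a sketch of a moderated t-statistic for a two-group comparison. It is not limma's code or API. limma estimates the prior variance and prior degrees of freedom from all genes on the array, whereas here they are passed in as hypothetical parameters (prior_var, prior_df).

```python
import math
from statistics import mean, variance


def moderated_t(group_a, group_b, prior_df, prior_var):
    """Two-group t-statistic with the gene's variance shrunk to a prior.

    prior_df and prior_var stand in for values that would be estimated
    from all genes; here they are simply assumed.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / df
    # Weighted average of the prior variance and this gene's variance.
    shrunk_var = (prior_df * prior_var + df * pooled_var) / (prior_df + df)
    standard_error = math.sqrt(shrunk_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical gene with an unusually small variance.
ordinary = moderated_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], prior_df=0, prior_var=3.0)
moderated = moderated_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], prior_df=4, prior_var=3.0)
```

With prior_df set to 0 the result is the ordinary pooled-variance t-statistic. A gene whose variance happens to be small by chance gets a less extreme statistic once its variance is pulled towards the prior.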

Clustering analysis

Clustering is used to group items and identify commonalities.

Gene clustering finds genes that are co-expressed across various samples.

Sample clustering finds samples that are similar to each other in gene expression.

Clustering methods are either

  • hierarchical: divisive (start with all the data in one group and break it apart) or agglomerative (start with one member per group and merge groups)
  • partitioning: first determine the number of clusters, then apportion the items among them
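
As an example of a partitioning method, here is a minimal k-means sketch. k-means is an assumption, since the notes do not name a specific algorithm. The number of clusters k is fixed first, then each item is assigned to its nearest centroid and the centroids are recomputed. Taking the first k points as starting centroids is a naive choice made to keep the example deterministic.

```python
import math


def k_means(points, k, iterations=10):
    """Partition points into k clusters; returns (assignment, centroids)."""
    # Naive initialisation: the first k points become the centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical two-dimensional profiles forming two obvious groups.
points = [[1.0, 1.0], [9.0, 9.0], [1.5, 0.5], [8.5, 9.5]]
assignment, centroids = k_means(points, k=2)
```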

Distance measures

Clustering needs a measure of the distance between items. Methods for distance:

  • Euclidean: sensitive to outliers; clusters genes by expression level
  • Manhattan: more robust, good for noisy data
  • correlation: clusters genes by the overall shape of their profile
  • correlation squared: also groups genes with mirror (anti-correlated) profiles
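
The four measures can be sketched as follows. Turning a correlation r into a distance as 1 - r (and 1 - r squared) is an assumed convention, and the profiles are hypothetical.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson(x, y):
    """Pearson correlation coefficient of two profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def correlation_distance(x, y):
    """Small when two profiles have the same shape."""
    return 1 - pearson(x, y)


def squared_correlation_distance(x, y):
    """Small when profiles have the same shape or are mirror images."""
    return 1 - pearson(x, y) ** 2


# Hypothetical profiles across four samples.
rising = [1.0, 2.0, 3.0, 4.0]
rising_higher = [2.0, 4.0, 6.0, 8.0]   # same shape, different level
falling = [4.0, 3.0, 2.0, 1.0]         # mirror image of `rising`
```

Here rising and rising_higher are far apart by Euclidean distance but have a correlation distance of zero, while rising and falling are close only under the squared-correlation distance.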

Cluster algorithms

  • single linkage: the shortest distance between a pair of members, one from each cluster
  • complete linkage: the maximum distance between a pair of members, one from each cluster
  • average linkage: the mean distance over all pairs of members, one from each cluster
  • Ward: merge the clusters that minimise the variance within a cluster
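
A minimal agglomerative sketch showing where the linkage rule enters: it decides how the distance between two clusters is derived from the distances between their members. Euclidean distance is assumed, Ward's method is left out for brevity, and the function names and data are hypothetical.

```python
import math
from itertools import product

# How to combine the member-to-member distances between two clusters.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda distances: sum(distances) / len(distances),
}


def cluster_distance(cluster_a, cluster_b, points, linkage):
    """Distance between two clusters given as lists of point indices."""
    distances = [
        math.dist(points[i], points[j]) for i, j in product(cluster_a, cluster_b)
    ]
    return LINKAGES[linkage](distances)


def agglomerate(points, n_clusters, linkage="average"):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        pairs = [
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        ]
        x, y = min(
            pairs,
            key=lambda pair: cluster_distance(
                clusters[pair[0]], clusters[pair[1]], points, linkage
            ),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical one-dimensional expression values for five samples.
points = [[0.0], [1.0], [5.0], [6.0], [20.0]]
clusters = agglomerate(points, n_clusters=3, linkage="complete")
```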

Visualisation

A hierarchical clustering can be visualised in a heatmap. Samples are clustered hierarchically and the columns are rearranged so that similar samples are adjacent. In the same way, the rows are rearranged so that genes with similar patterns across samples are adjacent.

Classification

Clustering can help with classification, that is, labelling samples by type.

A new sample can be scored against samples that have already been classified to determine the best fit.
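
One simple way to score a new sample, shown only as a sketch: compare it with the average profile (centroid) of each known class and pick the closest. The nearest-centroid rule, Euclidean distance, class labels and values are all assumptions made for illustration.

```python
import math


def centroid(profiles):
    """Average expression profile of a set of samples."""
    return [sum(values) / len(profiles) for values in zip(*profiles)]


def classify(new_profile, labelled_profiles):
    """Return the label whose centroid is nearest to the new sample."""
    centroids = {
        label: centroid(profiles) for label, profiles in labelled_profiles.items()
    }
    return min(centroids, key=lambda label: math.dist(new_profile, centroids[label]))


# Hypothetical expression of three genes in already classified samples.
known = {
    "tumour": [[8.0, 1.0, 7.0], [9.0, 2.0, 6.0]],
    "normal": [[2.0, 5.0, 1.0], [1.0, 6.0, 2.0]],
}
best_fit = classify([7.5, 1.5, 6.0], known)
```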

Reasons for a microarray or transcriptomics experiment

  • Comparison of diseased to normal tissue
  • Classify different tissue types or cell populations
  • Determine gene expression for a specific clinical profile
  • Gene expression before and after treatment