Next Generation Sequencing (NGS)

1000 Genomes

The 1000 Genomes Project provides a free, public catalogue of human genetic variation.

International HapMap Project

The International HapMap Project is a catalog of genetic variations in humans.

Haplotype

A set of SNPs (or, more broadly, alleles) on the same chromosome that are usually inherited together.

If certain SNPs are known to occur together as a set, then observing some of them allows the others to be inferred. This is useful when only parts of genes are sequenced.

haplotype / haplotypes at nature.com

Genetic variation, LD, HapMap, and beyond (2012) by Broad Institute at youtube.com

Genetic Linkage

Non-random association between SNPs due to their close proximity on the chromosome.

Linkage disequilibrium

Non-random association between SNPs, not necessarily on the same chromosome.
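
The strength of a non-random association between two SNPs is often summarised with the statistics D and r squared. The following is a minimal sketch, assuming two biallelic loci with known allele and haplotype frequencies; the function name and parameters are illustrative and do not come from any particular library.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, r_squared) for two biallelic loci.

    p_ab: frequency of haplotypes carrying allele A at locus 1 and allele B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    # D is the difference between the observed haplotype frequency and the
    # frequency expected if the two alleles were independent.
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r squared is undefined when either locus is not variable; return 0.0 here.
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared
```

With p_a = p_b = 0.5, a haplotype frequency of 0.25 gives D = 0 (no association), while a haplotype frequency of 0.5 gives r squared = 1 (the two alleles always occur together).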

FastQC

FastQC is a program written in Java that reports on the quality of NGS data.

Library

A library is a wet lab prepared sample of DNA or RNA that must be created before being used in the sequencer.

Steven R. Head, H. Kiyomi Komori, Sarah A. LaMere, Thomas Whisenant, Filip Van Nieuwerburgh, Daniel R. Salomon, and Phillip Ordoukhanian. Library construction for next-generation sequencing: Overviews and challenges. BioTechniques, Vol. 56, No. 2, February 2014, pp. 61-77

Bait

Baits are used for targeted sequencing. A bait molecule can be used to 'fish' out target fragments, such as exome fragments.

Also referred to as probes. Baits are typically biotinylated so that they can be captured on streptavidin-coated magnetic beads.

Hybridisation

Hybridisation is where a single-stranded DNA molecule anneals (joins) to a complementary DNA strand.
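
Complementarity can be illustrated in code. This is a simplified sketch that only accepts a perfect match between two strands written 5' to 3'; real hybridisation tolerates some mismatches and depends on temperature and salt conditions. The function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (uppercase A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def can_hybridise(strand_a, strand_b):
    """True if strand_b is the exact reverse complement of strand_a.

    Both strands are written 5' to 3'. Mismatches are not tolerated in this sketch.
    """
    return strand_b == reverse_complement(strand_a)
```

For example, reverse_complement("ATGC") is "GCAT", so can_hybridise("ATGC", "GCAT") is True.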

Hybrid capture

Use probes attached to magnetic beads as bait to fish out particular DNA, such as the exome. After a couple of days, use a magnet to pull out the bait and attached fragments. Denature the bait and fragment, wash, and you are left with your target fragments.

Off target effect

Incorrect fragments being picked up from hybrid capture.

Picard

Picard is used to manipulate SAM and BAM files. It can also generate metrics or reports of your data.

Insert size

Insert size, or insert length, is the fragment size before adapters have been added.

Coverage

Coverage usually refers to coverage depth: the number of reads that cover a particular base. For example, a base with 20x coverage is covered by 20 different reads.

Coverage breadth is the percentage of the target genome that is covered to at least a certain depth.
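
Both measures can be computed from read alignments. The sketch below assumes each aligned read is given as a 0-based, half-open (start, end) interval on the target; this representation and the function names are illustrative and do not correspond to any specific tool's output.

```python
def coverage_depth(reads, target_length):
    """Return a list with the number of reads covering each base of the target.

    reads: list of (start, end) intervals, 0-based and half-open.
    """
    depth = [0] * target_length
    for start, end in reads:
        # Clip each read to the target so out-of-range reads are ignored.
        for pos in range(max(start, 0), min(end, target_length)):
            depth[pos] += 1
    return depth


def coverage_breadth(depth, min_depth):
    """Return the percentage of bases covered by at least min_depth reads."""
    if not depth:
        return 0.0
    covered = sum(1 for d in depth if d >= min_depth)
    return 100.0 * covered / len(depth)
```

For three reads at (0, 5), (3, 8) and (4, 10) on a 10-base target, the depth per base is [1, 1, 1, 2, 3, 2, 2, 2, 1, 1], and the breadth at a minimum depth of 2 is 50%.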

samtools

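samtools is a set of utilities for working with SAM and BAM files. Its "flagstat" command outputs a summary report of the alignment flags.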

Fragmentation

Breaking the original DNA or RNA strand into smaller pieces.

Lane

A lane (also called a channel or strip) is where the physical sample is loaded before it goes into the sequencer. Its surface is coated with two types of oligos.

Barcoding

Used to tag different samples (e.g. source person 1 vs. source person 2) so they can be sequenced together, and then differentiated later. This is part of library prep.

Another name for the barcode is index.

This allows for pooled sample libraries.
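
Separating pooled reads back into samples (demultiplexing) can be sketched as follows. For illustration only, this assumes the barcode sits at the start of each read and that all barcodes have the same length; on many platforms the index is read separately. The function name and the "undetermined" label are hypothetical choices.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Group reads by sample using the barcode at the start of each read.

    reads:    list of read sequences
    barcodes: dict mapping barcode sequence -> sample name (equal-length barcodes)
    Returns a dict mapping sample name -> list of reads with the barcode removed.
    """
    if not barcodes:
        raise ValueError("at least one barcode is required")
    barcode_len = len(next(iter(barcodes)))
    by_sample = defaultdict(list)
    for read in reads:
        # Reads whose prefix matches no known barcode are kept separately.
        sample = barcodes.get(read[:barcode_len], "undetermined")
        by_sample[sample].append(read[barcode_len:])
    return dict(by_sample)
```

Exact matching is used here for simplicity; a practical demultiplexer would usually allow for a sequencing error in the barcode.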

Single end

The sequencer gives you reads from only one end of each fragment.

Paired end sequencing

The sequencer will give you reads from both ends of each fragment. This may be useful for determining alternative splicing.

Flow Cell

A glass slide with lanes.

Multiplex sequencing

Uses barcodes to distinguish between fragments from different samples.

Multiplex Sequencing Assay at illumina.com

Linker

See Adapter.

Adapter

A sequence of bases appended or prepended to a single DNA strand.

Custom synthetic adapters are required for sequencers.

Oligos

Short for oligonucleotides: short sequences of bases.

Primer

A short synthetic oligonucleotide that hybridises to the template and provides the starting point for DNA polymerase.

Amplification

Duplication of strands so that their signal is strong enough to be read by the sequencer.

PCR

Polymerase chain reaction (PCR) is the usual method of amplification. It has issues, such as bias from preferential amplification: not all strands are duplicated at the same rate.

Complexity

Higher complexity means the bases are more uniformly distributed.

Bridge Amplification

A type of PCR

Clonal amplification of a strand. Adapters have been ligated to both ends of the strand, and the two adapters are designed to be complementary to the two types of oligos on the lane. The two ends of the strand hybridise to the two oligos on the lane, forming a bridge. DNA polymerase creates a complementary strand. The bridge is denatured, and two single strands (forward and reverse) now exist. This is repeated many times.

Sequencing by synthesis

After bridge amplification, reverse strands are cleaved and washed away.

The forward strands remain. The sequencing primer has previously been added to the DNA strand.

Synthesis starts after the sequencing primer. The read product is grown complementary to the strand from the 3' end. Fluorescently tagged nucleotides (nt) are added, but only one type will pair with the template at each position. After each addition, the cluster is excited by a light source, the emitted fluorescence (wavelength and intensity) is recorded, and the base (G, A, T or C) is called. The fluor is then cleaved away. This is repeated 100 times for a read length of 100. The read product is then washed away.

Another read product is created at the 5' adapter end. This is an index, or tag, for the template.

Next is the reverse strand. The 3' end is uncapped and folds down onto the second type of oligo on the flow cell. The 5' end is released. The read product for the reverse index is generated, then the read product for the template.

Now there are reads for forward and reverse, and also indices to help identify the sample.

Illumina Sequencing Technology at youtube.com

Source of errors as read length increases

3' cap didn't work.

Fluor cleavage didn't work.

Concordant

Similar or in agreement.

Ion Torrent

The DNA strand is in a well. A hydrogen ion (H+) is released each time a base is incorporated, and the resulting change in pH is measured.

Shearing

Process of fragmenting DNA.

GATK

The Genome Analysis Toolkit

Sequence cycle

One cycle is performed for each base read, so a read length of 100 takes 100 cycles.

Systematic Errors

Towards the end of the read, systematic errors can occur.

Statistics for Genomics: Base Calling in Next Gen Data at youtube.com

Ti/Tv

Transition to transversion ratio in a set of SNPs. Expected values are 2.1 for whole genome, and 2.8 for exome.
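
A transition swaps a purine for a purine (A and G) or a pyrimidine for a pyrimidine (C and T); any other single-base change is a transversion. The ratio can be sketched as below, assuming each SNP is given as a (reference base, alternate base) pair with the two bases different. The function names are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True if ref -> alt stays within purines or within pyrimidines."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ti_tv_ratio(snps):
    """Return the transition to transversion ratio for a list of (ref, alt) SNPs.

    Assumes ref != alt for every SNP. Returns infinity if there are no transversions.
    """
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    return transitions / transversions if transversions else float("inf")
```

A call set whose ratio is far below the expected value may contain many false-positive variants, which is why Ti/Tv is commonly used as a quality check.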

Comparison to Microarray

Microarray                          Next generation sequencing (NGS)
Hybridisation                       Sequencing
Prior sequence info required        Prior sequence info not required
Limited dynamic range               Large dynamic range
Mature informatics and statistics   Emerging informatics and statistics
Cheap                               Expensive

Structural Variation

A variation in genomic structure affecting sequences of about 1 kb to 3 Mb in length.

Copy Number Variations

Consider a reference chromosome pair: ABC, ABC. The copy number of gene B is 2.

Variations exist.

genome description         chromosome pair   copy number for B gene
reference                  ABC, ABC          2
heterozygous deletion      AC, ABC           1
homozygous deletion        AC, AC            0
heterozygous duplication   ABBC, ABC         3
homozygous duplication     ABBC, ABBC        4
complex amplification      ABBBC, ABC        4
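
The copy number is simply the total count of the gene across both chromosomes of the pair. A minimal sketch, using the same one-letter-per-gene notation as the table (the function name is illustrative):

```python
def copy_number(chromosome_pair, gene="B"):
    """Count how many copies of a gene occur across a pair of chromosomes.

    chromosome_pair: two strings in which each letter stands for one gene copy,
    for example ("ABBC", "ABC").
    """
    return sum(chromosome.count(gene) for chromosome in chromosome_pair)


# The first rows of the table, checked against the function.
assert copy_number(("ABC", "ABC")) == 2    # reference
assert copy_number(("AC", "ABC")) == 1     # heterozygous deletion
assert copy_number(("AC", "AC")) == 0      # homozygous deletion
assert copy_number(("ABBC", "ABC")) == 3   # heterozygous duplication
assert copy_number(("ABBC", "ABBC")) == 4  # homozygous duplication
```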

Loss of Heterozygosity

Typically you will inherit one allele from your father and a different allele from your mother. This is heterozygosity: ABfC, ABmC.

If one allele is deleted, the copy number is now 1: AC, ABmC.

If the remaining allele is then copied, the copy number is back to 2, but both copies are the same: ABmC, ABmC.

Single Nucleotide Polymorphism (SNP)

Pronounced "snip". A SNP is a position where a single base differs from the reference human genome. The locations at which SNPs occur are not random, suggesting that they are inherited.

Regulatory vs Coding

Each gene has a regulatory region and a coding region in its sequence.

If a base is changed in the regulatory region, then the gene expression or amount of protein may change.

If a base is changed in the coding region, then it may be a:

  • Synonymous SNP or mutation. The amino acid is not changed.
  • Non-synonymous SNP. The amino acid is changed, and so is the protein sequence. This is a missense mutation (or a nonsense mutation if the change creates a stop codon).
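
The distinction can be checked by translating the codon before and after the change. The sketch below builds the standard genetic code and classifies a single-base change within one DNA codon. It also reports a nonsense change, where the new codon is a stop codon. The function name is illustrative, and a change that removes a stop codon is not given its own label here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon, position, new_base):
    """Classify a single-base change in a DNA codon.

    codon:    three-letter DNA codon, e.g. "GAG"
    position: 0, 1 or 2, the index of the changed base within the codon
    new_base: the replacement base
    """
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"
    if after == "*":
        return "nonsense"  # the change introduces a stop codon
    return "missense"
```

For example, GAA to GAG leaves glutamate unchanged (synonymous), GAG to GTG changes glutamate to valine (missense), and TAC to TAA turns tyrosine into a stop codon (nonsense).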

Point mutation

A type of mutation that causes a single nucleotide base change, insertion, or deletion. A frameshift mutation is an insertion or deletion of base pairs that shifts the reading frame of the codons that follow.

Chromosomal mutations

  • deletion
  • insertion
  • inversion. part of the sequence is reversed
  • translocation, between chromosomes
  • duplication

NGS Challenges

Large amounts of raw data

CPU processing power required

Software and hardware management required.

NGS steps

  1. DNA samples
  2. break DNA into smaller pieces (e.g. sonication)
  3. feed into sequencer
  4. acquire raw fastQ read data
  5. report on read quality
  6. filter reads
  7. align to reference genome (Bowtie)
  8. post alignment filter
  9. report on coverage and capture
  10. variant calling

FastQ

Raw read data from the sequencer. Each read has an array of bases and a corresponding quality score for each base.
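
A record in this format spans four lines: an identifier starting with "@", the bases, a separator line starting with "+", and a quality string with one character per base. The sketch below parses such text and decodes the quality characters, assuming the common Phred+33 encoding (older data may use a different offset) and assuming records are not wrapped over extra lines. The function name is illustrative.

```python
def parse_fastq(text, offset=33):
    """Yield (read_id, bases, quality_scores) for each four-line record in text.

    offset: ASCII offset of the quality encoding (33 is assumed here).
    """
    lines = text.strip().splitlines()
    # Ignore any trailing incomplete record.
    for i in range(0, len(lines) - len(lines) % 4, 4):
        header, bases, _separator, quality = (line.strip() for line in lines[i:i + 4])
        scores = [ord(char) - offset for char in quality]
        yield header[1:], bases, scores


example = """@read1
GATTACA
+
IIIIII#
"""

for read_id, bases, scores in parse_fastq(example):
    # 'I' decodes to 40 (high confidence) and '#' decodes to 2 (very low confidence).
    print(read_id, bases, scores)
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so a score of 40 means a 1 in 10,000 chance that the base call is wrong.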

IGV

Integrative Genomics Viewer, a desktop tool for visually exploring aligned reads and variants.

References

Next-Generation Sequencing Technologies - Elaine Mardis (2014) at youtube.com