Chemoinformatics is the use of computer-based techniques for working with chemical information.

Typical tasks: storage, searching, retrieval and visualisation.

Representing chemical compounds

We would like:

  1. computer friendly format for easy storage and retrieval
  2. capability to search: clustering, similarity, fast retrieval
  3. store molecular properties against molecular structure

Options for representation with increasing information:

  1. trivial name
  2. 2D structure (Fischer projection, Haworth diagram). Shows molecular connectivity
  3. 3D structure (geometry and conformation)
  4. surface (property mapping)

Issues include

  1. Stereochemistry (3D)
  2. Aromaticity (resonance forms in benzene)
  3. Tautomers (isomers that readily interconvert)

Molecules as strings

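  1. Trivial names. Popular but not systematic.
  2. IUPAC names, e.g. 2-amino-3-phenylpropanoic acid (phenylalanine). Systematic, but can be very long.
  3. Chemical formula, e.g. C6H5CH2CH(NH2)COOH (condensed) or C9H11NO2 (molecular). Not unique: isomers share the same formula.
  4. SMILES. Originally a proprietary Daylight specification; an open specification (OpenSMILES) also exists.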

In a SMILES string hydrogens are implicit, and double and triple bonds are written as = and #. The string is produced by a depth-first traversal of the molecular graph.

InChI strings

IUPAC International Chemical Identifier

A non-proprietary identifier for chemical substances that serialises a chemical structure into a string. It is designed for computers rather than people, so very big molecules produce long, complex strings.

  1. It is non-proprietary
  2. unique for every molecule (canonical)
  3. good for indexing in databases
  4. search engine friendly

H2O in InChI is "1S/H2O/h1H2"

Fields (or layers) delimited by "/"

1S = version 1, S = standard.

Chemical formula does not have a prefix (no lower case first character)

benzene InChI is "1S/C6H6/c1-2-4-6-5-3-1/h1-6H"

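The h layer lists the hydrogens attached to each atom.

The c layer lists the connections between atoms (hydrogens excluded).

Together with the formula these make up the main layer. A layer is left out when it has nothing to record (the water string has no c layer).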

There are further layers for protonation (p), charge (q), double-bond stereochemistry (b) and tetrahedral stereochemistry (t, m, s).

The different layers make it easier for a user to search or filter.

The sublayers are optional, so you can have two strings that represent the same molecule, where one carries more information than the other.
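The layer structure can be sketched in a few lines of Python. This is only an illustration, not an InChI parser: the function name `split_inchi_layers` is made up here, and it assumes each prefix letter appears once (non-standard InChI strings can repeat prefixes, in which case later values would overwrite earlier ones).

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into its '/'-delimited layers.

    Returns 'version', 'formula', and one entry per prefixed layer,
    keyed by its lower-case prefix letter (e.g. 'c', 'h', 'q').
    """
    # Drop the "InChI=" prefix if the string carries one.
    body = inchi[len("InChI="):] if inchi.startswith("InChI=") else inchi
    fields = body.split("/")
    layers = {"version": fields[0]}
    for field in fields[1:]:
        if not field:
            continue
        if field[0].islower():
            # Prefixed layer: the first character names the layer.
            layers[field[0]] = field[1:]
        else:
            # The chemical formula is the only layer without a prefix.
            layers["formula"] = field
    return layers


print(split_inchi_layers("1S/C6H6/c1-2-4-6-5-3-1/h1-6H"))
print(split_inchi_layers("1S/H2O/h1H2"))
```

Filtering on a single layer (say, all entries whose formula is C6H6) then becomes a dictionary lookup.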

One disadvantage is that for very large molecules the string becomes very long.

A solution is to hash the string into a 27-character InChIKey: 14 characters from the atom connections, 8 from the other layers, 3 flags (standard, version, protonation) and 2 hyphens.

This is useful for search engines and as a key in a database.
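As a rough sketch of that 27-character layout, the function below splits a key into its parts. The name `split_inchikey` is hypothetical, the code only checks lengths (it does not compute or verify any hash), and the example key is, to my knowledge, the standard InChIKey of water; treat it as illustrative.

```python
def split_inchikey(key: str) -> dict:
    """Split a 27-character InChIKey into its hash blocks and flags."""
    blocks = key.split("-")
    if len(key) != 27 or len(blocks) != 3:
        raise ValueError("expected 27 characters in three hyphen-separated blocks")
    first, second, protonation = blocks
    if (len(first), len(second), len(protonation)) != (14, 10, 1):
        raise ValueError("expected blocks of 14, 10 and 1 characters")
    return {
        "connectivity_hash": first,         # 14 characters from the atom connections
        "other_layers_hash": second[:8],    # 8 characters from the remaining layers
        "standard_flag": second[8],         # standard / non-standard
        "version_flag": second[9],          # InChI version
        "protonation_flag": protonation,    # protonation state
    }


print(split_inchikey("XLYOFNOQVPJJNP-UHFFFAOYSA-N"))
```

Because the first block depends only on connectivity, comparing just the first 14 characters is one way to look up entries that share a skeleton but differ in the other layers.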

Atom and Connection table

Structures can also be stored as tables in text files. An atom table, e.g. a PDB file, contains a sequential list of atoms with incrementing IDs. Associated with each atom are its residue, atom name and xyz coordinates.

A connection table lists the bonds: each row is atom_id_1, atom_id_2, bond type (single, double, ...).
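A minimal sketch of the two tables in Python, using the heavy atoms of ethanol (C-C-O) as a hand-written example. The variable names and the `build_adjacency` helper are made up for illustration; real file formats carry many more fields.

```python
from collections import defaultdict

# Atom table: atom id -> element symbol (hydrogens left implicit).
atoms = {1: "C", 2: "C", 3: "O"}

# Connection table: one row per bond, (atom_id_1, atom_id_2, bond_order).
bonds = [(1, 2, 1), (2, 3, 1)]


def build_adjacency(bond_rows):
    """Turn connection-table rows into a neighbour lookup."""
    adjacency = defaultdict(dict)
    for first, second, order in bond_rows:
        # Bonds are undirected, so record them in both directions.
        adjacency[first][second] = order
        adjacency[second][first] = order
    return dict(adjacency)


adjacency = build_adjacency(bonds)
print(adjacency)                             # neighbours and bond orders per atom
print([atoms[n] for n in adjacency[2]])      # elements bonded to atom 2
```

The adjacency lookup is the graph view of the same data: atoms are nodes and connection-table rows are edges.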

Markush Structures

Generic structures, with multiple R (side chain) groups. R can be an enumeration of possible submolecules, or a set of variations, or repeated units.

These are used in patents to claim a whole family of compounds, which also keeps the actual molecule of interest hidden.


Molecules can be represented as graphs with atoms as nodes, and bonds as edges.

Structural Keys

A structural key for a molecule is an array of flags. Each flag is predefined and based on whether the molecule has a certain feature. These features are usually fragment based.

flag1 = true if molecule contains an aromatic ring

flag2 = true if molecule contains an amide ring

flag3 = true if molecule contains 3x C=O double bonds


MACCS keys are a predefined list of key rules, e.g. S exists, charged, 5-membered ring exists.


Advantages include:

  1. easy to interpret
  2. easy to search


Disadvantages include:

  1. keys must be predefined
  2. keys may not cover all features
  3. keys may be irrelevant to your problem
  4. keys may be sparse (rarely flagged or set)
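A toy structural key can be sketched as follows. The three keys and the input format (a list of atoms plus a list of bonds) are assumptions chosen to keep the example short; they are not the real MACCS definitions, and ring or aromaticity keys would need a proper toolkit.

```python
def structural_key(atoms, bonds):
    """Return a list of predefined boolean flags for one molecule.

    atoms: list of (element, formal_charge)
    bonds: list of (atom_index_1, atom_index_2, bond_order)
    """
    carbonyls = sum(
        1
        for i, j, order in bonds
        if order == 2 and {atoms[i][0], atoms[j][0]} == {"C", "O"}
    )
    return [
        any(element == "S" for element, _ in atoms),   # key 0: S exists
        any(charge != 0 for _, charge in atoms),       # key 1: charged
        carbonyls >= 3,                                # key 2: at least 3 C=O bonds
    ]


# Acetic acid heavy atoms: C-C(=O)-O
acetic_atoms = [("C", 0), ("C", 0), ("O", 0), ("O", 0)]
acetic_bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]
print(structural_key(acetic_atoms, acetic_bonds))

# Thioacetate anion heavy atoms: C-C(=O)-S(-)
thio_atoms = [("C", 0), ("C", 0), ("O", 0), ("S", -1)]
thio_bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]
print(structural_key(thio_atoms, thio_bonds))
```

Acetic acid sets none of these keys, which shows the sparseness problem in miniature: a key list that does not fit the molecules at hand tells you very little.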


A fingerprint is an array of bits.

Hashed fingerprints are based on all atom-atom paths of up to length 7. Each path is hashed, causing 4-5 bits to be set in the array. A bit may already have been set by another path; bits are combined into the fingerprint using logical OR.

This means if a molecule contains a given substructure, the corresponding bits will always be set.

However, because different paths can set the same bits, a fingerprint match can be a false positive.
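The idea can be sketched in plain Python. Everything specific here is an assumption made for illustration: the 64-bit width, 4 bits per path, SHA-256 as the hash, and ignoring bond orders. It is not the Daylight algorithm, only the same general scheme.

```python
import hashlib


def enumerate_paths(atoms, adjacency, max_atoms=7):
    """Return all linear atom paths (as label strings) up to max_atoms long."""
    paths = set()

    def extend(path):
        forward = "-".join(atoms[i] for i in path)
        backward = "-".join(atoms[i] for i in reversed(path))
        paths.add(min(forward, backward))  # same path read in either direction
        if len(path) < max_atoms:
            for neighbour in adjacency[path[-1]]:
                if neighbour not in path:
                    extend(path + [neighbour])

    for start in atoms:
        extend([start])
    return paths


def path_fingerprint(atoms, adjacency, n_bits=64, bits_per_path=4):
    """Hash every path to a few bit positions and OR them into one integer."""
    fingerprint = 0
    for path in enumerate_paths(atoms, adjacency):
        for k in range(bits_per_path):
            digest = hashlib.sha256(f"{k}:{path}".encode()).digest()
            fingerprint |= 1 << (int.from_bytes(digest[:4], "big") % n_bits)
    return fingerprint


def may_contain(fp_molecule, fp_query):
    """Screen: every bit set by the query must also be set by the molecule."""
    return (fp_molecule & fp_query) == fp_query


# Ethanol heavy atoms (C-C-O) and a C-O query fragment.
ethanol = path_fingerprint({1: "C", 2: "C", 3: "O"}, {1: [2], 2: [1, 3], 3: [2]})
query = path_fingerprint({1: "C", 2: "O"}, {1: [2], 2: [1]})
print(may_contain(ethanol, query))
```

A `True` result only means the molecule may contain the query; a `False` result rules it out, which is what makes the bit test useful as a screen.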

Advantages include:

  1. no need to pre-define flags; all substructures and paths are hashed
  2. more data fits into the same length array without losing too much specificity (few false positives)
  3. patterns overlap (every sub-path of a path is also hashed), so a genuine match is more likely
  4. bits are easy to compare when searching for a substructure


Disadvantages include:

  1. all substructures are considered, even if they are not useful
  2. different paths can map onto the same bit, so you must be aware of false positives
  3. the original structure is even harder to recover than from structural keys
  4. duplicate paths and the number of times a substructure occurs are not saved


Folding fingerprints

Information density (related to sparseness) is the fraction of bits in the array that are set.


Folding splits the fingerprint in half and combines the two halves with OR, so each pair of bits is squashed into one.

This saves space but increases the chance of false positives. The point of saving space is to speed up searches in databases.
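Folding and information density are easy to show on a small example. The sketch below stores a fingerprint as a Python integer; the function names `fold` and `density` and the 8-bit width are illustrative choices only.

```python
def fold(fingerprint: int, n_bits: int) -> int:
    """Fold an n_bits-wide fingerprint in half by OR-ing the two halves."""
    half = n_bits // 2
    low = fingerprint & ((1 << half) - 1)   # lower half of the bits
    high = fingerprint >> half              # upper half, shifted down
    return low | high


def density(fingerprint: int, n_bits: int) -> float:
    """Fraction of the n_bits positions that are set."""
    return bin(fingerprint).count("1") / n_bits


sparse = 0b10000010                      # 2 of 8 bits set
folded = fold(sparse, 8)                 # 4-bit result
print(format(folded, "04b"), density(sparse, 8), density(folded, 4))

colliding = 0b00010001                   # both bits land on the same position
print(format(fold(colliding, 8), "04b"))
```

In the first case folding doubles the density with no loss; in the second, two distinct bits collapse into one, which is where the extra false positives come from.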


List of fields per molecule.

physicochemical properties (logP, weight, H bonding)

2D description (branching factor)

3D structure keys (heme cofactor, zinc ligand, fold)

Lipinski Rule of 5

A molecule is a potential (orally active) drug if it breaks no more than one of the following tests:

  1. It has no more than 5 H bond donors (NH or OH)
  2. It has no more than 10 H bond acceptors (N or O)
  3. It has a molecular mass less than 500 daltons.
  4. logP (the octanol-water partition coefficient, a measure of lipophilicity) is less than 5.
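The rule can be sketched as a simple violation count. The thresholds follow the list as written here; the function names and the descriptor values in the examples are hypothetical and do not describe real compounds. Computing the descriptors themselves needs a cheminformatics toolkit and is outside this sketch.

```python
def lipinski_violations(h_donors, h_acceptors, mass_da, log_p):
    """Count how many of the four tests a molecule breaks."""
    tests = [
        h_donors <= 5,       # no more than 5 H bond donors
        h_acceptors <= 10,   # no more than 10 H bond acceptors
        mass_da < 500,       # molecular mass below 500 Da
        log_p < 5,           # logP below 5
    ]
    return sum(not passed for passed in tests)


def passes_rule_of_five(h_donors, h_acceptors, mass_da, log_p):
    """At most one broken test is tolerated."""
    return lipinski_violations(h_donors, h_acceptors, mass_da, log_p) <= 1


# Hypothetical descriptor values, for illustration only.
print(passes_rule_of_five(2, 5, 350.0, 2.1))    # breaks nothing
print(passes_rule_of_five(2, 5, 520.0, 2.1))    # breaks one test
print(passes_rule_of_five(6, 12, 480.0, 3.0))   # breaks two tests
```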

String labels

Canonical names should be used to guarantee uniqueness. An algorithm must be chosen to come up with a canonical name.

Searching a subgraph

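If we represent a molecule as a graph and search it for a subgraph, we face subgraph isomorphism, which is NP-complete. In the worst case the running time of known algorithms grows exponentially with the number of atoms.

Pre-filtering (e.g. via fingerprints) is therefore required, so that the expensive search only runs on molecules that could match.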
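To see why the search is expensive, here is a naive backtracking matcher. It is a sketch under simplifying assumptions: atoms are matched on element only, bond orders are ignored, and the function name `find_substructure` and the input format are made up for this example.

```python
def find_substructure(q_atoms, q_bonds, t_atoms, t_bonds):
    """Return one mapping {query atom: target atom}, or None if there is none.

    q_atoms / t_atoms: dict of atom id -> element
    q_bonds / t_bonds: set of frozenset({atom_id_1, atom_id_2})
    """
    q_ids = list(q_atoms)

    def extend(mapping):
        if len(mapping) == len(q_ids):
            return dict(mapping)
        q = q_ids[len(mapping)]              # next query atom to place
        for t in t_atoms:
            if t in mapping.values() or t_atoms[t] != q_atoms[q]:
                continue
            # Each query bond to an already placed atom must exist in the target.
            consistent = all(
                frozenset((t, mapping[p])) in t_bonds
                for p in mapping
                if frozenset((q, p)) in q_bonds
            )
            if consistent:
                mapping[q] = t
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[q]               # backtrack and try the next candidate
        return None

    return extend({})


# Target: ethanol heavy atoms C-C-O. Query: a C-O fragment.
ethanol_atoms = {1: "C", 2: "C", 3: "O"}
ethanol_bonds = {frozenset((1, 2)), frozenset((2, 3))}
print(find_substructure({"a": "C", "b": "O"}, {frozenset(("a", "b"))},
                        ethanol_atoms, ethanol_bonds))
```

Every query atom may be tried against every target atom, so the number of candidate mappings explodes for large molecules. A cheap bit-level screen that rejects most of the database first keeps this step affordable.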


A pharmacophore is a set of features (e.g. steric, electrostatic) that are responsible for a particular biological or pharmacological interaction. Structural fragments may exhibit these features.

Pharmacophore uses

  1. Aids design of optimal ligands by focusing on the important structural features, and how modifications will affect activity.
  2. Scaffold hopping by modifying the central core structure of a molecule, but leaving the pharmacophoric groups (and interaction) the same. You may be able to avoid patents this way.
  3. Multifunctional drug design is possible.

Pharmacophore generation

Identify a set of molecules that are known to interact with the target.

Interpret their chemical structures in terms of both substructures (e.g. aromatic ring) and properties expected to be responsible for function (e.g. H bond donor or acceptor, hydrophobic region).

Align the molecules and pick the models most relevant to biological activity.

Validate the pharmacophore model: check statistical significance, and test biologically.


Similarity searching is useful for

  1. finding a molecule in a database
  2. filtering
  3. testing for uniqueness
  4. ranking
  5. clustering

Structurally similar molecules tend to exhibit similar physical and biological activities, so testing for structural similarity is a good option.

Similarity Measures

Picture a Venn diagram of the features of molecules A and B:

a = features present in A but not in B

b = features present in B but not in A

c = features present in both A and B

d = features not present in A or B


Tanimoto similarity score is:

$$ T = \frac{c}{a+b+c} $$

d is not used: when features are sparse, most features are absent from both molecules, so d says little about how alike they are.
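A minimal sketch of the score in Python, treating each molecule as a set of feature ids (for a fingerprint, the positions of its set bits). The function name and the choice to return 0.0 for two empty sets are assumptions of this example.

```python
def tanimoto(features_a: set, features_b: set) -> float:
    """Tanimoto similarity T = c / (a + b + c) for two feature sets."""
    c = len(features_a & features_b)   # features in both A and B
    a = len(features_a - features_b)   # features only in A
    b = len(features_b - features_a)   # features only in B
    if a + b + c == 0:
        return 0.0                     # assumed convention for two empty sets
    return c / (a + b + c)


A = {1, 2, 3, 4}
B = {3, 4, 5}
print(tanimoto(A, B))   # c = 2, a = 2, b = 1, so T = 2 / 5
```

The score is 1 for identical feature sets and 0 when nothing is shared.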

Asymmetric measure - Tversky

Tversky measure is asymmetric, with different weightings on features of the prototype and variant.

$$ S = \frac{c}{\alpha a + \beta b + c} $$

It is useful if you are comparing a prototype against many variants. By altering α and β, you may stress the importance of either substructure or superstructure.

As an extreme example, take molecule A as the prototype and set α = 1 and β = 0. Then S = c / (a + c): only the prototype's features matter, and S measures how far B is a superstructure of A.

If A is a substructure of B, then B will have all of A's features and more, so a = 0.

If A is a superstructure of B, then A will have all of B's features and more, so b = 0.
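The effect of the weights can be checked numerically. This sketch reuses the a, b, c counts defined for the Venn diagram; the function name and the zero-denominator convention are assumptions of the example.

```python
def tversky(features_a: set, features_b: set, alpha: float, beta: float) -> float:
    """Tversky similarity S = c / (alpha * a + beta * b + c)."""
    c = len(features_a & features_b)   # features in both
    a = len(features_a - features_b)   # features only in A (the prototype)
    b = len(features_b - features_a)   # features only in B (the variant)
    denominator = alpha * a + beta * b + c
    return c / denominator if denominator else 0.0


prototype = {1, 2}           # A
variant = {1, 2, 3, 4}       # B contains all of A's features and more, so a = 0

print(tversky(prototype, variant, alpha=1, beta=0))   # only A's features count
print(tversky(prototype, variant, alpha=0, beta=1))   # only B's features count
print(tversky(prototype, variant, alpha=1, beta=1))   # same as Tanimoto
```

With α = 1 and β = 0 the score is 1.0 because every prototype feature is found in the variant; swapping the weights penalises the variant's extra features and gives 0.5.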


A monotonic function preserves order, e.g. for all x <= y, f(x) <= f(y).

Two similarity measures f and g are monotonic with each other if, for all pairs of molecules A and B, f(A,B) and g(A,B) give the same ranking.


A distance measure f is a metric if

  1. non-negativity: f(a,b) >= 0 for all a and b, and f(a,a) = 0
  2. symmetry: f(a,b) = f(b,a)
  3. triangle inequality: f(a,b) <= f(a,c) + f(c,b)
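These conditions can be spot-checked by brute force on a handful of feature sets. The sketch assumes that a distance is obtained as 1 minus a similarity, which these notes do not state explicitly; the function names are made up, and passing the check on a few points is evidence, not proof.

```python
from itertools import product


def tanimoto_distance(x, y):
    """1 - Tanimoto similarity of two non-empty feature sets."""
    return 1.0 - len(x & y) / len(x | y)


def asymmetric_distance(x, y):
    """1 - Tversky similarity with alpha = 1, beta = 0 (x is the prototype)."""
    return 1.0 - len(x & y) / len(x)


def looks_like_metric(points, dist, tol=1e-12):
    """Check the three conditions on every triple drawn from points."""
    for x, y, z in product(points, repeat=3):
        if dist(x, y) < 0 or dist(x, x) != 0:
            return False                      # non-negativity, f(a,a) = 0
        if abs(dist(x, y) - dist(y, x)) > tol:
            return False                      # symmetry
        if dist(x, y) > dist(x, z) + dist(z, y) + tol:
            return False                      # triangle inequality
    return True


points = [frozenset(s) for s in ({1}, {1, 2}, {2, 3}, {1, 2, 3}, {4})]
print(looks_like_metric(points, tanimoto_distance))
print(looks_like_metric(points, asymmetric_distance))
```

The Tanimoto-based distance passes on these points, while the asymmetric one fails the symmetry condition, as expected from its definition.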

Similarity score distribution of fingerprint-based searches

Score distribution is affected by:

  1. feature frequency
  2. feature correlation (bit correlation)
  3. fingerprint size
  4. fingerprint density (derived from feature)
  5. the comparison set

Data fusion

Run the search with several different fingerprints to get multiple ranked hit lists, then merge the ranks of each molecule into a single ranking.

It works because

  1. active (similar) molecules are more tightly clustered than inactive ones
  2. active molecules are usually picked by all fingerprinting methods, but inactive ones are not repeatedly picked
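One way to merge ranked hit lists is to sum each molecule's rank across the lists. The notes do not say which fusion rule is meant, so rank-sum is an assumption here, as are the function name, the penalty rank for molecules missing from a list, and the made-up molecule ids.

```python
def fuse_by_rank_sum(hit_lists):
    """Merge ranked hit lists (best first) into one list, lowest rank sum first."""
    scores = {}
    for mol in set().union(*hit_lists):
        total = 0
        for hits in hit_lists:
            if mol in hits:
                total += hits.index(mol) + 1   # 1-based rank in this list
            else:
                total += len(hits) + 1         # assumed penalty rank for a miss
        scores[mol] = total
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda mol: (scores[mol], mol))


# Three hypothetical hit lists from three different fingerprints.
hits_fp1 = ["m1", "m2", "m3"]
hits_fp2 = ["m2", "m1", "m4"]
hits_fp3 = ["m2", "m3", "m1"]
print(fuse_by_rank_sum([hits_fp1, hits_fp2, hits_fp3]))
```

Molecules that every fingerprint ranks well (m2 and m1 here) rise to the top, while one picked by a single method (m4) sinks to the bottom.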