Quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) models predict a dependent variable, such as a biological, pharmacological or toxicity property, from independent variables: descriptors of the molecule (e.g. its structure).

Data set

  1. training
  2. testing
  3. external evaluation

Enough data must be available for training.

For training, a recommended minimum is 5 compounds per descriptor, or perhaps 20 compounds for continuous data and 10 for categorical data.

Test and external evaluation sets may have fewer compounds.


Types of outliers include structural outliers and activity outliers. These should be removed from the data.

To detect structural outliers, plot the molecules in descriptor space. Molecules with no neighbours within a given radius are dissimilar to the other compounds and can be removed.
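The neighbour check can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance on already-normalised descriptors. The function name `structural_outliers`, the descriptor values and the radius are all made up for the example.

```python
import math

def structural_outliers(points, radius):
    """Return indices of compounds with no neighbour within `radius`."""
    outliers = []
    for i, p in enumerate(points):
        has_neighbour = any(
            math.dist(p, q) <= radius
            for j, q in enumerate(points)
            if j != i
        )
        if not has_neighbour:
            outliers.append(i)
    return outliers

# Hypothetical, already-normalised values of two descriptors for five compounds.
descriptors = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.20), (0.12, 0.30), (0.90, 0.95)]
print(structural_outliers(descriptors, radius=0.3))  # [4]
```

The result depends strongly on the chosen radius and on how the descriptors were scaled, so both would need to be justified for a real data set.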

An activity outlier is a compound for which a small change in descriptor values leads to a significant change in molecular properties or activity.

ref: Handbook of Computational Chemistry (Leszczynski 2012), pp. 1315

Imbalanced data set

A class is a structural descriptor, or an activity describing the compound.

To develop the model, you need a number of sample compounds for each class.

Preferably each class should have a similar number of compounds, otherwise the data set is imbalanced and there will be modelling bias towards the larger class.

If there is an imbalance, you can undersample the larger class (leave compounds out) or oversample the smaller class (include compounds repeatedly).
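Both options can be sketched with random sampling. This is an illustration only, assuming two classes where `majority` has at least as many compounds as `minority`. The function names and compound labels are hypothetical.

```python
import random

def undersample(majority, minority, seed=0):
    """Randomly leave out majority-class compounds until the classes match."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)

def oversample(majority, minority, seed=0):
    """Repeat randomly chosen minority-class compounds until the classes match."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

# Hypothetical imbalanced data set: 8 inactive compounds, 2 active compounds.
inactive = [f"inactive_{i}" for i in range(8)]
active = ["active_0", "active_1"]

under_inactive, under_active = undersample(inactive, active)
over_inactive, over_active = oversample(inactive, active)
print(len(under_inactive), len(under_active))  # 2 2
print(len(over_inactive), len(over_active))    # 8 8
```

Undersampling discards data, while oversampling repeats compounds, so repeated compounds should not end up in both the training and the test set.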

Compounds are either active or inactive, and the two groups overlap in descriptor space. Include only the actives and those inactives that are similar to the actives.


  1. Remove duplicate compounds
  2. Add or remove hydrogens
  3. Remove salts
  4. Remove heavy metal compounds
  5. Filter based on size
  6. Calculate partial charges
  7. Calculate 3D conformations for 3D descriptors

Normalisation of descriptor values

Different descriptors have different ranges, which some methods may not cope with, so normalise the descriptor values.

  1. Range scaling: 0% to 100% (minimum to maximum)
  2. Autoscaling: z-score
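A minimal sketch of the two normalisation methods for a single descriptor. Range scaling is expressed here as a fraction from 0 to 1 rather than a percentage, and autoscaling uses the sample standard deviation. The function names and molecular weights are illustrative.

```python
import statistics

def range_scale(values):
    """Map the minimum to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def autoscale(values):
    """Z-score: subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical molecular weights for three compounds.
mol_weight = [150.0, 300.0, 450.0]
print(range_scale(mol_weight))  # [0.0, 0.5, 1.0]
print(autoscale(mol_weight))    # [-1.0, 0.0, 1.0]
```

Both functions divide by zero for a descriptor that has the same value for every compound, which is one practical reason to remove such descriptors first.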


Descriptors removal

  1. Remove low variation descriptors (e.g. if all your compounds are aromatic, an aromaticity descriptor tells you nothing)
  2. Remove descriptors with outliers
  3. Remove correlated descriptors
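The low-variation and correlation filters can be sketched as follows. Removal of descriptors with outliers is not covered. The thresholds `min_variance` and `max_abs_r`, the function names and the descriptor table are assumptions for illustration, not recommended values.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def filter_descriptors(table, min_variance=1e-6, max_abs_r=0.9):
    """Return the names of the descriptors to keep, in their original order."""
    kept = []
    for name, values in table.items():
        if statistics.pvariance(values) < min_variance:
            continue  # low variation: carries no information
        if any(abs(pearson(values, table[k])) > max_abs_r for k in kept):
            continue  # strongly correlated with a descriptor already kept
        kept.append(name)
    return kept

# Hypothetical descriptor table: one list of values per descriptor, one value per compound.
table = {
    "aromatic_rings": [1, 1, 1, 1, 1],
    "mol_weight": [150.0, 200.0, 250.0, 300.0, 350.0],
    "heavy_atoms": [11, 14, 18, 21, 25],
    "logp": [2.1, 0.5, 3.0, 1.2, 2.4],
}
print(filter_descriptors(table))  # ['mol_weight', 'logp']
```

Which of two correlated descriptors is kept depends on their order in the table, so the more interpretable descriptor should come first.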

Descriptor addition

  1. Add based on knowledge
  2. Add based on an algorithm, e.g. stepwise regression

Descriptor count

Ideally you want good coverage of each possible combination of descriptor values. Make sure there are enough compounds to cover most of the descriptor space.

Methods of modelling

  1. Multiple linear regression. The fit leaves residuals.
  2. Partial least squares. Uses latent variables, can deal with multiple dependent variables (e.g. toxicity, absorption) and tolerates correlated descriptors.
  3. Artificial neural networks. A black box trained with data; good for non-linear relationships.
  4. Decision trees. Repeatedly halve the data based on descriptor values to decide on active/inactive.

R2 correlation

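R2 is computed from the residual sum of squares (RSS) and the total sum of squares (TSS). For a least-squares fit evaluated on its own training data it lies between 0 and 1.

A value of 1 is best. R2 indicates how much of the variation in the dependent variable is explained by the model.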

$$ R^2 = 1 - \frac{RSS}{TSS} $$

$$ RSS = \sum_{i=1}^{n} (y_i - y_{ci})^2 $$

$$ TSS = \sum_{i=1}^{n} (y_i - \mu)^2 $$

where yi is the experimentally observed value, yci is the predicted value and μ is the average of the observed values.

Outliers will affect R2
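The formulas translate directly into code. This is a small sketch with made-up observed and predicted values. The second call adds one badly predicted compound to show how much a single outlier can move R2.

```python
def r_squared(observed, predicted):
    """R2 = 1 - RSS / TSS, following the definitions above."""
    mean = sum(observed) / len(observed)
    rss = sum((y - yc) ** 2 for y, yc in zip(observed, predicted))
    tss = sum((y - mean) ** 2 for y in observed)
    return 1 - rss / tss

# Hypothetical observed and predicted activities.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2_clean = r_squared(observed, predicted)

# Same data plus one outlier (observed 5.0, predicted 7.0).
r2_outlier = r_squared(observed + [5.0], predicted + [7.0])

print(round(r2_clean, 2))    # 0.98
print(round(r2_outlier, 2))  # 0.59
```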

R2 Adjusted

$$ R^2_{adjusted} = 1- (1-R^2)\frac{n-1}{n-c-1} $$

where n is the number of compounds and c is the number of descriptors.

If an added descriptor does not help, adjusted R2 will decrease.
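A short worked example of the penalty, using invented numbers: adding a fourth descriptor that raises R2 only from 0.800 to 0.801 lowers adjusted R2.

```python
def adjusted_r_squared(r2, n, c):
    """Adjusted R2 for n compounds and c descriptors."""
    return 1 - (1 - r2) * (n - 1) / (n - c - 1)

# Hypothetical model: 20 compounds, 3 descriptors, R2 = 0.800.
before = adjusted_r_squared(0.800, n=20, c=3)

# Add one descriptor that barely improves the fit: R2 = 0.801.
after = adjusted_r_squared(0.801, n=20, c=4)

print(round(before, 4))  # 0.7625
print(round(after, 4))   # 0.7479
```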


Cross-validation removes part of the data set, rebuilds the model using the remaining data, and tests it against the removed data.

Leave-one-out (LOO), which removes one compound at a time, is the simplest form.

Q2, the cross-validated counterpart of R2, measures goodness of prediction. It should not differ too much from the original R2 value.
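Leave-one-out can be sketched with a one-descriptor least-squares line as the model. The notes do not give a formula for Q2, so this assumes the common definition Q2 = 1 - PRESS / TSS, where PRESS is the sum of squared errors on the left-out compounds and TSS uses the mean of all observed values. The data and function names are illustrative.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for one descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

def loo_q2(x, y):
    """Leave-one-out Q2, assuming Q2 = 1 - PRESS / TSS."""
    press = 0.0
    for i in range(len(x)):
        # Rebuild the model without compound i, then predict compound i.
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    mean = sum(y) / len(y)
    tss = sum((v - mean) ** 2 for v in y)
    return 1 - press / tss

# Hypothetical descriptor values and observed activities.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]
print(loo_q2(x, y))
```

Because each left-out compound is predicted by a model that never saw it, Q2 is normally somewhat lower than the R2 of the model fitted on all compounds.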


Scramble the Y (observed) values, rebuild the model, and see whether the model changes. A sound model should perform much worse on the scrambled data.
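A minimal sketch of the scrambling test, again using a one-descriptor least-squares line as a stand-in for the real model. The number of rounds, the seed, the data and the function names are arbitrary choices for illustration.

```python
import random

def fit_r2(x, y):
    """R2 of a one-descriptor least-squares line fitted to (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    rss = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    tss = sum((b - my) ** 2 for b in y)
    return 1 - rss / tss

def y_scramble(x, y, n_rounds=100, seed=0):
    """Refit the model on shuffled Y values and collect the R2 of each refit."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = list(y)
        rng.shuffle(shuffled)
        scores.append(fit_r2(x, shuffled))
    return scores

# Hypothetical descriptor values and observed activities.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]

real_r2 = fit_r2(x, y)
scrambled = y_scramble(x, y)
print(real_r2, sum(scrambled) / len(scrambled))
```

If the scrambled models score about as well as the real one, the original fit is likely due to chance.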



  1. unreliable or data contains errors
  2. contains outliers
  3. low quality


  1. not enough
  2. correlated
  3. errors in data
  4. difficult to interpret

Statistical Methods

  1. simple, do not overfit
  2. must be validated externally
  3. appropriate

Activity cliff

A small change in a compound (e.g. one hydrogen bond) can have a huge impact on activity.

Linear models are unlikely to model this.

Activity outliers may be due to the cliff.

More compounds near the cliff must be tested.
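One simple way to flag candidate cliffs is to look for pairs of compounds that are close in descriptor space but far apart in activity. This is a sketch under assumptions: Euclidean distance on normalised descriptors, with `max_distance` and `min_activity_gap` chosen arbitrarily and all values invented.

```python
import math
from itertools import combinations

def activity_cliffs(descriptors, activities, max_distance, min_activity_gap):
    """Return index pairs that are close in descriptor space but differ strongly in activity."""
    pairs = []
    for i, j in combinations(range(len(descriptors)), 2):
        close = math.dist(descriptors[i], descriptors[j]) <= max_distance
        large_gap = abs(activities[i] - activities[j]) >= min_activity_gap
        if close and large_gap:
            pairs.append((i, j))
    return pairs

# Hypothetical normalised descriptors and activities for four compounds.
descriptors = [(0.10, 0.20), (0.12, 0.21), (0.50, 0.60), (0.52, 0.61)]
activities = [5.0, 8.1, 6.0, 6.2]

print(activity_cliffs(descriptors, activities, max_distance=0.05, min_activity_gap=2.0))  # [(0, 1)]
```

Compounds 2 and 3 are just as close to each other as compounds 0 and 1, but their activities are similar, so only the first pair is flagged.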

Lack of invariance of chemical space

In computing, invariants do not change.

Chemical space is the space whose dimensions are the descriptors.

The distance between compounds depends on the descriptors used, so chemical space is not invariant.
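A toy example of this dependence: the nearest neighbour of compound A changes when a different set of descriptors is used. The compounds, descriptor names and values are invented, and Euclidean distance on normalised values is assumed.

```python
import math

# Hypothetical normalised descriptor values for three compounds.
compounds = {
    "A": {"mol_weight": 0.20, "logp": 0.30, "h_donors": 0.90},
    "B": {"mol_weight": 0.25, "logp": 0.35, "h_donors": 0.10},
    "C": {"mol_weight": 0.60, "logp": 0.70, "h_donors": 0.85},
}

def distance(a, b, descriptors):
    """Euclidean distance between two compounds using only the chosen descriptors."""
    return math.dist(
        [compounds[a][d] for d in descriptors],
        [compounds[b][d] for d in descriptors],
    )

def nearest(name, descriptors):
    """Name of the compound closest to `name` in the chosen descriptor space."""
    others = [c for c in compounds if c != name]
    return min(others, key=lambda c: distance(name, c, descriptors))

print(nearest("A", ["mol_weight", "logp"]))  # B
print(nearest("A", ["h_donors"]))            # C
```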