# Association and correlation

category|numeric vs. category|numeric

# Analysis of Variance (ANOVA)

Used when you have two samples of numerical data.

Tests whether the two samples could come from the same population by comparing their variances. This two-sample form is also known as the F-test for equality of variances.

## Calculate test statistic

$$F = \frac{\sigma_1^2}{\sigma_2^2}$$

where $\sigma_1^2$ is the variance of sample 1

and $\sigma_2^2$ is the variance of sample 2.
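As a minimal sketch of the ratio above, the snippet below computes F with the standard library only. The function name `f_statistic` and the sample data are made up for illustration, and it assumes the variances are sample variances (dividing by n - 1).

```python
from statistics import variance


def f_statistic(sample_1, sample_2):
    """Return F = var(sample_1) / var(sample_2) and both degrees of freedom."""
    # statistics.variance is the sample variance (divides by n - 1).
    f_value = variance(sample_1) / variance(sample_2)
    return f_value, len(sample_1) - 1, len(sample_2) - 1


# Made-up data, for illustration only.
sample_1 = [2.0, 4.0, 6.0, 8.0]
sample_2 = [1.0, 2.0, 3.0, 4.0, 5.0]

f_value, df_1, df_2 = f_statistic(sample_1, sample_2)
print(f_value, df_1, df_2)
```

Which sample goes in the numerator is a convention; many texts put the larger variance on top so that F is at least 1.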

## Compare to F distribution

Compare the F value to the F cumulative distribution with degrees of freedom m-1 and n-1, where m is the size of sample 1 and n is the size of sample 2.

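Formula for the F distribution with (m, n) degrees of freedom (here m and n are the degrees of freedom, not the sample sizes):

$$F_{m,n} = \frac{\sum_{i=1}^m Z_i^2 / m}{\sum_{j=1}^n W_j^2 / n}$$

where each $Z_i$ and $W_j$ is an independent random value taken from $N(0,1)$.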

Large values are more extreme.
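The definition in terms of standard normal values can be turned directly into a simulation. The sketch below is a Monte Carlo approximation of the upper tail of the F distribution, not an exact calculation; the function names, the number of draws and the seed are arbitrary choices for illustration.

```python
import random


def draw_f(m, n, rng):
    """Draw one value from F(m, n) built from independent N(0, 1) values."""
    numerator = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m)) / m
    denominator = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n
    return numerator / denominator


def upper_tail_probability(f_observed, m, n, draws=20000, seed=0):
    """Estimate P(F >= f_observed) by counting simulated values at least as large."""
    rng = random.Random(seed)
    hits = sum(draw_f(m, n, rng) >= f_observed for _ in range(draws))
    return hits / draws


# Degrees of freedom 3 and 4 correspond to samples of size 4 and 5.
p_estimate = upper_tail_probability(2.67, 3, 4)
print(p_estimate)
```

If SciPy is available, `scipy.stats.f.sf(f_observed, m, n)` gives the same upper-tail probability without simulation.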

# One-Way ANOVA

Used when you have k sets of numerical data.

Tests whether the means of all k groups are equal. If you find that they differ, you still need to test each pair to find out which groups differ.

## Test Statistic

Calculate variation between groups, SSB, and DFB

Calculate total mean.

For each group, calculate the group mean, take its difference from the total mean, square it, and multiply it by the number of values in the group.

$$SSB = \sum_{i=1}^k n_i (\mu - \mu_i)^2$$

where $n_i$ is the count for group i, $\mu_i$ is the mean of group i, and $\mu$ is the total mean. Weighting by $n_i$ gives each group a weight proportional to its size.

$$DFB = k-1$$

Calculate variation within groups, SSW, and DFW

For each of the k groups, for each value in that group, square the difference between the value and the group's mean.

$$SSW = \sum_{i=1}^k \sum_{j=1}^{n_i} (x_{i,j}-\mu_i) ^ 2$$

where $x_{i,j}$ is the jth value in group i

and $n_i$ is the total number of values in the ith group.

$$DFW = N - k$$

where N is the total number of values, k is the number of groups.

$$F = \frac{SSB / DFB}{SSW / DFW}$$
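The steps above can be sketched in plain Python as follows. The function name `one_way_anova` and the three small groups are made up for illustration; the arithmetic follows the SSB, SSW, DFB and DFW definitions given here.

```python
def one_way_anova(groups):
    """Return (SSB, DFB, SSW, DFW, F) for a list of groups of numbers."""
    all_values = [x for group in groups for x in group]
    total_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Between-group variation: squared distance of each group mean from the
    # total mean, weighted by the group size.
    ssb = sum(len(g) * (total_mean - m) ** 2 for g, m in zip(groups, group_means))
    # Within-group variation: squared distance of each value from its own group mean.
    ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    dfb = len(groups) - 1
    dfw = len(all_values) - len(groups)
    f_value = (ssb / dfb) / (ssw / dfw)
    return ssb, dfb, ssw, dfw, f_value


# Made-up data: three groups with means 2, 3 and 6.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
ssb, dfb, ssw, dfw, f_value = one_way_anova(groups)
print(ssb, dfb, ssw, dfw, f_value)
```

For this data SSB is 26 and SSW is 6, so F = (26 / 2) / (6 / 6) = 13. If SciPy is available, `scipy.stats.f_oneway(*groups)` should report the same statistic together with a p-value.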

## Compare to F distribution

Compare the F value to the F cumulative distribution with DFB and DFW degrees of freedom.

## Bonferroni correction, Ryan correction

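The F test will tell you whether the sample means differ among the k groups, but not which groups differ.

Adjust the per-comparison level to $\alpha_{FWE}/N$ (for example 0.05/N) so that the overall type I error rate stays at $\alpha_{FWE}$. Here N is the number of pairwise tests required, not the total number of values. FWE stands for family-wise error. This adjustment is conservative because the tests are correlated.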

$$\alpha_{ij} = \frac{\alpha_{FWE}}{N}$$
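As a small illustration of the Bonferroni adjustment, the sketch below lists every pair of groups and divides the family-wise level by the number of pairs. The group labels and the function name `bonferroni_alpha` are hypothetical, and it assumes every pair of groups is tested.

```python
from itertools import combinations


def bonferroni_alpha(group_names, alpha_fwe=0.05):
    """Return all pairwise comparisons and the per-comparison alpha."""
    pairs = list(combinations(group_names, 2))  # k * (k - 1) / 2 pairs
    return pairs, alpha_fwe / len(pairs)


pairs, alpha_per_test = bonferroni_alpha(["A", "B", "C", "D"])
print(len(pairs), alpha_per_test)
```

With four groups there are six pairwise tests, so each one is run at 0.05 / 6, roughly 0.0083.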

Ryan adjustment: rank the group means ($R_i$ is the rank of the mean of group i) and test the pairs with the largest differences first.

$$\alpha_{ij} = \frac{\alpha_{FWE}}{N / |R_i - R_j|}$$

## Kruskal-Wallis H-test

Works on the same data as ANOVA (k groups of data).

Non-parametric. Ranks all data.

Tests whether a random value from one group ranks higher than a random value from another group 50% of the time.
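These notes do not give the H statistic itself, so the sketch below uses the usual textbook form, $H = \frac{12}{N(N+1)} \sum_i R_i^2 / n_i - 3(N+1)$, where $R_i$ is the rank sum of group i. It assumes no correction for ties beyond average ranks; the helper names and data are made up for illustration.

```python
def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return the H statistic (without tie correction) for a list of groups."""
    pooled = [x for group in groups for x in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted_rank_sums = 0.0
    offset = 0
    for group in groups:
        rank_sum = sum(ranks[offset:offset + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        offset += len(group)
    return 12 / (n_total * (n_total + 1)) * weighted_rank_sums - 3 * (n_total + 1)


# Made-up, fully separated groups: rank sums are 6, 15 and 24.
h_value = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(h_value)
```

H is usually compared with a chi-squared distribution with k - 1 degrees of freedom. If SciPy is available, `scipy.stats.kruskal` computes the statistic (with tie correction) and a p-value.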

# Correlation

$H_0$: There is no correlation.

The test statistic is 0 when there is no correlation.

## Pearson's product-moment correlation coefficient

Assumes a bivariate normal distribution of the x, y values.

Affected by outliers.

Ranges from -1 to 1

Detects linear correlation (y=ax+b)

$$r = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i=1}^{N}(x_i - \mu_x)^2 \sum_{i=1}^{N}(y_i - \mu_y)^2}}$$

To assess significance, compare t against the t distribution with N-2 degrees of freedom.

$$t = r \sqrt{\frac{N-2}{1-r^2}}$$
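A minimal sketch of the coefficient and its t statistic, using only the standard library. The function names and the five data points are made up for illustration; the denominator is the square root of the product of the two sums of squares.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's r for two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """Return t = r * sqrt((n - 2) / (1 - r^2)); undefined when |r| is exactly 1."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
t = t_statistic(r, len(xs))
print(r, t)
```

Here r is 6 / sqrt(60), about 0.775, and t is about 2.12 with 3 degrees of freedom. If SciPy is available, `scipy.stats.pearsonr(xs, ys)` returns r together with a two-sided p-value.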

## Spearman's rank correlation

Uses rank instead of actual x and y values.

$$\rho = \frac{\sum_{i=1}^{N} (Rx_i - \mu_{Rx})(Ry_i - \mu_{Ry})}{\sqrt{\sum_{i=1}^{N}(Rx_i - \mu_{Rx})^2 \sum_{i=1}^{N}(Ry_i - \mu_{Ry})^2}}$$

$$\mu_{Rx} = \mu_{Ry} = (N+1)/2$$

$$t = \rho \sqrt{\frac{N-2}{1-\rho^2}}$$

Again, compare with the t distribution with N-2 degrees of freedom.
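Since Spearman's coefficient is Pearson's formula applied to ranks, it can be sketched by ranking both variables first. The helper names and data below are made up for illustration, and giving tied values their average rank is an assumption the notes do not spell out.

```python
import math


def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for position in range(start, end + 1):
            ranks[order[position]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Return Spearman's rho: Pearson's correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_rank = (len(xs) + 1) / 2  # mean of the ranks 1..N, with or without ties
    sxy = sum((a - mean_rank) * (b - mean_rank) for a, b in zip(rx, ry))
    sxx = sum((a - mean_rank) ** 2 for a in rx)
    syy = sum((b - mean_rank) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: y is mostly increasing in x, with two swapped neighbours.
xs = [10, 20, 30, 40, 50]
ys = [1, 3, 2, 5, 4]
rho = spearman_rho(xs, ys)
t = rho * math.sqrt((len(xs) - 2) / (1 - rho ** 2))
print(rho, t)
```

For this data rho is 0.8 and t is about 2.31 with 3 degrees of freedom. If SciPy is available, `scipy.stats.spearmanr(xs, ys)` gives rho and a p-value.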

## Kendall's tau

Matches all pairs of data points: N(N-1)/2 pairs.

Tallies each pair as concordant, discordant, extra x (tied in y only), extra y (tied in x only), or match (tied in both).

Maps the result onto a standard normal (Z) distribution.
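The pair tally can be sketched as below. The notes do not give the formula for tau or for the Z mapping, so this assumes the common tie-adjusted form (tau-b) and the usual normal approximation with variance 2(2N+5) / (9N(N-1)), which strictly holds only when there are no ties. Function names and data are made up for illustration.

```python
import math
from itertools import combinations


def kendall_tau(xs, ys):
    """Return (tau, counts) where counts tallies every pair of data points."""
    counts = {"concordant": 0, "discordant": 0,
              "tied_x_only": 0, "tied_y_only": 0, "tied_both": 0}
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x2 - x1, y2 - y1
        if dx == 0 and dy == 0:
            counts["tied_both"] += 1
        elif dx == 0:
            counts["tied_x_only"] += 1
        elif dy == 0:
            counts["tied_y_only"] += 1
        elif dx * dy > 0:
            counts["concordant"] += 1
        else:
            counts["discordant"] += 1
    untied = counts["concordant"] + counts["discordant"]
    # Tie-adjusted denominator; equals N(N-1)/2 when there are no ties.
    denominator = math.sqrt((untied + counts["tied_x_only"]) *
                            (untied + counts["tied_y_only"]))
    return (counts["concordant"] - counts["discordant"]) / denominator, counts


def tau_z_score(tau, n):
    """Normal approximation for tau under the null hypothesis (no ties assumed)."""
    return tau / math.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))


# Made-up data: 10 pairs, of which 8 are concordant and 2 discordant.
xs = [10, 20, 30, 40, 50]
ys = [1, 3, 2, 5, 4]
tau, counts = kendall_tau(xs, ys)
z = tau_z_score(tau, len(xs))
print(tau, counts, z)
```

For this data tau is (8 - 2) / 10 = 0.6 and z is about 1.47. If SciPy is available, `scipy.stats.kendalltau(xs, ys)` computes tau and a p-value.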