# Statistics

## Random Variable

A random variable assigns a number to each outcome of a random process, e.g. X = the value of a single die roll, Y = the sum of 7 die rolls.

Now you can write P(X = 3) = 1/6, or P(Y <= 20).

## Population vs Sample

A sample is a subset of a complete population. It is often not feasible to sample the whole population.

## Average

An average is a way to calculate a central value. It could refer to mean, median or mode.

## Mean

The arithmetic mean: sum all values and divide by n, where n is the number of values.

$$\mu = \frac {1}{n} \sum_{i=1}^{n} x_i$$

### True mean

The population mean (as opposed to the mean of your sample).

## Frequency

How often a value occurs.

Absolute frequency of a value is a count of the number of times it occurred.

Relative frequency of a value is expressed as a percentage or fraction of how often it occurred.

## Mode

The most frequently occurring value.

If two values tie for most frequently occurring, the distribution is considered bimodal.

## Median

After ordering all the values, this is the value in the middle. If there are two middle values (an even number of data points), take the mean of them.
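
A quick sketch of these three central values, using Python's standard library (the data values are made up):

```python
import statistics

data = [2, 3, 3, 5, 7, 8]

print(statistics.mean(data))    # 28 / 6 = 4.666...
print(statistics.median(data))  # even count: mean of middle two (3 and 5) = 4.0
print(statistics.mode(data))    # 3 occurs most often
```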

## Variance

For a sample, first determine the mean.

Then for each value, find the difference between the value and the mean, square it, and add it to a running total.

Then divide by n (or by n - 1 for a sample, which gives an unbiased estimate; this is Bessel's correction).

$$s^2 = \frac {1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x} )^2$$
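
A sketch of the n vs n - 1 divisor difference, using the standard library (the data values are made up):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5; squared deviations sum to 32

print(statistics.pvariance(data))  # population variance: 32 / 8 = 4.0
print(statistics.variance(data))   # sample variance: 32 / 7 (Bessel's correction)
```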

Sum of variances of independent random variables:

If X = Y - Z, with Y and Z independent,

then Var(X) = Var(Y) + Var(Z). (Variances add even when the variables are subtracted.)

## Range

range = max - min

## Interquartile Range

Difference between the first and third quartile.

## Discrete Random Variables and Continuous Random Variables

Discrete is for distinct, countable values (e.g. die rolls) or integers. It is possible to have an infinite number of integer outcomes and still be discrete.

Continuous is for any value in an interval, e.g. an exact speed between 0 and 50 m/s can have arbitrarily many decimal places, so there are infinitely many possible values.

## Probability Distribution

Plot the Y axis as probability. Probability is always between 0 and 1.

Plot the X axis as values (outcomes) or scores.

## Probability Density Function

Used for continuous distributions. The probability of any exact value is zero, so it only makes sense to calculate probabilities over ranges.

The area under the curve over a range (e.g. X between 1.9 and 2.1) is the probability.

The full area under the curve is 1.

## Cumulative Distribution Function

Formula for the area under the graph of a probability density function from -infinity to x. For a given x, it gives the area under the graph to the left of x.

You can evaluate this function at x2 and x1 to calculate the area under the curve between x1 and x2.

In a normal distribution, if you put x = mean, then you will get 50% or 0.5
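
A sketch using Python's `statistics.NormalDist` (the mean and std dev values here are arbitrary):

```python
from statistics import NormalDist

nd = NormalDist(mu=100, sigma=15)

# CDF: area under the density curve to the left of x.
print(nd.cdf(100))  # at the mean: 0.5

# Area between x1 and x2 is the difference of two CDF values.
print(nd.cdf(130) - nd.cdf(70))  # within 2 std devs: about 0.954
```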

## Expected Value

See Wikipedia for a fuller treatment.

Also known as expectation, mathematical expectation, EV, mean, or first moment.

Notation example (lottery ticket):

X = net profit from playing the lottery

E(X) = expected net profit from playing the lottery

### Discrete random variable

To calculate the expected value of a discrete random variable with a probability distribution:

Determine all events, and the probability and value of each.

Sum probability * value over all events. This is a weighted sum of the values.

manual example at getting data from an expected value
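
The weighted sum for a fair six-sided die, as a minimal sketch (using `Fraction` to keep the arithmetic exact):

```python
from fractions import Fraction

# Each face of a fair die has probability 1/6; E(X) is the
# probability-weighted sum of the values.
outcomes = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)

expected_value = sum(x * p for x in outcomes)
print(float(expected_value))  # 3.5
```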

## Law of large numbers

Take a random variable on a population, with a known expected value. If you take a sample, you will get a sample mean. The larger your sample is, the closer it will get to the expected value.

As n approaches infinity, the sample mean converges to E(X).
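
A simulation sketch of this convergence for a fair die (E(X) = 3.5); the seed is arbitrary, chosen only for reproducibility:

```python
import random

random.seed(0)

# Sample mean of n fair die rolls; larger n should land closer to 3.5.
def sample_mean(n):
    return sum(random.randint(1, 6) for _ in range(n)) / n

for n in (10, 1000, 100_000):
    print(n, sample_mean(n))
```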

## Permutations

How many ways can you pick r objects from n objects. Order picked matters, objects are not replaced.

Formula: P(n,r) = nPr = n! / (n-r)!

e.g. out of 10 contestants, how many ways (or scenarios) can the 1st, 2nd, 3rd and 4th prizes be awarded in a competition?

3 letter word example at Khan Academy.

26^3 three-letter words with no restrictions.

26 * 25 * 24 = 26! / (26-3)! three-letter words with no repeated letters.
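
Both examples, checked with `math.perm` from the standard library:

```python
import math

# 1st-4th prizes among 10 contestants; order matters.
print(math.perm(10, 4))                         # 10 * 9 * 8 * 7 = 5040
print(math.factorial(10) // math.factorial(6))  # same via n! / (n-r)!

# Three-letter words with no repeated letters:
print(math.perm(26, 3))                         # 26 * 25 * 24 = 15600
```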

## Combinations

How many ways can you pick r objects from n objects? Order picked does not matter, objects are not replaced.

Formula: C(n, r) = nCr = n! / (r!(n-r)!)

3 people chosen from 6 example at Khan Academy.
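
The same example, checked with `math.comb`:

```python
import math

# Choosing 3 people from 6; order does not matter.
print(math.comb(6, 3))  # 20

# A combination is a permutation divided by the r! orderings of each group.
print(math.perm(6, 3) // math.factorial(3))  # also 20
```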

## Inferential Statistics

Making inferences based on a sample of data. e.g. you might only have 100 samples out of 10000 population, and you want to generalise based on your sample to the population.

## z-score

Z-score of a value is how many standard deviations away a value is from the mean.

It can be applied to any distribution (not only normal distribution)

For a normal distribution, values within 1.96 standard deviations of the mean (two-sided) cover 95% of the distribution.

## Descriptive Statistics

Not inferential; simply describing the data.

## Sampling distribution

Random samples are taken from a population, e.g. samples of size n, then a statistic (e.g. the mean) is computed for each sample.

If you keep repeating this (take another n samples, compute the statistic), the distribution of that statistic is the "sampling distribution".

In this example we are using the mean as the statistic, so the distribution is the sampling distribution of the sample mean.

http://psych.hanover.edu/javatest/NeuroAnim/stats/SampDist_instr.html

This is different to a sample distribution. http://forrest.psych.unc.edu/research/vista-frames/help/lecturenotes/lecture06/sampling.html

## Skew

A description of asymmetry. Right skew = positively skewed (long tail to the right).

Left skew = negatively skewed (long tail to the left).

A normal distribution has skew 0.

Formula for skew: based on the cubed deviations from the mean (the third standardized moment).

Standard error of skew ~ sqrt(6/n), i.e. sqrt(3!/n).

z-score = skew / (standard error of skew)

## Kurtosis

A measure of how peaked a distribution is versus how heavy its tails are.

Tall / heavy-tailed = positive excess kurtosis (leptokurtic).

A normal distribution has kurtosis 3, so excess kurtosis subtracts 3 as an offset.

Flat / light-tailed = negative excess kurtosis (platykurtic).

Formula for kurtosis: based on fourth powers of deviations from the mean; remember to subtract 3.

Standard error of kurtosis ~ sqrt(24/n), i.e. sqrt(4!/n).

z-score = kurtosis / (standard error of kurtosis)

## Relative frequency

Relative frequency of a result = frequency count / total count
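
A small sketch of absolute vs relative frequency, with made-up die rolls:

```python
from collections import Counter

rolls = [3, 1, 3, 6, 3, 2, 6, 4]
counts = Counter(rolls)  # absolute frequencies

# relative frequency = frequency count / total count
relative = {value: count / len(rolls) for value, count in counts.items()}
print(relative[3])  # 3 occurred 3 times in 8 rolls: 0.375
```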

## Quantile

e.g. quartiles, deciles, percentiles. Cut points that divide a data set into N equal-sized groups (quartiles: 4, deciles: 10, percentiles: 100).

## Boxplot

The middle box spans the lower quartile, median and upper quartile.

Whiskers extend to the most extreme data points within 1.5 * interquartile_range of the quartiles:

lower whisker = min(data >= lower_quartile - 1.5 * interquartile_range)

upper whisker = max(data <= upper_quartile + 1.5 * interquartile_range)
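
A sketch of the whisker calculation. Note `statistics.quantiles` uses one of several quartile conventions, so the exact cut points can differ slightly from other tools; the data here are made up:

```python
import statistics

data = [1, 3, 4, 5, 5, 6, 7, 9, 20]  # 20 is an outlier

q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers reach the most extreme data points inside the fences.
print(min(x for x in data if x >= lower_fence))  # 1
print(max(x for x in data if x <= upper_fence))  # 9 (20 is outside)
```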

## Central limit theorem

You have a random variable or distribution (not necessarily normal).

From the distribution, take a sample of size n, and calculate the mean.

Take another sample of size n and again calculate the mean.

after repeating k trials, you'll have k means.

If you plot these, it will resemble a normal distribution.

As the sample size n increases, the sampling distribution looks more and more normal. Note the original distribution is not necessarily normal.

If n = 1, then it will look like the original distribution

e.g. if n=2, then some values might not be possible.

This is the sampling distribution of sample means.

Test it out at online stat book sampling distributions.

The mean of the sample means is the same as the original distribution's mean.

As n increases, variance or standard deviation decreases.

Variance of sample mean = var of original dist / n

Std dev of the sampling distribution of the sample mean = original_std_dev / sqrt(n).
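
A simulation sketch of that last relationship, using a fair die as the (non-normal) original distribution; the seed and trial count are arbitrary:

```python
import random
import statistics

random.seed(1)

# A fair die has std dev sqrt(35/12), about 1.708.
def mean_of_sample(n):
    return sum(random.randint(1, 6) for _ in range(n)) / n

n = 25
means = [mean_of_sample(n) for _ in range(5_000)]

# Std dev of the sample means should be near 1.708 / sqrt(25) ~ 0.342.
print(statistics.stdev(means))
```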

### Null Hypothesis

Assumes there is no effect or relationship between variables. The mean weight of group 1 is the same as the mean weight of group 2.

Hypotheses can also be directional (one-tailed); see the video.

### Null distribution

The distribution of a test statistic (e.g. measurement such as mean) if the null hypothesis is true.

### z-test

Appropriate when you can assume the sampling distribution is normal, e.g. if your sample size >= 30.

You need to know the population standard deviation, rather than estimating it with the sample standard deviation.

A z-statistic is how many standard deviations we are above the mean, i.e. the z-score.

## t-test

For small sample sizes (< 30), use the t-test. The t-distribution has fatter tails.

Population standard deviation is unknown.

If the sample size or number of data points is 7, then the degrees of freedom is 6.

A t-table lets you look up the standard deviation multiple using two parameters:

1. degrees of freedom

2. your confidence level (e.g. 95%)

Work out your sample standard deviation, s.

Divide it by sqrt(n) to get the standard error of the mean.

Multiply the t-table value by the standard error to get the margin of error for your confidence interval.

### t-statistic

t = (sample_mean - pop_mean) / (sample_std_dev / sqrt(n))

Degrees of freedom = sample size - 1.
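
A sketch of the t-statistic arithmetic; the sample values and hypothesised mean are made up:

```python
import math
import statistics

# Made-up sample of 7 measurements; hypothesised population mean of 5.0.
sample = [5.1, 4.8, 5.6, 5.2, 4.9, 5.7, 5.3]
pop_mean = 5.0

n = len(sample)
s = statistics.stdev(sample)  # sample std dev (n - 1 denominator)
sem = s / math.sqrt(n)        # standard error of the mean

t = (statistics.mean(sample) - pop_mean) / sem
df = n - 1                    # degrees of freedom = 6
print(t, df)
```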

## Standard Error of the Mean

The denominator in the z-statistic and t-statistic: SEM = sigma / sqrt(n), or s / sqrt(n) when using the sample standard deviation.

### Difference of means

Make a distribution based on the two samples. Z = X - Y

The confidence interval of 95% would be on the distribution based on the two samples. If you are using a normal distribution then you can use a z-table and find the 95% interval. A 95% confidence interval is +- 1.96 std devs away from the mean.

To work out the standard deviation, remember that var(Z) = var(X) + var(Y) when the samples are independent.

If each sample size is greater than 30, you can approximate the population standard deviation using the sample standard deviation.
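
A simulation sketch of a 95% confidence interval for a difference of means; the samples are randomly generated under assumed group means of 10 and 9 (the seed and parameters are arbitrary):

```python
import math
import random
import statistics

random.seed(2)

# Two made-up independent samples of size 50 (big enough to use z).
x = [random.gauss(10, 2) for _ in range(50)]
y = [random.gauss(9, 2) for _ in range(50)]

diff = statistics.mean(x) - statistics.mean(y)

# var of the difference of means = var(X)/n_x + var(Y)/n_y (independent samples)
se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))

# 95% confidence interval: +- 1.96 standard errors around the observed difference
print(diff - 1.96 * se, diff + 1.96 * se)
```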

## Confidence Interval

You are not 100% sure, but you can say you are 95% confident of something.

It is a range estimate for a population parameter.

## Parametric vs non-Parametric test

Parametric tests assume something about the population (e.g. that it is normally distributed).

A non-parametric test does not make an assumption.

## Mann-Whitney-Wilcoxon rank-sum test

A non-parametric test comparing the medians of two samples.

## Kolmogorov-Smirnov test

Tests whether two samples are drawn from the same distribution.

## Pearson's chi square test (goodness of fit)

Compares an observed distribution against an expected one.

patrons per day of week example at Khan Academy.

H_0 can be the expected proportions

Given observed counts

Calculate total observed counts

Calculate expected counts using the expected proportions.

chi-square statistic = X^2 = sum( (observed - expected)^2 / expected )

e.g. if you expected 30 patrons on Monday and observed 45, then Monday's contribution will be (45-30)^2 / 30

= 15^2 / 30

= 225 / 30

= 7.5

Then do this for each day of the week, sum the all up, and that is the chi-square statistic.

Degrees of freedom is n-1, so if your restaurant was open 6 days of the week, df = 5.
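
The whole calculation as a sketch, with made-up patron counts for a restaurant open 6 days a week:

```python
# Expected and observed patron counts per day (made-up numbers).
expected = [30, 30, 30, 40, 60, 50]
observed = [45, 25, 30, 35, 70, 35]

# chi-square statistic: sum of (observed - expected)^2 / expected
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(expected) - 1

print(chi_square)  # Monday alone contributes (45 - 30)**2 / 30 = 7.5
print(df)          # 5
```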

Another example of contingency table from Khan Academy.

### Contingency table

Contingency tables hold frequencies of occurrence of events in mutually exclusive categories from two or more samples

table with sub totals for each row and column, and total.

df (degrees of freedom) is usually (rows -1)*(cols - 1). This is because subtotals are fixed, and the last value can always be derived.

## Fisher's exact test

http://en.wikipedia.org/wiki/Fisher's_exact_test

## unsorted

http://my.ilstu.edu/~wjschne/138/Psychology138Lab12.html