# Stats and Probability Theory

How to choose a statistical model?

Are my data normally distributed?

Problems:

• Excess kurtosis (fourth moment; heavy tails caused by extreme values far from the mean)
• Excess skewness (third moment; lopsided)
• Others: lognormal (an RV whose logarithm is normally distributed), uniform, Weibull, exponential…
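As a rough illustration of the two moment-based problems above, the sketch below computes skewness and excess kurtosis from the (biased) central moments. The function name `skew_and_excess_kurtosis` and the toy data are made up for this example; other estimators with small-sample corrections exist.

```python
def skew_and_excess_kurtosis(xs):
    """Return (skewness, excess kurtosis) from biased central moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0  # a normal distribution has excess kurtosis 0
    return skew, excess_kurt

symmetric = [-2, -1, 0, 1, 2]     # toy data, symmetric about 0
right_skewed = [1, 1, 1, 2, 10]   # toy data with one large value

print(skew_and_excess_kurtosis(symmetric))
print(skew_and_excess_kurtosis(right_skewed))
```

A skewness far from 0 or an excess kurtosis far from 0 is a hint (not a proof) that the normal model is a poor fit.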

Routine:

• Histogram (largely depends on the bin size)
• Stem and leaf plots
• Box plot: symmetrical; Q1 and Q3 the same distance from the median; “whiskers” the same length
• PP plot (compare the cumulative probability of the empirical data to that of the “test” distribution): straight line
• QQ plot (compare quantiles of the empirical data to the theoretical quantiles): straight line
• Hypothesis testing: Shapiro–Wilk test, Anderson–Darling test
• How to deal with multivariate data?
• What is the distribution of the data?
• How to normalize the data: square root, quarter root, log
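To make the QQ-plot and log-transform items concrete, here is a minimal sketch that builds the QQ points by hand and summarises straightness by the correlation of those points. The plotting position $(i + 0.5)/n$, the helper names, and the simulated lognormal data are assumptions for illustration only.

```python
import math
import random
from statistics import NormalDist

def qq_points(xs):
    """Pair each sorted observation with the standard normal quantile
    at plotting position (i + 0.5) / n."""
    n = len(xs)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(xs)))

def qq_correlation(xs):
    """Pearson correlation of the QQ points; close to 1 means a nearly straight line."""
    pts = qq_points(xs)
    n = len(pts)
    mt = sum(t for t, _ in pts) / n
    mx = sum(x for _, x in pts) / n
    sxy = sum((t - mt) * (x - mx) for t, x in pts)
    stt = sum((t - mt) ** 2 for t, _ in pts)
    sxx = sum((x - mx) ** 2 for _, x in pts)
    return sxy / math.sqrt(stt * sxx)

rng = random.Random(0)  # fixed seed so the run is repeatable
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(500)]  # simulated lognormal data
logged = [math.log(x) for x in raw]                        # log transform

r_raw = qq_correlation(raw)
r_log = qq_correlation(logged)
print(f"QQ correlation, raw data:  {r_raw:.3f}")
print(f"QQ correlation, after log: {r_log:.3f}")
```

The log-transformed data should lie much closer to a straight line, which is what the "log" normalization in the list is meant to achieve for lognormal data.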

DISTRIBUTIONS

STATISTICS

Correlation, Partial correlation

BASICS

• Samples and population
• Central limit theorem: for a population with mean $\mu$ and finite standard deviation $\sigma$, the distribution of sample means is approximately normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$ when the sample size $n$ is large (a common rule of thumb is $n \ge 30$), regardless of the shape of the population. In practice we often do not know $\mu$ or $\sigma$. In those situations we estimate $\mu$ and $\sigma$ with $\bar{x}$ and $s$ respectively.
• Sampling distributions
• Law of large numbers

weak law of large numbers: the sample mean converges in probability to $\mu$

strong law of large numbers: the sample mean converges almost surely to $\mu$
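A small simulation can illustrate the central limit theorem statement. This is only a sketch: the exponential population (mean 1, standard deviation 1), the sample size of 30, and the number of repetitions are arbitrary choices for the example.

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(42)  # fixed seed for repeatability
n = 30                   # size of each sample
reps = 5000              # number of simulated samples

# Exponential(1) is strongly skewed, with mu = 1 and sigma = 1.
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

print("mean of sample means:", round(mean(sample_means), 4))
print("sd of sample means:  ", round(stdev(sample_means), 4))
print("sigma / sqrt(n):     ", round(1.0 / math.sqrt(n), 4))
```

The mean of the sample means should be close to $\mu$ and their standard deviation close to $\sigma/\sqrt{n}$, even though the population itself is far from normal.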

ESTIMATION/INFERENCE (what is the value of the parameter):

“We can be 95% confident that the population mean falls between L and U.”

1. Confidence interval for one mean

1.1 Confidence interval for one mean (known $\sigma$) z-interval

Probability statements are about random variables. The population mean μ is a constant, not a random variable. It makes no sense to make a probability statement about a constant that doesn’t change.

The length of a confidence interval

As the confidence level decreases, the length of the interval decreases. So, for this factor, we have a bit of a tradeoff! We want a high confidence level, but not so high as to produce such a wide interval as to be useless. That’s why 95% is the most common confidence level used.
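The tradeoff between confidence level and interval length can be seen in a short sketch of the z-interval $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$. The numbers (sample mean 50, known $\sigma = 10$, $n = 25$) are invented for illustration.

```python
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    """Two-sided z-interval for a mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical sample summary.
for level in (0.90, 0.95, 0.99):
    lower, upper = z_interval(xbar=50.0, sigma=10.0, n=25, confidence=level)
    print(f"{level:.0%}: ({lower:.2f}, {upper:.2f}), length {upper - lower:.2f}")
```

Lowering the confidence level shortens the interval; increasing $n$ shortens it as well without giving up confidence.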

1.2 Confidence interval for one mean (unknown $\sigma$) t-interval

Nonnormal data:

2. Confidence interval for two means

2.1 Two-sample pooled t-interval (the two populations have the same $\sigma$)

2.2 Welch’s t-interval when variances are not equal

2.3 Paired t-interval

Taking the difference within each pair removes the pair (e.g., twin) effect.

4. Confidence interval for variances

4.1 One variance

4.2 Two variances

5. Confidence interval for proportions

5.1 One proportion

5.2 Two proportions

6. Sample sizes

6.1 Estimating a mean

6.2 Estimating a proportion for a large population

6.3 Estimating a proportion for a small, finite population
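The notes list the three sample-size cases without formulas, so the sketch below assumes the usual textbook ones: $n = (z\sigma/\varepsilon)^2$ for a mean, $m = z^2 p(1-p)/\varepsilon^2$ for a proportion in a large population, and $n = m / (1 + (m-1)/N)$ for a finite population of size $N$. The function names and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist

def z_value(confidence):
    """Two-sided critical value z_{alpha/2}."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def n_for_mean(sigma, margin, confidence=0.95):
    """Sample size so that the z-interval half-width is at most `margin`."""
    return math.ceil((z_value(confidence) * sigma / margin) ** 2)

def n_for_proportion(margin, p=0.5, confidence=0.95):
    """Large-population case; p = 0.5 is the most conservative guess."""
    return math.ceil(z_value(confidence) ** 2 * p * (1 - p) / margin ** 2)

def n_for_proportion_finite(population, margin, p=0.5, confidence=0.95):
    """Small, finite population: shrink the large-population size m."""
    m = z_value(confidence) ** 2 * p * (1 - p) / margin ** 2
    return math.ceil(m / (1 + (m - 1) / population))

print(n_for_mean(sigma=15, margin=2))
print(n_for_proportion(margin=0.03))
print(n_for_proportion_finite(population=2000, margin=0.03))
```

The finite-population size is always no larger than the large-population size, which is the point of treating the two cases separately.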

HYPOTHESIS TESTING: is the value of the parameter θ such and such?

(1) We’ll make an initial assumption about the population parameter (null hypothesis).

(2) We’ll collect evidence or else use somebody else’s evidence (in either case, our evidence will come in the form of data).

(3) Based on the available evidence (data), we’ll decide whether to “reject” or “not reject” our initial assumption.

1. State H0 ($p = ...$) and H1 ($p > ...$).
2. Calculate the test statistic, which (for a large-sample test about a proportion) approximately follows a standard normal distribution under H0.
3. If using the critical value method: determine the critical (rejection) region. The “size” of the critical region is 0.05 if the significance level of the test is 0.05.
4. Make a decision: if the test statistic lies in the rejection region, we reject the null hypothesis, because under the null hypothesis our sample would be too extreme to be plausible.
5. If using the p-value approach: the p-value is the probability, under H0, of a result at least as extreme as the one observed. It is the smallest $\alpha$-level that would lead to rejection.
6. Make a decision: if p-value < $\alpha$, then reject H0. The smaller the p-value, the stronger the evidence against H0, because it tells the investigator that the hypothesis under consideration may not adequately explain the observation.
7. Errors: Type I (false positive) probability $\alpha = P(\widehat{H_1} \mid H_0)$ = significance level of the test.
8. Two-tailed test: if using the p-value approach, multiply the one-tailed p-value by 2; if using the rejection region approach, divide $\alpha$ by 2 and put half in each tail.
9. Comparing two proportions
1. Z-test: when the population variance is known and the population mean is unknown (unrealistic)
2. One-sample t-test: when the population variance and the population mean are both unknown. Degrees of freedom: $n-1$. A 95% confidence interval for the mean $\mu$ is $\bar{x} \pm t_{0.025,\,n-1}\, s/\sqrt{n}$; we can be 95% confident that the mean is in this interval.
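The steps above can be sketched for a one-proportion z-test, showing that the critical-value and p-value approaches give the same decision. The statistic $z = (\hat{p} - p_0)/\sqrt{p_0(1-p_0)/n}$ is the usual large-sample one; the function name and the data (60 successes in 100 trials, $p_0 = 0.5$) are made up for the example.

```python
import math
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0, alpha=0.05, alternative="greater"):
    """Return (z, p_value, reject) for H0: p = p0."""
    p_hat = successes / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    std = NormalDist()
    if alternative == "greater":
        p_value = 1 - std.cdf(z)
        reject = z > std.inv_cdf(1 - alpha)            # upper-tail rejection region
    elif alternative == "two-sided":
        p_value = 2 * (1 - std.cdf(abs(z)))            # double the one-tailed p-value
        reject = abs(z) > std.inv_cdf(1 - alpha / 2)   # alpha/2 in each tail
    else:
        raise ValueError("alternative must be 'greater' or 'two-sided'")
    return z, p_value, reject

z, p_value, reject = one_proportion_z_test(60, 100, 0.5)
print(f"one-sided: z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {reject}")

z2, p2, reject2 = one_proportion_z_test(60, 100, 0.5, alternative="two-sided")
print(f"two-sided: z = {z2:.3f}, p-value = {p2:.4f}, reject H0: {reject2}")
```

In both branches, "test statistic in the rejection region" and "p-value below $\alpha$" are equivalent statements.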

two sample t-test (independent)

For paired (dependent) samples, we can “remove” the dependence between X and Y by subtracting the two measurements $X_i$ and $Y_i$ for each pair of twins $i$, that is, by considering the independent differences $d_i = X_i - Y_i$.

test statistic:
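For the paired case, a minimal sketch of the statistic computed from the within-pair differences, assuming the standard form $t = \bar{d} / (s_d/\sqrt{n})$ with $n-1$ degrees of freedom (the formula itself is not shown in these notes). The twin measurements are invented.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(xs, ys):
    """t statistic and degrees of freedom for H0: mean difference = 0."""
    diffs = [x - y for x, y in zip(xs, ys)]  # one difference per pair
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical measurements on five pairs of twins.
twin_x = [10, 12, 9, 11, 13]
twin_y = [8, 11, 7, 10, 10]

t_stat, df = paired_t_statistic(twin_x, twin_y)
print(f"t = {t_stat:.4f} with {df} degrees of freedom")
```

The statistic would then be compared with a t distribution with `df` degrees of freedom, for example from a table.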

• Test of the equality of two means (independent, unpaired)
1. When population variances are equal,  pooled two-sample t-test

The test statistic follows a $t_{n+m-2}$ distribution under the null hypothesis of equal means: $T = \dfrac{\bar{X} - \bar{Y}}{S_p\sqrt{1/n + 1/m}}$

$S_p^2$, the pooled sample variance, is an unbiased estimator of the common variance $\sigma^2$.

2. When population variances are not equal, Welch’s t-test

The test statistic follows a $t_r$ distribution. If $r$ is not an integer, as is usually the case, we take the integer portion of $r$. That is, we use $\lfloor r \rfloor$ if necessary.
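The two unpaired tests can be compared side by side. The sketch assumes the usual pooled variance $S_p^2 = \frac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}$ and the Welch-Satterthwaite approximation for $r$; the data and function names are illustrative only.

```python
import math
from statistics import mean, variance

def pooled_t(xs, ys):
    """Pooled two-sample t statistic and its n + m - 2 degrees of freedom."""
    n, m = len(xs), len(ys)
    sp2 = ((n - 1) * variance(xs) + (m - 1) * variance(ys)) / (n + m - 2)
    t = (mean(xs) - mean(ys)) / math.sqrt(sp2 * (1 / n + 1 / m))
    return t, n + m - 2

def welch_t(xs, ys):
    """Welch t statistic and the (generally non-integer) degrees of freedom r."""
    n, m = len(xs), len(ys)
    vx, vy = variance(xs) / n, variance(ys) / m
    t = (mean(xs) - mean(ys)) / math.sqrt(vx + vy)
    r = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))
    return t, r

# Toy samples with clearly different spreads.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10, 12]

t_pooled, df_pooled = pooled_t(x, y)
t_welch, r = welch_t(x, y)
print(f"pooled: t = {t_pooled:.4f}, df = {df_pooled}")
print(f"Welch:  t = {t_welch:.4f}, r = {r:.4f}, integer part = {math.floor(r)}")
```

With unequal variances the Welch degrees of freedom fall below $n+m-2$, which makes the test more conservative than the pooled version.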

• Tests for variances
1. One variance

The test statistic follows a chi-square distribution with n−1 degrees of freedom.

2. Two variances

The test statistic follows an F distribution with $n-1$ numerator degrees of freedom and $m-1$ denominator degrees of freedom.
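A short sketch of both variance statistics, assuming the standard forms $\chi^2 = (n-1)S^2/\sigma_0^2$ for one variance and $F = S_X^2/S_Y^2$ for two. The samples and the hypothesised value $\sigma_0^2 = 2$ are invented.

```python
from statistics import variance

def one_variance_statistic(xs, sigma0_sq):
    """Chi-square statistic for H0: sigma^2 = sigma0_sq, with n - 1 df."""
    n = len(xs)
    return (n - 1) * variance(xs) / sigma0_sq, n - 1

def two_variance_statistic(xs, ys):
    """F statistic for H0: equal variances, with (n - 1, m - 1) df."""
    return variance(xs) / variance(ys), (len(xs) - 1, len(ys) - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10, 12]

chi2, chi2_df = one_variance_statistic(x, sigma0_sq=2.0)
f_stat, f_df = two_variance_statistic(x, y)
print(f"chi-square = {chi2:.3f} with {chi2_df} df")
print(f"F = {f_stat:.4f} with {f_df} df")
```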

• Tests Concerning Regression and Correlation

1. Test for slope
2. Tests for correlation

If $(X_i, Y_i)$ follows a bivariate normal distribution, then testing for the independence of X and Y is equivalent to testing whether the correlation coefficient $\rho$ equals 0, that is, testing $H_0: \rho = 0$.
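One common way to test $H_0: \rho = 0$ is the statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$, which follows a $t_{n-2}$ distribution under $H_0$ for bivariate normal data. The sketch below assumes that form; the data are a toy example.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_t_statistic(xs, ys):
    """t statistic and degrees of freedom for H0: rho = 0."""
    n = len(xs)
    r = pearson_r(xs, ys)
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2), n - 2

xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

r_xy = pearson_r(xs, ys)
t_corr, df_corr = correlation_t_statistic(xs, ys)
print(f"r = {r_xy:.3f}, t = {t_corr:.4f}, df = {df_corr}")
```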

Type I error occurs if we reject the null hypothesis H0 (in favor of the alternative hypothesis HA) when the null hypothesis H0 is true. We denote α = P(Type I Error).

Type II error occurs if we fail to reject the null hypothesis H0 when the alternative hypothesis HA is true. We denote β = P(Type II Error).

1. Power = 1 − β = 1 − P(Type II error), i.e. the probability of a true positive
2. Calculate sample size
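Power and sample size can be sketched for the simplest case, a one-sided z-test of $H_0: \mu = \mu_0$ against $\mu = \mu_1 > \mu_0$ with known $\sigma$. The formulas used, power $= 1 - \Phi(z_\alpha - (\mu_1-\mu_0)\sqrt{n}/\sigma)$ and $n = ((z_\alpha + z_\beta)\sigma/(\mu_1-\mu_0))^2$, are the standard ones for this setting; the numbers are made up.

```python
import math
from statistics import NormalDist

STD = NormalDist()

def power_one_sided_z(mu0, mu1, sigma, n, alpha=0.05):
    """Probability of rejecting H0: mu = mu0 when the true mean is mu1 > mu0."""
    z_alpha = STD.inv_cdf(1 - alpha)
    shift = (mu1 - mu0) * math.sqrt(n) / sigma
    return 1 - STD.cdf(z_alpha - shift)

def n_for_power(mu0, mu1, sigma, alpha=0.05, power=0.80):
    """Smallest n whose power at mu1 reaches the target."""
    z_alpha = STD.inv_cdf(1 - alpha)
    z_beta = STD.inv_cdf(power)  # beta = 1 - power
    return math.ceil(((z_alpha + z_beta) * sigma / (mu1 - mu0)) ** 2)

# Hypothetical design: detect a shift from 100 to 105 when sigma = 16.
n_needed = n_for_power(100, 105, 16)
print("required n:", n_needed)
print("power at that n:", round(power_one_sided_z(100, 105, 16, n_needed), 4))
```

Power grows with the sample size, the effect size $\mu_1 - \mu_0$, and $\alpha$, and falls as $\sigma$ grows.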

Every hypothesis test that we derived in the hypothesis testing section is a likelihood ratio test.

ANOVA (Analysis of Variance)

group/treatment/factor: the factor is the feature being varied; the groups (treatments) correspond to the different values of that feature

• One factor ANOVA: use the analysis of variance method to test the equality of the (unknown) means $\mu_1, \mu_2, \dots, \mu_m$ of $m$ normal distributions with an unknown but common variance $\sigma^2$. What if an assumption is violated?
1. Normality: (1) transform the data so that the distributions become approximately normal, or (2) use the nonparametric Kruskal-Wallis H test, which does not require the assumption of normality.
2. Homogeneity of variances: (1) Welch’s ANOVA or (2) the Brown-Forsythe test.
• Testing the equality of more than two means: if the variance between groups is close to the variance within groups, the data are consistent with H0 (the means are the same). If the variance between groups is much larger than the variance within groups, we conclude that the means are not all the same. In one-factor ANOVA there is only one factor/treatment affecting the data.

ANOVA table
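The entries of a one-factor ANOVA table (sums of squares, degrees of freedom, mean squares, and the F ratio) can be computed by hand as in the sketch below. The three groups are toy data and the dictionary keys are just labels chosen for this example.

```python
def one_way_anova(groups):
    """Return the between/within decomposition and the F ratio."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "df_between": df_between, "ms_between": ms_between,
        "ss_within": ss_within, "df_within": df_within, "ms_within": ms_within,
        "F": ms_between / ms_within,
    }

table = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
for key, value in table.items():
    print(f"{key:>10}: {value:.4f}")
```

A large F ratio means the between-group variation dominates the within-group variation, which is evidence against equal means.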

Post hoc tests (a posteriori tests): if we want to know which two means are different.

Why not use multiple t-tests: we would need many t-tests instead of one ANOVA, and every additional test increases the chance of making at least one Type I error (the family-wise error rate).
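The error inflation can be quantified under the simplifying assumption that the pairwise tests are independent (they are not exactly, so this is only a rough guide): with $k$ tests at level $\alpha$, the chance of at least one false positive is $1 - (1-\alpha)^k$.

```python
import math

def familywise_error_rate(num_tests, alpha=0.05):
    """P(at least one Type I error) assuming independent tests."""
    return 1 - (1 - alpha) ** num_tests

groups = 5                                # hypothetical number of groups
pairwise_tests = math.comb(groups, 2)     # number of pairwise t-tests
fwer = familywise_error_rate(pairwise_tests)
bonferroni_alpha = 0.05 / pairwise_tests  # per-test level under a Bonferroni correction

print(f"{pairwise_tests} pairwise tests, family-wise error rate about {fwer:.3f}")
print(f"Bonferroni per-test alpha: {bonferroni_alpha:.4f}")
```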

When testing these hypotheses, the important thing to remember is that we have to evaluate the significance of the interaction as our first step in looking at the output. If the interaction is significant, we can’t do much about interpreting the main effects.

• MANOVA (Multivariate Analysis of Variance)

data on p variables

Assumptions

Test statistics

NONPARAMETRIC METHODS

• Chi-square test (how well do the data “fit” the probability model; is the sample representative of the population?)

$Q_1$ is approximately distributed as chi-square with one degree of freedom; the expected number of successes must be at least 5 (that is, $np_1 \ge 5$) and the expected number of failures must be at least 5 (that is, $n(1-p_1) \ge 5$), because the result relies on the normal approximation from the central limit theorem.

Extension to K categories:
• Contingency table
1. Homogeneity (whether two or more multinomial distributions are equal)

If there are more than two samples, that is, if $h > 2$, the chi-square statistic follows an approximate chi-square distribution with $h(k-1) - (k-1) = (h-1)(k-1)$ degrees of freedom. #parameters???

2. Independence: testing the independence of two categorical variables

The sampling schemes???

(kh−1)−(h+k−2) = (h−1)(k − 1) degrees of freedom. #parameters???
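Both chi-square statistics can be computed with a few lines. The sketch assumes the usual Pearson form $\sum (O - E)^2 / E$, with expected counts $n p_i$ for goodness of fit and (row total × column total) / $n$ for a contingency table; the counts are invented.

```python
def chi_square_goodness_of_fit(observed, probs):
    """Pearson statistic and k - 1 degrees of freedom for k categories."""
    n = sum(observed)
    stat = sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))
    return stat, len(observed) - 1

def chi_square_contingency(table):
    """Pearson statistic and (h - 1)(k - 1) degrees of freedom for an h x k table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, (len(table) - 1) * (len(table[0]) - 1)

gof_stat, gof_df = chi_square_goodness_of_fit([18, 22, 20], [1 / 3, 1 / 3, 1 / 3])
ct_stat, ct_df = chi_square_contingency([[10, 20], [20, 10]])
print(f"goodness of fit: Q = {gof_stat:.3f}, df = {gof_df}")
print(f"contingency:     Q = {ct_stat:.3f}, df = {ct_df}")
```

The same contingency-table statistic serves for both the homogeneity and the independence test; only the sampling scheme and the interpretation differ.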

How to impose sparsity on a model:

1. L1 penalty
2. L2 penalty (shrinks coefficients but does not by itself produce exact zeros)
3. Laplace prior
4. Factorized Laplace
5. Cauchy prior
6. Student-t prior
7. Spike and slab prior
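To see why the L1 penalty gives sparsity while the L2 penalty only shrinks, consider the simplified one-coefficient problems $\min_b \tfrac12 (b - \hat{b})^2 + \lambda |b|$ and $\min_b \tfrac12 (b - \hat{b})^2 + \tfrac{\lambda}{2} b^2$, where $\hat{b}$ is the unpenalised estimate. Their closed-form solutions are soft-thresholding and proportional shrinkage. This setup is an assumption chosen for illustration, not a general fitting routine.

```python
import math

def l1_solution(b_hat, lam):
    """Soft-thresholding: exact zero whenever |b_hat| <= lam."""
    return math.copysign(max(abs(b_hat) - lam, 0.0), b_hat)

def l2_solution(b_hat, lam):
    """Proportional shrinkage: never exactly zero unless b_hat is zero."""
    return b_hat / (1.0 + lam)

estimates = [2.0, 0.3, -0.1, -1.5]  # hypothetical unpenalised coefficients
lam = 0.5

l1_fit = [l1_solution(b, lam) for b in estimates]
l2_fit = [l2_solution(b, lam) for b in estimates]
print("L1:", l1_fit)
print("L2:", l2_fit)
```

The Laplace prior corresponds to the L1 penalty when the posterior mode is used as the estimate, which is why both appear in the same list.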

Norm

L infinity

Bayesian

Conjugate Prior: what is conjugate prior, conjugate prior table
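A prior is conjugate to a likelihood when the posterior belongs to the same family as the prior. The Beta prior with a binomial likelihood is probably the best-known row of any conjugate prior table: Beta($a$, $b$) plus $k$ successes in $n$ trials gives Beta($a + k$, $b + n - k$). The sketch below shows only that update; the function name and the data are made up.

```python
def beta_binomial_update(a, b, successes, trials):
    """Posterior Beta parameters after observing binomial data."""
    return a + successes, b + (trials - successes)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

prior = (1.0, 1.0)  # uniform prior on the success probability
posterior = beta_binomial_update(*prior, successes=7, trials=10)

print("posterior parameters:", posterior)
print("posterior mean:", round(beta_mean(*posterior), 4))
```

No numerical integration is needed, which is the practical appeal of conjugate pairs.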