How to choose a statistical model?
Are my data normally distributed?
Problems:
 Excess kurtosis (fourth moment; heavy tails caused by extreme values far from the mean)
 Excess skewness (third moment, lopsided)
 Others: lognormal (a random variable whose logarithm is normally distributed), uniform, Weibull, exponential…
Routine:
 Histogram (shape depends heavily on the bin size)
 Stem and leaf plots
 Box plot: symmetrical; Q1 and Q3 the same distance from the median; “whiskers” the same length
 PP plot (compare the cumulative probability of the empirical data to the “test” distribution): a straight line indicates a good fit
 QQ plot (compare the quantiles of the empirical data to the theoretical quantiles): a straight line again indicates a good fit
 Hypothesis testing: Shapiro–Wilk test, Anderson–Darling test
 How to deal with multivariate data?
 What is the distribution of the data?
 How to normalize the data: square root, fourth root, log
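Most of the routine above can be run in a few lines. This is a minimal sketch, assuming scipy and numpy are available; the data here are simulated for illustration:

```python
# Sketch of a normality-checking routine (assumes scipy/numpy are available).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=2, size=200)   # sample to check

# Shapiro-Wilk: H0 = data come from a normal distribution
w, p = stats.shapiro(x)

# Skewness (3rd moment) and excess kurtosis (4th moment)
skew = stats.skew(x)
kurt = stats.kurtosis(x)    # Fisher definition: 0 for a normal

# A log transform can normalize right-skewed (lognormal) data
y = rng.lognormal(mean=0, sigma=1, size=200)
_, p_raw = stats.shapiro(y)           # should reject normality
_, p_log = stats.shapiro(np.log(y))   # log(y) is normal by construction
print(p, p_raw, p_log)
```

A small p-value from `shapiro` rejects normality; the log-transformed lognormal sample illustrates the normalization step from the routine.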
DISTRIBUTIONS
 Discrete Distributions
 Continuous Distributions
 Bivariate Distributions
 Distributions of Functions of Random Variables
 F distribution
STATISTICS
Correlation, Partial correlation
BASICS
 Samples and population
 Central limit theorem: with a sample size of at least 30, the distribution of sample means is approximately normal regardless of the shape of the population, with mean μ and standard deviation σ/√n. In practice we often do not know μ or σ; in those situations we estimate them with the sample mean x̄ and the sample standard deviation s, respectively.
 Sampling distributions
 Law of large numbers
weak law of large numbers: the sample mean converges in probability to μ
strong law of large numbers: the sample mean converges almost surely to μ
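The CLT is easy to verify by simulation. A stdlib-only sketch: means of samples of size 30 from a heavily skewed population (exponential with μ = σ = 1) cluster around μ with spread about σ/√n:

```python
# Sketch: sample means from a skewed population (exponential, mu = sigma = 1)
# cluster around mu with spread ~ sigma/sqrt(n), per the CLT.
import random, statistics, math

random.seed(1)
n = 30          # sample size
reps = 2000     # number of repeated samples

means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

print(statistics.fmean(means))   # approximately mu = 1
print(statistics.stdev(means))   # approximately sigma/sqrt(30) ~ 0.18
```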
ESTIMATION/INFERENCE (what is the value of the parameter):
 Point Estimation (MLE and Method of Moments)
 Interval Estimation (how good the point estimation is)
“We can be 95% confident that the population mean falls between L and U.”
 1.1 Confidence interval for one mean (known σ): z-interval
Probability statements are about random variables. The population mean μ is a constant, not a random variable, so it makes no sense to make a probability statement about it.
The length of a confidence interval
As the confidence level decreases, the length of the interval decreases. So, for this factor, we have a bit of a tradeoff! We want a high confidence level, but not so high as to produce such a wide interval as to be useless. That’s why 95% is the most common confidence level used.
1.2 Confidence interval for one mean (unknown σ): t-interval
Non-normal data:
2. Confidence interval for two means
2.1 Two-sample pooled t-interval (the two populations share a common variance σ²)
2.2 Welch’s t-interval (when the variances are not equal)
2.3 Paired t-interval
Working with the within-pair differences removes the pair (e.g., twin) effect???
4. Confidence interval for variances
4.1 One variance
4.2 Two variances
5. Confidence interval for proportions
5.1 One proportion
5.2 Two proportions
6. Sample sizes
6.1 Estimating a mean
6.2 Estimating a proportion for a large population
6.3 Estimating a proportion for a small, finite population
 Confidence interval relation to the two-tailed proportion test
 Distribution-free CIs for percentiles
 A confidence band
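The workhorse here is the t-interval (1.2 above). A sketch with scipy (assumed available), on made-up data, together with the hand formula x̄ ± t_{α/2, n−1} · s/√n:

```python
# Sketch: 95% t-interval for one mean with unknown sigma (assumes scipy/numpy).
import numpy as np
from scipy import stats

x = np.array([9.8, 10.2, 10.4, 9.9, 10.1, 10.3, 9.7, 10.0])  # illustrative data
n = len(x)
xbar, s = x.mean(), x.std(ddof=1)

lo, hi = stats.t.interval(0.95, n - 1, loc=xbar, scale=s / np.sqrt(n))

# Equivalent by hand: xbar +/- t_{0.025, n-1} * s / sqrt(n)
t_crit = stats.t.ppf(0.975, n - 1)
assert np.isclose(lo, xbar - t_crit * s / np.sqrt(n))
print(lo, hi)   # "we can be 95% confident the population mean lies in (lo, hi)"
```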
HYPOTHESIS TESTING: is the value of the parameter θ such and such?
(1) We’ll make an initial assumption about the population parameter (null hypothesis).
(2) We’ll collect evidence or else use somebody else’s evidence (in either case, our evidence will come in the form of data).
(3) Based on the available evidence (data), we’ll decide whether to “reject” or “not reject” our initial assumption.
 Tests about proportions (z-score), one-tailed test
 state H0 and H1
 calculate the test statistic, which is distributed as a standard normal under H0
 if using the critical-value method: determine the critical/rejection region (the “size” of the critical region is 0.05 if the significance level of the test is 0.05)
 make a decision: if the test statistic lies in the rejection region, reject the null hypothesis, because under the null hypothesis our observation/sample would be too extreme
 if using the p-value approach: the p-value is the smallest α-level that would lead to rejection. It is the probability of the observed (or a more extreme) result under H0.
 make a decision: if p-value < α, reject H0. The smaller the p-value, the stronger the evidence that the hypothesis under consideration does not adequately explain the observation.
 errors: Type I (false positive): α = P(reject H0 | H0 true) = significance level of the test
 two-tailed test: if using the p-value approach, multiply the one-tailed p-value by 2; if using the rejection-region approach, split α in half (α/2 in each tail).
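The steps above can be sketched with the stdlib alone. An illustrative two-sided z-test of H0: p = 0.5 with 60 successes out of 100:

```python
# Sketch: two-sided z-test for one proportion, H0: p = 0.5 (stdlib only).
from statistics import NormalDist
import math

n, successes = 100, 60
p0 = 0.5
phat = successes / n

# Test statistic ~ N(0, 1) under H0 (requires n*p0 >= 5 and n*(1-p0) >= 5)
z = (phat - p0) / math.sqrt(p0 * (1 - p0) / n)

# One-tailed p-value, doubled for the two-tailed test
p_one = 1 - NormalDist().cdf(abs(z))
p_two = 2 * p_one

print(z, p_two)   # z = 2.0, p_two ~ 0.0455 -> reject H0 at alpha = 0.05
```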
 Comparing two proportions
 Tests about one mean
 Z-test: when the population variance is known and the population mean is unknown (unrealistic)
 one-sample t-test: when both the population variance and mean are unknown; degrees of freedom: n − 1. A 95% confidence interval for the mean μ is x̄ ± t_{0.025, n−1} · s/√n; we can be 95% confident that the mean lies in this interval.
two-sample t-test (independent)
we can “remove” the dependence between X and Y by subtracting the two measurements X_i and Y_i for each pair of twins i, that is, by considering the independent differences d_i = X_i − Y_i.
test statistic: t = d̄ / (s_d / √n), with n − 1 degrees of freedom
 Test of the equality of two means (independent, unpaired)
 When population variances are equal, pooled two-sample t-test
The test statistic follows a t_{n+m−2} distribution:
S_p² = ((n−1)S_X² + (m−1)S_Y²) / (n + m − 2), the pooled sample variance, is an unbiased estimator of the common variance σ².
2. When population variances are not equal, Welch’s t-test
The test statistic follows a t_{r} distribution. If r doesn’t equal an integer, as it usually doesn’t, then we take the integer portion of r. That is, we use [r] if necessary.
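All three two-sample variants are one call each in scipy (assumed available); the data below are simulated. The paired test is literally the one-sample t-test on the differences:

```python
# Sketch: pooled, Welch, and paired t-tests via scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(10, 2, 25)
y = rng.normal(11, 2, 30)

t_pooled, p_pooled = stats.ttest_ind(x, y, equal_var=True)   # pooled: common sigma^2
t_welch,  p_welch  = stats.ttest_ind(x, y, equal_var=False)  # Welch: unequal variances

# Paired test = one-sample t-test on the differences d_i = x_i - y_i
before = rng.normal(10, 2, 25)
after = before + rng.normal(0.5, 1, 25)            # correlated measurements
t_paired, p_paired = stats.ttest_rel(before, after)
t_diff, p_diff = stats.ttest_1samp(before - after, 0.0)
print(np.isclose(t_paired, t_diff))                # the two are equivalent
```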
 Tests for variances
 One variance
The test statistic (n−1)S²/σ₀² follows a chi-square distribution with n − 1 degrees of freedom.
 Two variances
The test statistic F = S_X²/S_Y² follows an F distribution with n − 1 numerator degrees of freedom and m − 1 denominator degrees of freedom.
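A sketch of both variance tests with scipy (assumed available), on simulated data with true σ² = 4 in each sample; the two-sided p-value doubles the smaller tail:

```python
# Sketch: chi-square test for one variance and F-test for two variances (scipy).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(0, 2, 20)     # n = 20, true variance 4
y = rng.normal(0, 2, 25)     # m = 25, true variance 4

# One variance, H0: sigma^2 = 4. Statistic ~ chi-square with n-1 df.
n, m = len(x), len(y)
chi2 = (n - 1) * x.var(ddof=1) / 4.0
p_chi2 = 2 * min(stats.chi2.cdf(chi2, n - 1), stats.chi2.sf(chi2, n - 1))

# Two variances, H0: sigma_x^2 = sigma_y^2. Statistic ~ F(n-1, m-1).
F = x.var(ddof=1) / y.var(ddof=1)
p_F = 2 * min(stats.f.cdf(F, n - 1, m - 1), stats.f.sf(F, n - 1, m - 1))
print(p_chi2, p_F)
```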

Tests Concerning Regression and Correlation
 Test for slope
 Tests for correlation
If (X_{i}, Y_{i}) follows a bivariate normal distribution, then testing for the independence of X and Y is equivalent to testing whether the correlation coefficient equals 0. Testing assuming (X_{i}, Y_{i}) follows a bivariate normal distribution.
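A sketch of the correlation test with scipy (assumed available), on simulated bivariate-normal-like data; `pearsonr` returns the sample correlation r and the p-value for H0: ρ = 0:

```python
# Sketch: testing rho = 0 via the sample correlation coefficient (scipy).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(size=50)
y = 0.6 * x + rng.normal(scale=0.8, size=50)   # built to correlate with x

r, p = stats.pearsonr(x, y)   # H0: rho = 0 (bivariate normality assumed)
print(r, p)
```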
A Type I error occurs if we reject the null hypothesis H_0 (in favor of the alternative hypothesis H_A) when the null hypothesis H_0 is true. We denote α = P(Type I Error).
A Type II error occurs if we fail to reject the null hypothesis H_0 when the alternative hypothesis H_A is true. We denote β = P(Type II Error).
 Power = 1 − β = 1 − P(Type II error)
 Calculate sample size
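For estimating a mean to within a margin of error E at confidence 1 − α, the standard formula is n = ⌈(z_{α/2} σ / E)²⌉. A stdlib-only sketch (σ here is a guess or pilot estimate):

```python
# Sketch: sample size needed to estimate a mean to within margin E
# at confidence level conf, assuming a known (or guessed) sigma:
#   n = ceil( (z_{alpha/2} * sigma / E)^2 )
from statistics import NormalDist
import math

def sample_size_mean(sigma, margin, conf=0.95):
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    return math.ceil((z * sigma / margin) ** 2)

print(sample_size_mean(sigma=15, margin=3, conf=0.95))   # -> 97
```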
 Likelihood ratio test (null and alternative hypotheses are composite)
every hypothesis test that we derived in the hypothesis testing section is a likelihood ratio test
 Best Critical Regions
 A/B test
 AIC/BIC score
ANOVA (Analysis of Variance)
group/treatment/factor: the feature under study; different groups correspond to different levels of the factor
 One-factor ANOVA: use the analysis-of-variance method to compare the equality of the (unknown) means μ_1, μ_2, …, μ_m of m normal distributions with an unknown but common variance σ². What if the assumptions are broken?
 Normality: (1) transform the data (e.g., log, square root) so that the distributions become approximately normal, or (2) choose the nonparametric Kruskal–Wallis H test, which does not require the assumption of normality.
 Homogeneity of variances: (1) Welch’s ANOVA or (2) the Brown–Forsythe test.
 The test compares the variance between groups to the variance within groups: if they are close, we fail to reject H0 and conclude the means are the same; if the between-group variance is much larger than the within-group variance, we conclude the means are not all equal. There is only one factor/treatment affecting the data.
Post hoc tests (a posteriori tests): if we want to know which two means are different.
Why not use multiple t-tests: comparing m groups pairwise requires many t-tests instead of one ANOVA, and running multiple tests inflates the overall Type I error rate.
When testing these hypotheses, the important thing to remember is to evaluate the significance of the interaction as the first step in looking at the output. If the interaction is significant, the main effects cannot be interpreted on their own.
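A sketch of one-factor ANOVA and its nonparametric fallback with scipy (assumed available), on three simulated groups where one mean is shifted:

```python
# Sketch: one-factor ANOVA and the Kruskal-Wallis fallback (scipy).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
g1 = rng.normal(10, 2, 30)   # three groups with a common variance;
g2 = rng.normal(10, 2, 30)   # group 3 has a shifted mean
g3 = rng.normal(13, 2, 30)

F, p_anova = stats.f_oneway(g1, g2, g3)   # H0: mu1 = mu2 = mu3
H, p_kw = stats.kruskal(g1, g2, g3)       # no normality assumption

print(p_anova, p_kw)   # both small here: the means are not all equal
```

A significant result only says the means differ somewhere; a post hoc test (e.g., Tukey’s HSD) is then needed to find which pairs differ.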
 MANOVA (Multivariate Analysis of Variance)
data on p variables
NONPARAMETRIC METHODS
 Chi-square test (how well do the data “fit” the probability model; is the sample representative of the population?): Q1 is distributed as chi-square with one degree of freedom. The expected number of successes must be at least 5 (that is, np_1 ≥ 5) and the expected number of failures must be at least 5 (that is, n(1−p_1) ≥ 5), because the approximation relies on the central limit theorem. This extends to k categories.
 Contingency table
 Homogeneity (whether two or more multinomial distributions are equal)
If there are more than two samples, that is, if h > 2, the chisquare statistic follows an approximate chisquare distribution with h(k−1) − (k−1) = (h−1)(k − 1) degrees of freedom. #parameters???
2. Independence: testing the independence of two categorical variables
The sampling schemes???
(kh−1)−(h+k−2) = (h−1)(k − 1) degrees of freedom. #parameters???

The Wilcoxon Tests for a Median (no distribution assumption)
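A sketch with scipy (assumed available): to test H0: median = m0, apply the Wilcoxon signed-rank test to the differences x − m0 (only symmetry of the differences about 0 is assumed, not normality); the data are illustrative:

```python
# Sketch: Wilcoxon signed-rank test of H0: median = 10 (scipy).
import numpy as np
from scipy.stats import wilcoxon

x = np.array([10.2, 9.8, 11.5, 10.9, 10.4, 9.6, 12.1, 10.7, 11.0, 10.3])
m0 = 10.0

stat, p = wilcoxon(x - m0)   # H0: the differences are symmetric about 0
print(stat, p)               # small p -> median differs from 10
```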
 Run test and test for randomness: tests whether the distribution functions F(x) and G(y) of two continuous random variables X and Y are equal.

Kolmogorov–Smirnov goodness-of-fit test: how well a hypothesized distribution function F(x) fits the empirical distribution function F_n(x).
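A sketch with scipy (assumed available), fitting a simulated normal sample against a correct and an obviously wrong hypothesized F(x):

```python
# Sketch: Kolmogorov-Smirnov goodness-of-fit test (scipy).
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = rng.normal(0, 1, 200)

# Compare the empirical CDF F_n(x) with a hypothesized F(x)
d_good, p_good = stats.kstest(x, stats.norm(0, 1).cdf)    # true model
d_bad, p_bad = stats.kstest(x, stats.uniform(-3, 6).cdf)  # wrong model

print(p_good, p_bad)   # the wrong model is strongly rejected
```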
How to impose sparsity on a model:
 L1 penalty
 L2 penalty
 Laplace prior
 Factorized Laplace
 Cauchy prior
 Student-t prior
 Spike and slab prior
Norm
L infinity
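Why the L1 penalty produces sparsity while L2 only shrinks is visible in the one-dimensional problem min_w (w − b)²/2 + penalty(w), which has closed-form solutions. A numpy-only sketch (the function names are mine):

```python
# Sketch: L1 penalty zeroes small coefficients; L2 only shrinks them.
# One-dimensional problem: min_w (w - b)^2 / 2 + penalty(w).
import numpy as np

def l1_prox(b, lam):
    # lasso solution: soft-thresholding, exactly zero when |b| <= lam
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def l2_shrink(b, lam):
    # ridge solution: proportional shrinkage, never exactly zero
    return b / (1.0 + 2.0 * lam)

b = np.array([3.0, 0.4, -0.2, 1.5])
print(l1_prox(b, 0.5))    # -> [ 2.5  0.  -0.   1. ]: small coefs zeroed
print(l2_shrink(b, 0.5))  # all coefs shrunk toward 0, none exactly zero
```

The same soft-thresholding step is what makes the Laplace prior (whose MAP estimate is the lasso) a sparsity-inducing prior.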
Bayesian
Conjugate prior: what a conjugate prior is; a table of conjugate priors
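The classic entry in any conjugate-prior table is Beta–Binomial: a Beta(a, b) prior on a success probability, updated with binomial data, gives a Beta posterior in closed form. A stdlib-only sketch with illustrative numbers:

```python
# Sketch: conjugate prior update. Beta(a, b) is conjugate for the binomial
# likelihood: posterior = Beta(a + successes, b + failures).
a, b = 2.0, 2.0             # prior Beta(2, 2)
successes, failures = 7, 3  # observed data: 7 successes in 10 trials

a_post, b_post = a + successes, b + failures   # posterior Beta(9, 5)
posterior_mean = a_post / (a_post + b_post)    # 9 / 14
print(a_post, b_post, posterior_mean)
```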