
Stats and Probability Theory

How to choose a statistical model?

Are My Data Normally Distributed?

Problems:

  • Excess kurtosis (fourth moment: heavy tails, due to extreme values far from the mean)
  • Excess skewness (third moment: lopsided, asymmetric)
  • Others: lognormal (a random variable whose logarithm is normally distributed), uniform, Weibull, exponential…

Routine:

  • Histogram (its shape depends heavily on the bin size)
  • Stem-and-leaf plots
  • Box plot: symmetrical; Q1 and Q3 the same distance from the median; “whiskers” the same length
  • PP plot (compare the cumulative probabilities of the empirical data to the “test” distribution): a straight line indicates a good fit
  • QQ plot (compare the quantiles of the empirical data to the theoretical ones): straight line
  • Hypothesis testing: Shapiro–Wilk test, Anderson–Darling test
  • How to deal with multivariate data?
  • What is the distribution of the data?
  • How to normalize the data: square root, quarter root, log (see the sketch below)
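A minimal sketch of this routine in Python, assuming numpy, scipy, and matplotlib are available; the lognormal sample here is synthetic, chosen deliberately to fail the normality checks:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.5, size=200)  # deliberately non-normal

# Histogram (shape depends on the bin count) and QQ plot against the normal ideal
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=20)
stats.probplot(x, dist="norm", plot=ax2)  # points on a straight line => approx. normal

# Hypothesis tests: a small p-value means "reject normality"
print("Shapiro-Wilk:", stats.shapiro(x))
print("Anderson-Darling:", stats.anderson(x, dist="norm"))

# A log transform often normalizes right-skewed, lognormal-like data
print("Shapiro-Wilk on log(x):", stats.shapiro(np.log(x)))
```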

DISTRIBUTIONS

STATISTICS

Correlation, Partial correlation
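As a quick illustration (a numpy-only sketch with simulated data): the partial correlation of x and y given z can be computed as the correlation of the residuals after regressing each of x and y on z.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=500)
x = z + rng.normal(scale=0.5, size=500)  # x and y are both driven by z
y = z + rng.normal(scale=0.5, size=500)

def residuals(a, b):
    """Residuals of a after least-squares regression on b (with intercept)."""
    B = np.column_stack([np.ones_like(b), b])
    coef, *_ = np.linalg.lstsq(B, a, rcond=None)
    return a - B @ coef

print("corr(x, y):      ", np.corrcoef(x, y)[0, 1])  # large, driven by z
print("partial corr | z:", np.corrcoef(residuals(x, z), residuals(y, z))[0, 1])  # near 0
```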

BASICS

  • Samples and population
  • Central limit theorem: with a sample size of at least 30, the distribution of sample means is approximately normal regardless of the shape of the population, with mean \mu and standard deviation \sigma/\sqrt{n}. In practice we often do not know \mu or \sigma; in those situations we estimate \mu with the sample mean \bar{x} and \sigma with the sample standard deviation s.
  • Sampling distributions
  • Law of large numbers (see the simulation below):

weak law of large numbers: the sample mean converges in probability to the population mean

strong law of large numbers: the sample mean converges almost surely to the population mean
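A small simulation (numpy only, synthetic exponential data) showing both ideas: each sample mean lands near the population mean (LLN), and across many samples the means are approximately normal with standard deviation \sigma/\sqrt{n} (CLT).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30  # sample size; population is exponential with mu = sigma = 1

# 10,000 samples of size n; take the mean of each sample
means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)

print("mean of sample means:", means.mean())  # ~1.0 (law of large numbers)
print("sd of sample means:  ", means.std())   # ~1/sqrt(30) ~ 0.18 (CLT)
```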

ESTIMATION/INFERENCE (what is the value of the parameter):

“We can be 95% confident that the population mean falls between L and U.”

  • Confidence interval for one mean

1.1 Confidence interval for one mean (known \sigma): z-interval

\bar{x} \pm z_{\alpha/2}\, \sigma/\sqrt{n}

Probability statements are about random variables. The population mean μ is a constant, not a random variable, so it makes no sense to make a probability statement about it.

The length of a confidence interval

The length of the interval is 2\, z_{\alpha/2}\, \sigma/\sqrt{n}: it grows with the confidence level and \sigma, and shrinks as n increases.

As the confidence level decreases, the length of the interval decreases. So, for this factor, we have a bit of a tradeoff! We want a high confidence level, but not so high as to produce such a wide interval as to be useless. That’s why 95% is the most common confidence level used.

1.2 Confidence interval for one mean (unknown \sigma): t-interval

\bar{x} \pm t_{\alpha/2,\, n-1}\, s/\sqrt{n}

Nonnormal data: for large n the t-interval is still approximately valid thanks to the CLT; for small, clearly nonnormal samples, transform the data first or use a nonparametric interval.
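A sketch of the 1.1 z-interval and 1.2 t-interval using scipy; the sample and the “known” \sigma below are made up for illustration.

```python
import numpy as np
from scipy import stats

x = np.array([9.2, 10.1, 9.8, 10.4, 9.5, 10.0, 9.9, 10.3])  # made-up sample
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

# 1.1 Known sigma: z-interval  xbar +/- z_{alpha/2} * sigma / sqrt(n)
sigma = 0.4  # pretend sigma is known
z = stats.norm.ppf(0.975)
print("z-interval:", (xbar - z * sigma / np.sqrt(n), xbar + z * sigma / np.sqrt(n)))

# 1.2 Unknown sigma: t-interval  xbar +/- t_{alpha/2, n-1} * s / sqrt(n)
print("t-interval:", stats.t.interval(0.95, df=n - 1, loc=xbar, scale=s / np.sqrt(n)))
```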

2. Confidence interval for two means

2.1 Two-sample pooled t-interval (the two populations have the same \sigma)

(\bar{x} - \bar{y}) \pm t_{\alpha/2,\, n+m-2}\, s_p \sqrt{\frac{1}{n} + \frac{1}{m}}, \qquad s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}

2.2 Welch’s t-interval (when the variances are not equal): (\bar{x} - \bar{y}) \pm t_{\alpha/2,\, r} \sqrt{s_x^2/n + s_y^2/m}, with r the Welch–Satterthwaite degrees of freedom

2.3 Paired t-interval

\bar{d} \pm t_{\alpha/2,\, n-1}\, s_d/\sqrt{n}, \qquad d_i = x_i - y_i

Taking the difference of the two measurements within each pair removes the pair (“twin”) effect, so the differences can be treated as a single independent sample.

3. Confidence interval for variances

3.1 One variance: \left( \frac{(n-1)s^2}{\chi^2_{\alpha/2,\, n-1}},\ \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\, n-1}} \right)

3.2 Two variances: an F-based interval for the ratio \sigma_x^2/\sigma_y^2, built from s_x^2/s_y^2

4. Confidence interval for proportions

4.1 One proportion: \hat{p} \pm z_{\alpha/2} \sqrt{\hat{p}(1-\hat{p})/n}

4.2 Two proportions: (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2} (see the sketch below)
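A sketch of both proportion intervals via the normal approximation (numpy/scipy; the counts are invented):

```python
import numpy as np
from scipy import stats

z = stats.norm.ppf(0.975)  # 95% confidence

# 4.1 One proportion
successes, n = 56, 100
p_hat = successes / n
half = z * np.sqrt(p_hat * (1 - p_hat) / n)
print("95% CI for p:", (p_hat - half, p_hat + half))

# 4.2 Two proportions
p1, n1, p2, n2 = 0.56, 100, 0.42, 120
half2 = z * np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
print("95% CI for p1 - p2:", (p1 - p2 - half2, p1 - p2 + half2))
```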

5. Sample sizes

5.1 Estimating a mean: n = \left( z_{\alpha/2}\, \sigma / E \right)^2 for a desired margin of error E

5.2 Estimating a proportion for a large population: n = z_{\alpha/2}^2\, p(1-p)/E^2 (use p = 0.5 as the worst case if p is unknown)

5.3 Estimating a proportion for a small, finite population of size N: apply the finite-population correction n = n_0 / (1 + (n_0 - 1)/N) to the large-population answer n_0 (see the sketch below)
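The three sample-size formulas above, as a numpy/scipy sketch (the inputs \sigma, E, p, and N are placeholders):

```python
import numpy as np
from scipy import stats

z = stats.norm.ppf(0.975)  # 95% confidence

# 5.1 Mean: n >= (z * sigma / E)^2 for margin of error E
sigma, E = 10.0, 2.0
print("n for a mean:", int(np.ceil((z * sigma / E) ** 2)))

# 5.2 Proportion, large population: n >= z^2 * p(1-p) / E^2 (p = 0.5 is the worst case)
p, E = 0.5, 0.03
n0 = z**2 * p * (1 - p) / E**2
print("n for a proportion:", int(np.ceil(n0)))

# 5.3 Small, finite population of size N: finite-population correction
N = 2000
print("n with finite N:", int(np.ceil(n0 / (1 + (n0 - 1) / N))))
```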

HYPOTHESIS TESTING: is the value of the parameter θ such and such?

(1) We’ll make an initial assumption about the population parameter (null hypothesis).

(2) We’ll collect evidence or else use somebody else’s evidence (in either case, our evidence will come in the form of data).

(3) Based on the available evidence (data), we’ll decide whether to “reject” or “not reject” our initial assumption.

  1. State H_0: p = … and H_1: p > … (or <, or ≠).
  2. Calculate the test statistic, whose distribution under H_0 is known (here, standard normal).
  3. If using the critical-value method: determine the critical/rejection region (the “size” of the critical region is 0.05 if the significance level of the test is 0.05).
  4. Make a decision: if the test statistic lies in the rejection region, we reject the null hypothesis, because under the null hypothesis such an observation/sample would be too extreme.
  5. If using the p-value approach: the p-value is the probability, under H_0, of a result as extreme as (or more extreme than) the one observed; equivalently, it is the smallest \alpha-level that would lead to rejection.
  6. Make a decision: if the p-value < \alpha, reject H_0. The smaller the p-value, the stronger the evidence, because it tells the investigator that the hypothesis under consideration may not adequately explain the observation.
  7. Errors: Type I (false positive) rate \alpha = P(reject H_0 | H_0 true) = the significance level of the test.
  8. Two-tailed test: if using the p-value approach, multiply the one-tailed p-value by 2; if using the rejection-region approach, split \alpha into \alpha/2 per tail.
  9. Comparing two proportions: a two-sample z-test on \hat{p}_1 - \hat{p}_2.
  • Tests about one mean (see the sketch below)
  1. Z-test: when the population variance is known and the population mean is unknown (unrealistic in practice).
  2. One-sample t-test: when both the population variance and mean are unknown; degrees of freedom: n − 1. The corresponding 95% confidence interval for the mean \mu is \bar{x} \pm t_{0.025,\, n-1}\, s/\sqrt{n}, and we can be 95% confident that the mean lies in this interval.
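A one-sample t-test sketch following steps 1–6 above (scipy; the data and \mu_0 are made up):

```python
import numpy as np
from scipy import stats

x = np.array([10.2, 9.7, 10.4, 10.1, 9.9, 10.6, 10.3, 10.0])  # made-up sample
mu0 = 10.0  # H0: mu = 10 vs. H1: mu != 10

# Test statistic t = (xbar - mu0) / (s / sqrt(n)), df = n - 1
t_stat, p_value = stats.ttest_1samp(x, popmean=mu0)  # two-tailed p-value
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# Decision at alpha = 0.05: reject H0 iff p < alpha
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```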

  • Tests about two means

  1. Paired t-test (comparing the means of two dependent populations):

We can “remove” the dependence between X and Y by subtracting the two measurements X_i and Y_i for each pair of twins i, that is, by considering the independent differences d_i = X_i − Y_i.

Test statistic: t = \bar{d} / (s_d / \sqrt{n}), with n − 1 degrees of freedom.
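A scipy sketch with invented twin data; note that the paired test is exactly a one-sample t-test on the differences:

```python
import numpy as np
from scipy import stats

x = np.array([12.1, 11.4, 13.0, 12.5, 11.8])  # e.g., first twin in each pair
y = np.array([11.6, 11.0, 12.4, 12.3, 11.2])  # second twin

print(stats.ttest_rel(x, y))          # paired t-test
print(stats.ttest_1samp(x - y, 0.0))  # identical: one-sample t-test on d = x - y
```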

 

  • Test of the equality of two means (independent, unpaired)
  1. When the population variances are equal: pooled two-sample t-test

t = \frac{\bar{x} - \bar{y}}{s_p \sqrt{\frac{1}{n} + \frac{1}{m}}}, \qquad s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n + m - 2}

The test statistic follows a t_{n+m-2} distribution. s_p^2, the pooled sample variance, is an unbiased estimator of the common variance \sigma^2.

2. When the population variances are not equal: Welch’s t-test

t = \frac{\bar{x} - \bar{y}}{\sqrt{s_x^2/n + s_y^2/m}}, \qquad r = \frac{\left( s_x^2/n + s_y^2/m \right)^2}{\frac{(s_x^2/n)^2}{n-1} + \frac{(s_y^2/m)^2}{m-1}}

The test statistic follows (approximately) a t_r distribution. If r doesn’t equal an integer, as it usually doesn’t, then we take the integer portion of r; that is, we use [r] if necessary.
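Both independent-sample tests in one scipy sketch (made-up groups); the equal_var flag switches between the pooled and Welch versions:

```python
import numpy as np
from scipy import stats

a = np.array([5.1, 4.8, 5.6, 5.0, 5.3, 4.9])
b = np.array([4.2, 4.6, 4.0, 4.4, 4.8, 4.1, 4.5])

# Pooled t-test assumes equal variances (t with n+m-2 degrees of freedom)
print("pooled:", stats.ttest_ind(a, b, equal_var=True))

# Welch's t-test drops that assumption (approximate t_r distribution)
print("Welch: ", stats.ttest_ind(a, b, equal_var=False))
```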

 

  • Tests for variances
  1. One variance

\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}

The test statistic follows a chi-square distribution with n−1 degrees of freedom.

2. Two variances

F = s_x^2 / s_y^2

The test statistic follows an F distribution with n−1 numerator degrees of freedom and m−1 denominator degrees of freedom.
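scipy has no direct helper for these two classical tests (it offers Bartlett’s and Levene’s tests instead), so here is a hand-rolled sketch with invented data, using scipy only for the \chi^2 and F reference distributions:

```python
import numpy as np
from scipy import stats

x = np.array([4.1, 3.9, 4.5, 4.0, 4.3, 3.8, 4.6, 4.2])
y = np.array([5.0, 4.4, 5.3, 4.8, 5.6, 4.9])
n, m = len(x), len(y)

# One variance, H0: sigma^2 = sigma0^2; chi2 = (n-1)s^2/sigma0^2 ~ chi2(n-1)
sigma0_sq = 0.04
chi2 = (n - 1) * x.var(ddof=1) / sigma0_sq
p = 2 * min(stats.chi2.cdf(chi2, n - 1), stats.chi2.sf(chi2, n - 1))  # two-sided
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")

# Two variances, H0: sigma_x^2 = sigma_y^2; F = s_x^2/s_y^2 ~ F(n-1, m-1)
F = x.var(ddof=1) / y.var(ddof=1)
p = 2 * min(stats.f.cdf(F, n - 1, m - 1), stats.f.sf(F, n - 1, m - 1))
print(f"F = {F:.2f}, p = {p:.3f}")
```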

  • Tests Concerning Regression and Correlation

  1. Test for slope: t = \hat{\beta}_1 / SE(\hat{\beta}_1), which follows a t_{n-2} distribution under H_0: \beta_1 = 0
  2. Tests for correlation

If (X_i, Y_i) follows a bivariate normal distribution, then testing for the independence of X and Y is equivalent to testing whether the correlation coefficient \rho equals 0, i.e., testing H_0: \rho = 0. The test statistic is t = r\sqrt{n-2}/\sqrt{1-r^2}, with n − 2 degrees of freedom.
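scipy’s pearsonr computes exactly this test (its p-value assumes bivariate normality); the data below are simulated:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 0.4 * x + rng.normal(scale=0.9, size=50)

r, p = stats.pearsonr(x, y)  # H0: rho = 0
print(f"r = {r:.3f}, p = {p:.4f}")  # small p => reject independence
```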

A Type I error occurs if we reject the null hypothesis H_0 (in favor of the alternative hypothesis H_A) when the null hypothesis H_0 is true. We denote \alpha = P(Type I Error).

A Type II error occurs if we fail to reject the null hypothesis H_0 when the alternative hypothesis H_A is true. We denote \beta = P(Type II Error).

  1. Power = 1 − \beta = 1 − P(Type II error): the probability of correctly rejecting H_0 (a true positive).
  2. Calculate the sample size needed to achieve a given power (see the sketch below).

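A power/sample-size sketch using statsmodels’ power calculator (assumed installed); the effect size here is Cohen’s d, chosen arbitrarily:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# n per group for a two-sample t-test to detect d = 0.5
# at alpha = 0.05 with power 1 - beta = 0.80
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print("n per group:", round(n))  # ~64

# Conversely, the power achieved with 30 per group
print("power at n=30:", analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=30))
```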

Every hypothesis test that we derived in the hypothesis-testing section is a likelihood ratio test.

ANOVA (Analysis of Variance)

group/treatment/factor: the feature under study; different groups correspond to different values (levels) of that feature

  • One-factor ANOVA: use the analysis-of-variance method to compare the equality of the (unknown) means \mu_1, \mu_2, …, \mu_m of m normal distributions with an unknown but common variance \sigma^2. What if an assumption is violated?
  1. Normality: (1) transform the data so that the distributions become approximately normal, or (2) use the nonparametric Kruskal–Wallis H test, which does not require normality.
  2. Homogeneity of variances: (1) Welch’s ANOVA or (2) the Brown–Forsythe test.
  • Testing more than one mean at once: if the variance between groups is close to the variance within groups, we do not reject H_0 and conclude the means are equal; if the variance between groups is much larger than the variance within groups, the means are not all the same. Here only one factor/treatment affects the data.

ANOVA table

Post hoc tests (a posteriori tests): if we want to know which two means are different.

Why not use multiple t-tests: we would need many t-tests instead of one ANOVA, and running several tests inflates the overall (family-wise) Type I error rate. See the sketch below.
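A one-factor ANOVA sketch with three invented groups (scipy); tukey_hsd for the post hoc comparisons needs a reasonably recent SciPy:

```python
import numpy as np
from scipy import stats

g1 = np.array([23.0, 25.1, 24.3, 26.0, 24.8])
g2 = np.array([27.2, 26.5, 28.0, 27.8, 26.9])
g3 = np.array([24.0, 25.5, 24.7, 25.2, 24.9])

print(stats.f_oneway(g1, g2, g3))  # assumes normality and equal variances
print(stats.kruskal(g1, g2, g3))   # nonparametric alternative

# Post hoc: which pairs of means differ, controlling family-wise error
print(stats.tukey_hsd(g1, g2, g3))
```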

Two-factor ANOVA (with interaction):

When testing these hypotheses, the important thing to remember is that we have to evaluate the significance of the interaction as our first step in looking at the output. If the interaction is significant, we cannot meaningfully interpret the main effects on their own.

  • MANOVA (Multivariate Analysis of Variance)

data on p variables


Assumptions: multivariate normality within each group, equal covariance matrices across groups, and independent observations.

Test statistics: Wilks’ Lambda, Pillai’s trace, Hotelling–Lawley trace, Roy’s largest root.

NONPARAMETRIC METHODS

  • Chi-square test (how “well” do the data “fit” the probability model; is the sample representative of the population?). Q_1 = \sum_{\text{cells}} (\text{observed} - \text{expected})^2 / \text{expected} is distributed as chi-square with one degree of freedom; the expected number of successes must be at least 5 (that is, np_1 ≥ 5) and the expected number of failures must be at least 5 (that is, n(1−p_1) ≥ 5), because the normal approximation (central limit theorem) is used. Extension to k categories: Q_{k-1} = \sum_{i=1}^{k} (y_i - np_i)^2/(np_i) \sim \chi^2_{k-1}.
  • Contingency table
  1. Homogeneity (whether two or more multinomial distributions are equal)

If there are more than two samples, that is, if h > 2, the chi-square statistic follows an approximate chi-square distribution with h(k−1) − (k−1) = (h−1)(k−1) degrees of freedom: each of the h samples contributes k − 1 free cells, and we estimate k − 1 common cell probabilities.

2. Independence: testing the independence of two categorical variables

The sampling scheme differs from the homogeneity test: here a single sample is cross-classified on two categorical variables, rather than separate samples being drawn from each population.

Degrees of freedom: a table with hk cells has hk − 1 free probabilities; under independence we estimate (h − 1) + (k − 1) marginal parameters, leaving (hk − 1) − (h + k − 2) = (h − 1)(k − 1).
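Both chi-square tests in scipy, with invented counts; note the degrees of freedom match the derivations above:

```python
import numpy as np
from scipy import stats

# Goodness of fit: do observed counts match hypothesized proportions? (chi2, k-1 df)
observed = np.array([43, 52, 25])
expected = np.array([0.4, 0.4, 0.2]) * observed.sum()
print(stats.chisquare(observed, f_exp=expected))

# Independence of two categorical variables from an h x k contingency table
table = np.array([[30, 10, 20],
                  [20, 25, 15]])
chi2, p, dof, exp = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")  # dof = (2-1)(3-1) = 2
```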

How to impose sparsity on a model (see the sketch after this list):

  1. L1 penalty
  2. L2 penalty
  3. Laplace prior
  4. Factorized Laplace
  5. Cauchy prior
  6. Student-t prior
  7. Spike and slab prior
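A sketch of the first two items via scikit-learn (assumed installed): L1 (Lasso, the MAP estimate under a factorized Laplace prior) drives coefficients exactly to zero, while L2 (Ridge, a Gaussian prior) only shrinks them. The data are simulated with two truly relevant features.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)  # 2 true features

print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_.round(2))  # mostly exact zeros
print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_.round(2))  # small but nonzero
```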

Norm

L_\infty norm: \|x\|_\infty = \max_i |x_i|

 

Bayesian

Conjugate prior: a prior is conjugate to a likelihood if the resulting posterior belongs to the same family as the prior; standard conjugate-prior tables list the common pairs (e.g., Beta–Binomial, Gamma–Poisson, Normal–Normal).
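The classic table entry, as a scipy sketch: a Beta(a, b) prior is conjugate to the binomial likelihood, so the posterior is Beta(a + successes, b + failures); the counts below are made up.

```python
from scipy import stats

a, b = 2, 2           # Beta prior (pseudo-counts)
heads, flips = 7, 10  # observed binomial data

posterior = stats.beta(a + heads, b + flips - heads)  # Beta(9, 5)
print("posterior mean:", posterior.mean())  # (a + heads) / (a + b + flips) = 9/14
```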

 

 
