machine learning

# Regression

#### Simple Linear Regression

• Objective: model the expected value of a continuous response, Y, as a linear function of a continuous predictor, X: $E(Y_i) = \beta_0 + \beta_1 x_i$
• Model structure: $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
• Model assumptions
1. The mean of the response, $E(Y_i)$, at each value of the predictor, $x_i$, is a Linear function of $x_i$.
2. The errors, $\varepsilon_i$, are Independent. Violation of this assumption, typically in time-ordered data, is known as autocorrelation.
3. The errors, $\varepsilon_i$, at each value of the predictor, $x_i$, are Normally distributed.
4. The errors, $\varepsilon_i$, at each value of the predictor, $x_i$, have Equal variances (denoted $\sigma^2$). This property is known as homoskedasticity; non-constant variance is referred to as heteroskedasticity.
5. With more than one predictor, the independent variables should not be strongly correlated with one another. Violation of this condition is known as multicollinearity.
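As a minimal sketch of how the model structure above is fitted, the least-squares estimates of the intercept and slope have a closed form. The function name and the data below are made up for illustration.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1), the least-squares estimates of intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Made-up data lying exactly on the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_simple_linear_regression(x, y)
print(b0, b1)  # prints 1.0 2.0
```

With noisy data the same formulas apply; the estimates then minimize the sum of squared residuals rather than reproducing the points exactly.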

skewness and kurtosis: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm

Multicollinearity refers to predictors that are correlated with other predictors in the model. Coefficient estimates for ordinary least squares rely on the independence of the model terms. When terms are correlated and the columns of the design matrix have an approximate linear dependence, the design matrix becomes close to singular. As a result, the least-squares estimate becomes highly sensitive to random errors in the observed response and has a large variance. Multicollinearity can arise, for example, when data are collected without an experimental design.
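One way to see the "close to singular" design matrix numerically is its condition number. The sketch below assumes NumPy and uses made-up predictors: one pair that is unrelated and one pair where the second column is almost a copy of the first.

```python
import numpy as np


def design_condition_number(x1, x2):
    """Condition number of the design matrix with columns [1, x1, x2]."""
    X = np.column_stack([np.ones_like(x1), x1, x2])
    return np.linalg.cond(X)


x1 = np.linspace(0.0, 1.0, 50)
wiggle = np.cos(7.0 * x1)  # not a linear function of x1

cond_ok = design_condition_number(x1, wiggle)
# Second predictor is x1 plus a tiny perturbation: near-collinear columns.
cond_bad = design_condition_number(x1, x1 + 1e-6 * wiggle)

print(f"unrelated predictors:   {cond_ok:.1f}")
print(f"near-collinear columns: {cond_bad:.3g}")
```

A very large condition number means small changes in the response can produce large changes in the estimated coefficients.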

• Parameter interpretation
1. Interpret the constant term $\beta_0$ with care. Setting all predictor variables to zero is often impossible, or the all-zero setting lies outside the range of the data, so the intercept may have no meaningful interpretation. The constant also acts as the "garbage collector" of the regression model: it absorbs any bias not accounted for by the other terms, which guarantees that the residuals have a mean of zero.
• Model fit
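1. Regression standard error (residual standard error): $s = \sqrt{MSE}$. For comparison, the sample variance of the response is $s_y^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}$, while the mean square error is $MSE = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}$.
2. The coefficient of determination, $r^2$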

$r^2 \times 100$ percent of the variation in y is reduced by taking into account predictor x.

$r^2 \times 100$ percent of the variation in y is "explained by" the variation in predictor x.

Not causation!
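The two fit measures listed above can be computed directly from the residuals. This is a small sketch with made-up data; the function name is hypothetical, and the n − 2 divisor assumes a simple linear regression with one slope and one intercept.

```python
import math


def fit_measures(x, y):
    """Return (residual standard error, r squared) for a simple linear fit."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    # SSE: variation left over after using x; SSTO: total variation in y.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    s = math.sqrt(sse / (n - 2))  # square root of the mean square error
    r2 = 1.0 - sse / ssto         # share of variation "explained by" x
    return s, r2


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
s, r2 = fit_measures(x, y)
print(round(s, 3), round(r2, 3))  # approximately 0.894 and 0.6
```

Here SSE is 2.4 and SSTO is 6, so taking x into account reduces the variation in y by about 60 percent.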

R-squared cautions:

### Caution # 1

The coefficient of determination $r^2$ and the correlation coefficient $r$ quantify the strength of a linear relationship. It is possible that $r^2 = 0\%$ and $r = 0$, suggesting there is no linear relation between x and y, and yet a perfect curved (or “curvilinear”) relationship exists.
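A tiny made-up example illustrates this caution: the points below lie exactly on the parabola y = x², yet their correlation coefficient is zero. The helper name is hypothetical.

```python
import math


def pearson_r(x, y):
    """Sample correlation coefficient between x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [xi ** 2 for xi in x]  # perfect curved relationship
r = pearson_r(x, y)
print(r, r ** 2)  # both are zero: no linear association at all
```

Plotting the data before trusting r or r² is the usual safeguard.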

### Caution # 2

A large $r^2$ value should not be interpreted as meaning that the estimated regression line fits the data well. Another function might better describe the trend in the data.

### Caution # 3

The coefficient of determination $r^2$ and the correlation coefficient $r$ can both be greatly affected by just one data point (or a few data points).

### Caution # 4

Correlation (or association) does not imply causation. A randomized experiment can support a causal conclusion; an observational study generally cannot.

### Caution # 5

Ecological correlations — correlations that are based on rates or averages — tend to overstate the strength of an association.

### Caution # 6

A “statistically significant” $r^2$ value does not imply that the slope $\beta_1$ is meaningfully different from 0. In general, the larger the data set, the easier it is to reject the null hypothesis and claim “statistical significance.”

### Caution # 7

A large $r^2$ value does not necessarily mean that a useful prediction of the response $y_{new}$, or estimation of the mean response $\mu_Y$, can be made. It is still possible to get prediction intervals or confidence intervals that are too wide to be useful.

#### Linear Regression

• Parameter estimates and interpretation: $\hat{\beta}_1$ is the estimate of the slope, and so on. Do you recall the interpretation of the intercept and the slope?
• Model fit: F-statistic
• Model selection: from a plethora of possible predictors, which variables should be included?

Solve linear regression with the SVD
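A minimal sketch of the SVD route, assuming NumPy: with X = U S Vᵀ, the least-squares solution is β = V S⁻¹ Uᵀ y, where negligibly small singular values are dropped instead of inverted. The function name, the `rcond` threshold, and the data are illustrative choices, not part of the original notes.

```python
import numpy as np


def lstsq_via_svd(X, y, rcond=1e-12):
    """Least-squares coefficients computed from the thin SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Invert only singular values that are not negligibly small.
    keep = s > rcond * s.max()
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]
    return Vt.T @ (s_inv * (U.T @ y))


# Made-up data on the line y = 1 + 2x; first column is the intercept.
x = np.linspace(0.0, 4.0, 9)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

beta = lstsq_via_svd(X, y)
print(beta)  # close to [1, 2]
```

Dropping tiny singular values yields the minimum-norm solution when the design matrix is rank-deficient, which is why this route is more robust than forming and inverting XᵀX.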

Data transformation: log10 transform

Multicollinearity: how to detect and remove it
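One common detection tool is the variance inflation factor (VIF): regress each predictor on all the others and compute 1 / (1 − R²). The sketch below assumes NumPy and simulated predictors; the function name is hypothetical, and the threshold of about 10 mentioned afterwards is a rule of thumb rather than a hard limit.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF for each column of X (n rows, p predictors, no intercept column)."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        # Regress predictor j on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)


rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)              # unrelated to x1
x3 = x1 + 0.05 * rng.normal(size=200)  # nearly a copy of x1
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)
```

Predictors with a VIF above roughly 10 are often flagged. Typical remedies are dropping or combining the offending predictors, or switching to a penalized fit such as ridge regression.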

Assumptions:

A. Data follow a normal distribution
B. All the key covariates are included in the model
C. Xs are fixed and known
D. Responses are independent; residuals are … distributed

Residual analysis: plot of residuals vs. fitted values

#### Summary of advantages of GLMs over traditional (OLS) regression

• Assumptions:
• The data Y1, Y2, …, Yn are independently distributed, i.e., cases are independent.
• The dependent variable Y does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal, …).
• GLM does NOT assume a linear relationship between the dependent variable and the independent variables, but it does assume a linear relationship between the transformed response, in terms of the link function, and the explanatory variables; e.g., for binary logistic regression logit(π) = β0 + βX.
• Independent (explanatory) variables can even be power terms or other nonlinear transformations of the original independent variables.
• The homogeneity of variance does NOT need to be satisfied. In fact, it is not even possible in many cases given the model structure, and overdispersion (when the observed variance is larger than what the model assumes) may be present.
• Errors need to be independent but NOT normally distributed.
• It uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus relies on large-sample approximations.
• Goodness-of-fit measures rely on sufficiently large samples, where a heuristic rule is that no more than 20% of the expected cell counts are less than 5.
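To make the maximum likelihood point concrete, here is a minimal sketch of fitting the binary logistic model logit(π) = β0 + βX by Newton's method (equivalent to iteratively reweighted least squares). It assumes NumPy, a fixed number of iterations, and made-up data that are not perfectly separable; the function name is hypothetical.

```python
import numpy as np


def fit_logistic_newton(X, y, n_iter=25):
    """Maximum likelihood estimate for logistic regression via Newton steps."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))  # inverse logit link
        w = pi * (1.0 - pi)                     # binomial variance weights
        score = X.T @ (y - pi)                  # gradient of log-likelihood
        info = X.T @ (X * w[:, None])           # Fisher information matrix
        beta = beta + np.linalg.solve(info, score)
    return beta


# Made-up binary responses that overlap, so a finite MLE exists.
x = np.arange(8, dtype=float)
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

beta = fit_logistic_newton(X, y)
pi_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))
print(beta)
```

Note that the weights `w` vary with the fitted mean, which is the sense in which a GLM does not require homogeneity of variance.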

#### Other Regression Models

Ridge regression
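Ridge regression adds an L2 penalty λ‖β‖² to the least-squares objective, which gives the closed form β = (XᵀX + λI)⁻¹Xᵀy. The sketch below assumes NumPy, simulated data, and no intercept term (in practice the data are usually centered and the intercept is left unpenalized); `ridge_fit` and `lam` are illustrative names.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients for penalty strength lam >= 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

beta_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 reduces to ordinary least squares
beta_ridge = ridge_fit(X, y, lam=10.0)  # coefficients shrink toward zero
print(beta_ols, beta_ridge)
```

The added λI keeps the matrix well conditioned even when the predictors are highly correlated.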

Lasso regression: coordinate descent, Least Angle Regression (high-dimensional data), information-criteria-based model selection (AIC, BIC)
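A minimal sketch of coordinate descent for the Lasso, assuming the objective (1/(2n))‖y − Xβ‖² + α‖β‖₁, NumPy, simulated data, and a fixed number of sweeps instead of a convergence check. Each coordinate update is a soft-thresholding step; the function names are hypothetical.

```python
import numpy as np


def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha, returning 0 if |rho| <= alpha."""
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)


def lasso_coordinate_descent(X, y, alpha, n_sweeps=200):
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid / n
            beta[j] = soft_threshold(rho, alpha) / col_sq[j]
    return beta


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, 0.0, -2.0]) + 0.1 * rng.normal(size=100)

alpha_max = np.max(np.abs(X.T @ y)) / len(y)  # smallest alpha that zeroes everything
beta_unpenalized = lasso_coordinate_descent(X, y, alpha=0.0)
beta_sparse = lasso_coordinate_descent(X, y, alpha=0.1)
beta_zero = lasso_coordinate_descent(X, y, alpha=1.01 * alpha_max)
print(beta_unpenalized, beta_sparse, beta_zero)
```

With α = 0 the sweeps converge to the least-squares solution, and at or above `alpha_max` every coefficient is exactly zero, which is what makes the Lasso a feature-selection method.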

Multi-task Lasso: the constraint is that the selected features are the same for all the regression problems, also called tasks; in other words, joint feature selection.

Elastic Net: Elastic-net is useful when there are multiple features which are correlated with one another. Lasso is likely to pick one of these at random, while elastic-net is likely to pick both.

Bayesian regression (conjugate)

Polynomial regression: extending to nonlinear with basis functions
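Polynomial regression is still linear in the coefficients: the basis functions 1, x, x², … become columns of the design matrix, and ordinary least squares does the rest. A minimal sketch assuming NumPy and made-up noise-free data; `fit_polynomial` is an illustrative name.

```python
import numpy as np


def fit_polynomial(x, y, degree):
    """Least-squares coefficients for the basis 1, x, ..., x**degree."""
    # Each column of Phi is one basis function evaluated at every x.
    Phi = np.vander(x, degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return coef


x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x - 3.0 * x ** 2  # made-up quadratic, no noise
coef = fit_polynomial(x, y, degree=2)
print(coef)  # close to [1, 2, -3]
```

Any other set of basis functions (splines, Fourier terms) can be swapped in by changing how `Phi` is built.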

Support vector regression

Localized…

Weighted least squares

Robust regression

https://onlinecourses.science.psu.edu/stat501/node/353