Simple Linear Regression
 Objective: model the expected value of a continuous variable, Y, as a linear function of the continuous predictor, X, E(Y_{i}) = β_{0} + β_{1}x_{i}
 Model structure: Y_{i} = β_{0} + β_{1}x_{i} + ε_{i}
 Model assumptions
 The mean of the response, E(Y_{i}), at each value of the predictor, x_{i}, is a Linear function of x_{i}.
 The errors, ε_{i}, are Independent. Violation of this assumption (correlated errors) is known as autocorrelation.
 The errors, ε_{i}, at each value of the predictor, x_{i}, are Normally distributed.
 The errors, ε_{i}, at each value of the predictor, x_{i}, have Equal variances (denoted σ^{2}). This property is known as homoskedasticity; nonconstant variance is referred to as heteroskedasticity.
 With more than one predictor, the independent variables should not be strongly correlated with one another. Violation of this condition is known as multicollinearity.
skewness and kurtosis: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
Coefficient estimates for Ordinary Least Squares rely on the independence of the model terms. When terms are correlated and the columns of the design matrix have an approximate linear dependence, the design matrix becomes close to singular; as a result, the least-squares estimate becomes highly sensitive to random errors in the observed response, producing a large variance. Multicollinearity can arise, for example, when data are collected without an experimental design.
Multicollinearity refers to predictors that are correlated with other predictors in the model.
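The near-singularity described above can be illustrated with the condition number of the design matrix. This is a minimal sketch assuming NumPy is available; the simulated data, the noise level 0.001, and the helper name design_matrix are illustrative assumptions, not part of these notes.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 200
x1 = rng.normal(size=n)

# Case A: the second predictor is unrelated to the first.
x2_independent = rng.normal(size=n)
# Case B: the second predictor is almost a copy of the first.
x2_collinear = x1 + 0.001 * rng.normal(size=n)

def design_matrix(a, b):
    """Design matrix with an intercept column and two predictors."""
    return np.column_stack([np.ones(len(a)), a, b])

# A large condition number means the matrix is close to singular.
cond_independent = np.linalg.cond(design_matrix(x1, x2_independent))
cond_collinear = np.linalg.cond(design_matrix(x1, x2_collinear))

print(f"condition number, independent predictors: {cond_independent:.1f}")
print(f"condition number, collinear predictors:   {cond_collinear:.1f}")
```

A condition number that is orders of magnitude larger in the collinear case is one practical warning sign; variance inflation factors are a common alternative diagnostic.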
 Parameter interpretation
 Interpret the constant term with care: a zero setting for all of the predictor variables is often impossible, and it can lie outside the range of the data, in which case the constant has no meaningful interpretation. The constant also acts as the “garbage collector” of the regression model, absorbing any bias that is not accounted for by the other terms in the model. Including the constant guarantees that the residuals have a mean of zero.
 Model fit
 Regression standard error (residual standard error): S = sqrt(MSE). Sample variance: s^{2} = Σ(y_{i} − ȳ)^{2} / (n − 1). Mean square error: MSE = Σ(y_{i} − ŷ_{i})^{2} / (n − 2).
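The quantities in this section can be computed by hand for a small data set. The sketch below assumes the usual least-squares formulas for simple linear regression (slope = S_xy / S_xx, MSE with n − 2 degrees of freedom); the five data points are made up for illustration.

```python
import math

# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates of the slope (b1) and intercept (b0).
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Model-fit summaries.
sse = sum(e ** 2 for e in residuals)        # error sum of squares
ssto = sum((yi - y_bar) ** 2 for yi in y)   # total sum of squares
mse = sse / (n - 2)                         # mean square error
s = math.sqrt(mse)                          # regression standard error
r2 = 1 - sse / ssto                         # coefficient of determination

print(f"b0={b0:.3f}, b1={b1:.3f}, MSE={mse:.4f}, S={s:.4f}, R^2={r2:.4f}")
print(f"sum of residuals = {sum(residuals):.2e}")  # ~0 because of the intercept
```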

The Coefficient of Determination, R^{2}
“r^{2} × 100 percent of the variation in y is reduced by taking into account predictor x.”
“r^{2} × 100 percent of the variation in y is ‘explained by’ the variation in predictor x.”
Not causation!
Caution # 1
The coefficient of determination r^{2} and the correlation coefficient r quantify the strength of a linear relationship. It is possible that r^{2} = 0% and r = 0, suggesting there is no linear relation between x and y, and yet a perfect curved (or “curvilinear” relationship) exists.
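Caution # 1 can be checked numerically. In this sketch the data are an illustrative assumption: y is an exact quadratic function of x on a grid that is symmetric about zero, so the relationship is perfect but not linear.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [xi ** 2 for xi in x]   # perfect curved relationship

r = pearson_r(x, y)
print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}")
```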
Caution # 2
A large r^{2} value should not be interpreted as meaning that the estimated regression line fits the data well. Another function might better describe the trend in the data.
Caution # 3
The coefficient of determination r^{2} and the correlation coefficient r can both be greatly affected by just one data point (or a few data points).
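Caution # 3 can be illustrated with a made-up example: five points with no linear association, then the same points plus one extreme observation. The data values are assumptions chosen only to make the effect visible.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Five points with no linear trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 1.0, 2.0, 1.0, 3.0]
r2_without = pearson_r(x, y) ** 2

# The same points plus a single extreme observation at (20, 20).
r2_with = pearson_r(x + [20.0], y + [20.0]) ** 2

print(f"r^2 without the extra point: {r2_without:.3f}")
print(f"r^2 with the extra point:    {r2_with:.3f}")
```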
Caution # 4
Correlation (or association) does not imply causation. A causal conclusion generally requires a designed experiment rather than an observational study.
Caution # 5
Ecological correlations — correlations that are based on rates or averages — tend to overstate the strength of an association.
Caution # 6
A “statistically significant” r^{2} value does not imply that the slope β_{1} is meaningfully different from 0. In general, the larger the data set, the easier it is to reject the null hypothesis and claim “statistical significance.”
Caution # 7
A large r^{2} value does not necessarily mean that a useful prediction of the response y_{new}, or estimation of the mean response µ_{Y}, can be made. It is still possible to get prediction intervals or confidence intervals that are too wide to be useful.
3. (Pearson) Correlation Coefficient r
4. Hypothesis Test for the Population Correlation Coefficient
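A common form of this test, assumed here because the notes do not spell it out, takes H0: ρ = 0 and uses the statistic t = r·sqrt(n − 2) / sqrt(1 − r^{2}), compared against a t distribution with n − 2 degrees of freedom. The sketch below computes r and t for made-up data; the function names are illustrative.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def correlation_t_test(x, y):
    """Return (r, t, df) for testing H0: rho = 0."""
    n = len(x)
    r = pearson_r(x, y)
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return r, t, n - 2

# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r, t, df = correlation_t_test(x, y)
print(f"r = {r:.4f}, t = {t:.2f}, df = {df}")
# Reject H0 at the 5% level (two-sided) when |t| exceeds the critical
# value of the t distribution with df degrees of freedom (3.182 for df = 3).
```

In simple linear regression this t statistic coincides with the t statistic for testing that the slope β_{1} is zero.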
Linear Regression
 Parameter estimates and interpretation: β̂_{1} is the estimate of the slope, etc. Do you recall the interpretation of the intercept and the slope?
 Model fit: F-statistic
 Model selection: From a plethora of possible predictors, which variables to include?
Solving linear regression via the SVD
Data transformation: log10 transform
Multicollinearity: how to detect, remove
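For the SVD item in the list above, one possible sketch: with the thin SVD X = U Σ V^{T}, the least-squares solution is β̂ = V Σ^{-1} U^{T} y. The code assumes NumPy; the function name lstsq_via_svd, the rcond cutoff, and the simulated data are illustrative assumptions.

```python
import numpy as np

def lstsq_via_svd(X, y, rcond=1e-12):
    """Least-squares coefficients from the thin SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Invert only singular values that are not negligibly small,
    # which gives the minimum-norm solution when X is rank deficient.
    keep = s > rcond * s.max()
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]
    return Vt.T @ (s_inv * (U.T @ y))

# Simulated data: y = 2 + 0.5 x + noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
X = np.column_stack([np.ones_like(x), x])   # intercept column + predictor
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=50)

beta_svd = lstsq_via_svd(X, y)
beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)  # reference solution

print("SVD solution:      ", beta_svd)
print("np.linalg.lstsq:   ", beta_ref)
```

Small singular values are also a direct signal of multicollinearity, since they mark directions in which the columns of X are nearly linearly dependent.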
Assumptions:
A. Data follow a normal distribution
B. All the key covariates are included in the model
C. Xs are fixed and known
D. Responses are independent; residuals are … distributed
Residual Analysis: Plot of residuals vs. fitted values
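A residuals vs. fitted values check can be sketched without plotting. In this made-up example a straight line is fitted to data generated from y = x + x^{2}, so the residuals show a systematic pattern (positive, negative, positive) instead of random scatter around zero.

```python
# Made-up data with curvature: y = x + x^2.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [xi + xi ** 2 for xi in x]
n = len(x)

# Fit a straight line by least squares.
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Text version of the residuals vs. fitted values plot.
for f, e in zip(fitted, residuals):
    print(f"fitted = {f:5.2f}   residual = {e:5.2f}")
```

A U-shaped pattern like this one suggests that the linearity assumption is violated, even though the residuals still sum to zero.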
Multivariate Regression
Generalized Linear Model
Summary of advantages of GLMs over traditional (OLS) regression
 Assumptions:
 The data Y_{1}, Y_{2}, …, Y_{n} are independently distributed, i.e., cases are independent.
 The dependent variable Y_{i} does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal, …).
 GLM does NOT assume a linear relationship between the dependent variable and the independent variables, but it does assume a linear relationship between the transformed response (in terms of the link function) and the explanatory variables; e.g., for binary logistic regression logit(π) = β_{0} + βX.
 Independent (explanatory) variables can even be power terms or some other nonlinear transformations of the original independent variables.
 The homogeneity of variance does NOT need to be satisfied. In fact, it is not even possible in many cases given the model structure, and overdispersion (when the observed variance is larger than what the model assumes) may be present.
 Errors need to be independent but NOT normally distributed.
 It uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus relies on large-sample approximations.
 Goodness-of-fit measures rely on sufficiently large samples, where a heuristic rule is that not more than 20% of the expected cell counts are less than 5.
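As an illustration of maximum likelihood estimation with a link function, the sketch below fits a binary logistic regression, logit(π) = β_{0} + β_{1}x, by Newton's method (equivalently, iteratively reweighted least squares). It assumes NumPy; the function name fit_logistic_newton and the eight data points are illustrative assumptions, and the data are deliberately not perfectly separable so that the MLE exists.

```python
import numpy as np

def fit_logistic_newton(x, y, n_iter=50, tol=1e-10):
    """MLE for logit(pi) = b0 + b1 * x via Newton's method."""
    X = np.column_stack([np.ones(len(x)), x])    # add intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))   # inverse logit link
        w = pi * (1.0 - pi)                      # binomial variance weights
        score = X.T @ (y - pi)                   # gradient of log-likelihood
        info = X.T @ (X * w[:, None])            # Fisher information
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta, X

# Made-up binary responses that overlap in x (not separable).
x = np.arange(8, dtype=float)
y = np.array([0, 0, 1, 0, 1, 0, 1, 1], dtype=float)

beta, X = fit_logistic_newton(x, y)
pi_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))
score_at_mle = X.T @ (y - pi_hat)   # should be ~0 at the maximum

print("estimated (b0, b1):", beta)
print("score at estimate: ", score_at_mle)
```

Note that the variance weights π(1 − π) change with the mean, which is why the equal-variance assumption of OLS is not imposed here.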
Other Regression Models
Ridge regression
Lasso regression: coordinate descent, Least Angle Regression (high-dimensional data), information-criteria-based model selection (AIC, BIC)
Multi-task Lasso: the constraint is that the selected features are the same for all the regression problems, also called tasks (joint feature selection).
Elastic Net: useful when there are multiple features which are correlated with one another. Lasso is likely to pick one of these at random, while elastic net is likely to pick both.
Bayesian regression (conjugate)
Polynomial regression: extending to nonlinear with basis functions
Support vector regression
Localized…
Weighted least squares
Robust regression
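As one concrete instance from the list above, ridge regression has the closed-form solution β̂ = (X^{T}X + λI)^{-1} X^{T}y when the predictors and response are centered and no intercept is penalized. The sketch below assumes NumPy; the function name ridge_fit, the penalty λ = 10, and the simulated correlated predictors are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients for centered X and y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Two strongly correlated predictors (simulated).
rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)                       # center the predictors
y = x1 + x2 + rng.normal(scale=0.5, size=n)
y = y - y.mean()                             # center the response

beta_ols = ridge_fit(X, y, lam=0.0)          # lam = 0 reduces to OLS
beta_ridge = ridge_fit(X, y, lam=10.0)       # penalized estimate

print("OLS coefficients:  ", beta_ols)
print("ridge coefficients:", beta_ridge)
```

The penalty shrinks the coefficient vector toward zero, which stabilizes the estimates when the predictors are nearly collinear.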