**Elements of a model**

- Objective
- Model structure (e.g. variables, formula, equation, parameters)
- Model assumptions
- Parameter estimates and interpretation
- Model fit (e.g. goodness-of-fit tests and statistics)
- Model selection

LDA

Naive Bayes

Decision Tree

**Logistic Regression (a GLM)**

*Variables*:

*Y*_{i}: a binary response variable. *Y*_{i} = 1 if the trait is present in observation (person, unit, etc.) *i*; *Y*_{i} = 0 if the trait is NOT present in observation *i*.

*X* = (*X*_{1}, *X*_{2}, …, *X*_{k}) is a set of explanatory variables, which can be discrete, continuous, or a combination. *x*_{i} is the observed value of the explanatory variables for observation *i*. In this section of the notes, we focus on a single variable *X*.

*Model*:
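For a single explanatory variable, the standard simple binary logistic regression model can be written as (consistent with the likelihood under *Parameter Estimation* below):

```latex
\pi_i = P(Y_i = 1 \mid x_i)
      = \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)},
\qquad
\operatorname{logit}(\pi_i) = \log\frac{\pi_i}{1 - \pi_i} = \beta_0 + \beta_1 x_i
```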

*Assumptions*:

- The data *Y*_{1}, *Y*_{2}, …, *Y*_{n} are independently distributed, i.e., cases are independent.
- The distribution of *Y*_{i} is *Bin*(*n*_{i}, π_{i}), i.e., the binary logistic regression model assumes a binomial distribution of the response. The dependent variable does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal, …).
- The model does NOT assume a linear relationship between the dependent variable and the independent variables, but it does assume a linear relationship between the logit of the response and the explanatory variables: *logit*(π) = β_{0} + β_{1}*X*.
- Independent (explanatory) variables can even be power terms or other nonlinear transformations of the original independent variables.
- The homogeneity of variance does NOT need to be satisfied; in fact, it is often not even possible given the model structure.
- Errors need to be independent but NOT normally distributed.
- It uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus relies on large-sample approximations.
- Goodness-of-fit measures rely on sufficiently large samples; a heuristic rule is that no more than 20% of the expected cell counts should be less than 5.

*Model Fit*:

- Overall goodness-of-fit statistics of the model; we will consider:
    - Pearson chi-square statistic, *X*^{2}
    - Deviance, *G*^{2}, and the likelihood-ratio test and statistic, Δ*G*^{2}
    - Hosmer-Lemeshow test and statistic
- Residual analysis: Pearson, deviance, adjusted residuals, etc.
- Overdispersion
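As a concrete sketch of the first two statistics, for grouped binomial data (function and argument names here are illustrative, not from any particular library):

```python
import numpy as np

def gof_statistics(y, n, pi_hat):
    """Pearson X^2 and deviance G^2 for grouped binomial data.
    y: observed successes per group, n: group sizes, pi_hat: fitted probs."""
    fitted = n * pi_hat
    # Pearson chi-square: sum of squared Pearson residuals
    X2 = np.sum((y - fitted) ** 2 / (fitted * (1 - pi_hat)))
    # Deviance: 2 * sum of obs * log(obs / fitted), over successes and failures
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = np.where(y > 0, y * np.log(y / fitted), 0.0)
        t2 = np.where(n - y > 0, (n - y) * np.log((n - y) / (n - fitted)), 0.0)
    G2 = 2 * np.sum(t1 + t2)
    return X2, G2
```

Both statistics are approximately chi-square when expected counts are large enough (the 20%-of-cells-below-5 heuristic above).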

*Parameter Estimation*:

The *maximum likelihood estimator* (MLE) for (β_{0}, β_{1}) is the pair (β̂_{0}, β̂_{1}) that maximizes:

$$L(\beta_0, \beta_1) = \prod_{i=1}^{N} \pi_i^{y_i} (1 - \pi_i)^{n_i - y_i} = \prod_{i=1}^{N} \frac{\exp\{y_i(\beta_0 + \beta_1 x_i)\}}{1 + \exp(\beta_0 + \beta_1 x_i)}$$

In general, there is no closed-form solution, so the ML estimates are obtained by iterative algorithms such as *Newton-Raphson* (NR) or *iteratively re-weighted least squares* (IRWLS).
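A minimal Newton-Raphson sketch for the simple logistic model (the name `fit_logistic_nr` is illustrative; each iteration is the standard Fisher-scoring/IRWLS step):

```python
import numpy as np

def fit_logistic_nr(x, y, iters=25):
    """Newton-Raphson for simple logistic regression (sketch).

    Finds (b0, b1) maximizing the Bernoulli log-likelihood."""
    X = np.column_stack([np.ones_like(x, dtype=float), x])  # design matrix
    beta = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # fitted pi_i
        W = p * (1.0 - p)                        # working weights
        grad = X.T @ (y - p)                     # score vector
        H = X.T @ (X * W[:, None])               # Fisher information
        beta += np.linalg.solve(H, grad)         # Newton step
    return beta
```

At the MLE the score equations X'(y - p) = 0 hold, which gives a quick convergence check.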

- Functional and geometric margins

Functional margin: for a training example (x^(i), y^(i)), the functional margin is y^(i)(w^t x^(i) + b). If y^(i) = 1, then for the functional margin to be large (i.e., for our prediction to be confident and correct), we need w^t x + b to be a large positive number; more generally, if y^(i)(w^t x^(i) + b) > 0, then our prediction on this example is correct.

Geometric margin: the distance from a point to the decision boundary; all points x on the decision boundary satisfy the equation w^t x + b = 0.
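The two margin definitions can be computed directly (`margins` is a hypothetical helper for illustration):

```python
import numpy as np

def margins(w, b, X, y):
    """Functional margin y(w.x + b) and geometric margin
    (functional margin scaled by 1/||w||)."""
    functional = y * (X @ w + b)
    geometric = functional / np.linalg.norm(w)
    return functional, geometric
```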

- The optimal margin classifier

- Convex quadratic objective and only linear constraints (quadratic programming)
- Lagrange duality: primal form and dual form;
Lagrange multipliers

Dual optimization

Prediction: the α_i's will all be zero except for the support vectors. Thus, many of the terms in the sum above will be zero, and we really only need the inner products between x and the support vectors (of which there are often only a small number) in order to make our prediction.

- Kernel: we can get SVMs to learn in the high-dimensional feature space given by φ without ever having to explicitly compute or represent the vectors φ(x); valid kernels are symmetric and positive semi-definite.
- Non-separable case and regularization: the parameter C controls the relative weighting between the twin goals of making ||w||^2 small (which, as we saw earlier, makes the margin large) and of ensuring that most examples have functional margin at least 1. The only change to the dual problem is that the constraint 0 ≤ αi has become 0 ≤ αi ≤ C.

- Multi-class SVMs

What sort of optimization problem would you be solving to train a support vector machine? Maximizing the margin (best answer); a quadratic program: convex quadratic objective with only linear constraints; one can also reference solving the primal or dual form.
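Rather than solving the QP itself, one can sketch the equivalent primal soft-margin objective, (1/2)||w||^2 + C · Σ max(0, 1 − y_i(wᵀx_i + b)), by subgradient descent. This is an illustrative stand-in for a real QP solver, with hypothetical names:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on the primal soft-margin objective
    (1/2)||w||^2 + C * sum(max(0, 1 - y_i (w.x_i + b))).
    Labels y must be in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        m = y * (X @ w + b)              # functional margins
        viol = m < 1                     # margin violators (incl. SVs)
        gw = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        gb = -C * y[viol].sum()
        w -= lr * gw
        b -= lr * gb
    return w, b
```

Points with margin below 1 play the role of support vectors, and C trades margin size against violations, mirroring the 0 ≤ αi ≤ C constraint in the dual.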

Two hyperparameters matter here: the regularization parameter C, which controls the relaxation of the constraints in the soft-margin SVM, and the parameter σ, which controls the spread of the Gaussian kernel. With SVM, when a new feature is added, a support vector (SV) is also required for classification in that dimension.

With a boosting algorithm such as AdaBoost, learning proceeds so that the weight vectors are expressed with the **smallest possible number of features**. As a result, classification is performed with few features, and it is possible to analyze which features have a high contribution rate. If the weak learners of AdaBoost are simple thresholding rules (decision stumps), then weighting and selecting the learners selects the features at the same time.

On the other hand, SVM attempts to express the weight vector using the **smallest possible number of cases**, making it difficult to analyze the features from the learned model. If we have a sufficient amount of data from all classes, then randomly subsampling the whole data set will not greatly affect the performance of SVM. Adding a new irrelevant feature, however, can affect the performance of SVM, because it changes the feature space of the data set.

Kernels: linear, Gaussian (RBF), sigmoid, polynomial (with hyperparameters such as order and kernel scale)

**Ensemble Learning**

Boosting, bootstrap, bagging, AdaBoost

Discrete AdaBoost, Real AdaBoost, LogitBoost and Gentle AdaBoost

AdaBoost is a very simple algorithm compared to either neural networks or SVMs, and as a result requires significantly less time and fewer resources to train; it often outperforms them as well. Another favorable characteristic of AdaBoost is that it seems to be resistant to over-fitting.
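A minimal sketch of Discrete AdaBoost with threshold stumps as weak learners (names are illustrative; a real implementation would vectorize the stump search):

```python
import numpy as np

def fit_adaboost_stumps(X, y, rounds=10):
    """Discrete AdaBoost with threshold stumps as weak learners (sketch).

    Because each weak learner thresholds a single feature, selecting
    learners doubles as feature selection. Labels y must be in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                  # example weights
    ensemble = []                            # (alpha, feature, thresh, sign)
    for _ in range(rounds):
        best = None
        for j in range(d):                   # exhaustive stump search
            for t in np.unique(X[:, j]):
                for s in (1.0, -1.0):
                    pred = s * np.where(X[:, j] >= t, 1.0, -1.0)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, s)
        err, j, t, s = best
        err = min(max(err, 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)      # learner weight
        pred = s * np.where(X[:, j] >= t, 1.0, -1.0)
        w *= np.exp(-alpha * y * pred)             # re-weight examples
        w /= w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def predict_adaboost(ensemble, X):
    score = sum(a * s * np.where(X[:, j] >= t, 1.0, -1.0)
                for a, j, t, s in ensemble)
    return np.sign(score)
```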

Random Forests

There are two famous processes adopted in RF. The first step is the bootstrap, where classification trees are constructed concurrently by randomly sampling the data from the dataset with replacement, forming new training sets independently. The next step is bagging, which combines the trees into a classification forest whose (majority) vote decides the result.

For each decision tree in the random forest, an unpruned tree is grown to its largest extent. The root of every tree consists of a different new training subset created by the bootstrap. Each node on the branches is split using the best split among a random subset of the features (this per-node feature sampling is what distinguishes RF from plain bagging). The elements of each leaf share the same class label, and the class labels of the final leaves give the prediction for new data.
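The bootstrap + bagging steps can be sketched as follows, with depth-1 stumps standing in for full unpruned trees to keep the example short (all names are illustrative):

```python
import numpy as np

def fit_bagged_stumps(X, y, n_trees=25, seed=0):
    """Sketch of RF's two steps, with depth-1 trees as stand-ins:
    (1) bootstrap: each learner gets a resample drawn with replacement;
    (2) bagging: majority vote over all learners. y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)          # bootstrap resample
        Xb, yb = X[idx], y[idx]
        j = rng.integers(0, d)                    # random feature, RF-style
        best = None
        for t in np.unique(Xb[:, j]):             # best split on resample
            for s in (1.0, -1.0):
                pred = s * np.where(Xb[:, j] >= t, 1.0, -1.0)
                err = np.mean(pred != yb)
                if best is None or err < best[0]:
                    best = (err, t, s)
        _, t, s = best
        forest.append((j, t, s))
    return forest

def predict_forest(forest, X):
    votes = sum(s * np.where(X[:, j] >= t, 1.0, -1.0) for j, t, s in forest)
    return np.sign(votes)                         # majority vote
```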

**Deep Learning**

**METRICS**

metrics of binary classification

Sensitivity = Recall = True positive rate = TP / (TP + FN)

Specificity = True negative rate = TN / (TN + FP)
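A small sketch computing these two rates from the confusion-matrix counts (hypothetical helper; labels coded 0/1, with 1 = trait present):

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity and specificity from binary labels (1 = trait present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),   # recall, true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }
```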

Multiclass classification

Regression (multiple and multivariate)

Clustering and other dimension reduction