# Classification

Elements of a model

• Objective
• Model structure (e.g. variables, formula, equation, parameters)
• Model assumptions
• Parameter estimates and interpretation
• Model fit (e.g. goodness-of-fit tests and statistics)
• Model selection

LDA

generative model

model p(x|y) as a multivariate Gaussian; both classes share the same covariance matrix Σ

QDA

Each class has its own covariance matrix Σ
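A minimal numpy sketch of the LDA decision rule (toy data and parameters are illustrative): fit class means and a pooled covariance, then classify with the linear discriminant scores. QDA would differ only in using a class-specific covariance, which makes the boundary quadratic.

```python
import numpy as np

rng = np.random.default_rng(0)
# two Gaussian classes sharing one covariance matrix (the LDA assumption)
cov = [[1.0, 0.3], [0.3, 1.0]]
X0 = rng.multivariate_normal([0, 0], cov, 100)
X1 = rng.multivariate_normal([2, 2], cov, 100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

mu = [X[y == k].mean(axis=0) for k in (0, 1)]
# pooled (shared) covariance estimate
S = sum((X[y == k] - mu[k]).T @ (X[y == k] - mu[k]) for k in (0, 1)) / (len(X) - 2)
S_inv = np.linalg.inv(S)

def lda_score(x, k, log_prior=np.log(0.5)):
    # linear discriminant function delta_k(x); with a shared S the
    # quadratic term in x cancels, leaving a linear boundary
    return x @ S_inv @ mu[k] - 0.5 * mu[k] @ S_inv @ mu[k] + log_prior

pred = np.array([int(lda_score(xi, 1) > lda_score(xi, 0)) for xi in X])
print("training accuracy:", (pred == y).mean())
```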

Naive Bayes

generative model

Assume the xj are conditionally independent given y

MLE, Laplace smooth
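A small sketch of MLE with Laplace (add-one) smoothing for a Bernoulli naive Bayes on toy binary features (the data and helper names are made up for illustration); smoothing keeps unseen feature values from getting probability zero.

```python
import numpy as np

# toy binary features, two classes
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [1, 1]])
y = np.array([1, 1, 0, 0, 1, 0])

def fit_bernoulli_nb(X, y, alpha=1.0):
    # MLE of P(x_j = 1 | y = c) with Laplace (add-alpha) smoothing
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        theta = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
        params[int(c)] = (theta, len(Xc) / len(X))
    return params

def predict(params, x):
    # naive Bayes: log prior + sum of per-feature log likelihoods,
    # using the conditional-independence assumption
    scores = {c: np.log(prior) + np.sum(x * np.log(th) + (1 - x) * np.log(1 - th))
              for c, (th, prior) in params.items()}
    return max(scores, key=scores.get)

params = fit_bernoulli_nb(X, y)
print(predict(params, np.array([1, 0])))
```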

Gaussian Naive Bayes

Decision Tree

Logistic Regression (one of GLM)

model p(y = 1|x) as the sigmoid of a linear function of x; equivalently, the log odds (logit) of y = 1 is linear in x
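A quick numeric check of that relationship: the logit (log odds) is the inverse of the sigmoid, so applying one after the other recovers the linear predictor.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# logit(sigmoid(z)) recovers z, so "sigmoid of a linear function"
# and "logit is linear" are the same statement
z = 0.7
p = sigmoid(z)
log_odds = np.log(p / (1 - p))
print(p, log_odds)
```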

Variables:

• Y: a binary response variable. Yi = 1 if the trait is present in observation (person, unit, etc…) i; Yi = 0 if the trait is NOT present in observation i
• X = (X1, X2, …, Xk) is a set of explanatory variables, which can be discrete, continuous, or a combination. xi is the observed value of the explanatory variables for observation i. In this section of the notes, we focus on a single variable X.

Model: logit(πi) = log(πi / (1 − πi)) = β0 + β1xi

Assumptions:

• The data Y1, Y2, …, Yn are independently distributed, i.e., cases are independent.
• Distribution of Yi is Bin(ni, πi), i.e., binary logistic regression model assumes binomial distribution of the response. The dependent variable does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal,…)
• Does NOT assume a linear relationship between the dependent variable and the independent variables, but it does assume linear relationship between the logit of the response and the explanatory variables; logit(π) = β0 + βX.
• Independent (explanatory) variables can be even the power terms or some other nonlinear transformations of the original independent variables.
• The homogeneity of variance does NOT need to be satisfied. In fact, it is not even possible in many cases given the model structure.
• Errors need to be independent but NOT normally distributed.
• It uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus relies on large-sample approximations.
• Goodness-of-fit measures rely on sufficiently large samples; a heuristic rule is that no more than 20% of the expected cell counts are less than 5.

Model Fit:

• Overall goodness-of-fit statistics of the model; we will consider:
1. Pearson chi-square statistic, X2
2. Deviance, G2 and Likelihood ratio test and statistic, ΔG2
3. Hosmer-Lemeshow test and statistic
• Residual analysis: Pearson, deviance, adjusted residuals, etc…
• Overdispersion

Parameter Estimation:

The maximum likelihood estimator (MLE) for (β0, β1) is obtained by finding the pair $(\hat{\beta}_0, \hat{\beta}_1)$ that maximizes:

$L(\beta_0,\beta_1)=\prod_{i=1}^{N}\pi_i^{y_i}(1-\pi_i)^{n_i-y_i}=\prod_{i=1}^{N}\frac{\exp\{y_i(\beta_0+\beta_1 x_i)\}}{\{1+\exp(\beta_0+\beta_1 x_i)\}^{n_i}}$

In general, there are no closed-form solutions, so the ML estimates are obtained using iterative algorithms such as Newton-Raphson (NR) or iteratively re-weighted least squares (IRWLS).
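A minimal Newton-Raphson sketch for the binary (ni = 1) case, on simulated data with assumed "true" coefficients chosen for the demo; each update solves the linear system given by the score vector and the information matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])   # design matrix with intercept
beta_true = np.array([-0.5, 2.0])      # assumed coefficients for the simulation
p = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, p)

beta = np.zeros(2)
for _ in range(25):                    # Newton-Raphson / Fisher scoring updates
    pi = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - pi)              # score vector
    W = pi * (1 - pi)                  # Bernoulli variances (IRWLS weights)
    H = X.T @ (X * W[:, None])         # information matrix
    beta = beta + np.linalg.solve(H, grad)

print("estimates:", beta)
```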

Support Vector Machines

The decision boundary is a maximum-margin hyperplane; the margin is defined by the support vectors, the points that are the most difficult to tell apart.

• Functional and geometric margins

Functional margin: γ̂(i) = y(i)(w^T x(i) + b). If y(i) = 1, then for the functional margin to be large (i.e., for our prediction to be confident and correct), we need w^T x(i) + b to be a large positive number; if y(i)(w^T x(i) + b) > 0, then our prediction on this example is correct.

All points x on the decision boundary satisfy the equation w^T x + b = 0. The geometric margin is the functional margin normalized by ||w||, i.e., the signed distance from the point to the boundary.

• The optimal margin classifier
• Lagrange duality: primal and dual forms; Lagrange multipliers; dual optimization. Prediction: the αi's will all be zero except for the support vectors, so many of the terms in the sum are zero, and we only need the inner products between x and the support vectors (of which there is often only a small number) in order to make our prediction.
• Kernel: we can get SVMs to learn in the high-dimensional feature space given by φ without ever having to explicitly find or represent the vectors φ(x); valid kernels are symmetric and positive semi-definite.

It is not possible to find a hyperplane (a linear decision boundary) for some classification problems. If we project the data into a higher-dimensional space, we may find a hyperplane in the projected space that separates the data.

Why do we not need to calculate the exact feature vectors, but only the inner products? Because both the dual optimization problem and the prediction rule depend on the data only through inner products, which the kernel computes directly.
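A tiny check of the kernel trick (the 2-D quadratic kernel and its feature map are a textbook example, used here purely for illustration): the kernel value equals the inner product of the explicit feature vectors without ever forming them.

```python
import numpy as np

def phi(x):
    # explicit feature map whose inner product equals the quadratic
    # kernel K(x, z) = (x . z)^2 in two dimensions
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
k_trick = (x @ z) ** 2          # O(d) work, never forms phi
k_explicit = phi(x) @ phi(z)    # O(d^2) features, same value
print(k_trick, k_explicit)
```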

• Non-separable case and regularization: the parameter C controls the relative weighting between the twin goals of making ||w||^2 small (which, as we saw earlier, makes the margin large) and ensuring that most examples have functional margin at least 1. The only change to the dual problem is that the constraint 0 ≤ αi becomes 0 ≤ αi ≤ C.
• Multi-class SVMs
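A rough sketch of the soft-margin idea, not a proper QP solver: sub-gradient descent on the primal objective 0.5·||w||² + C·Σ hinge losses, on made-up 2-D data. The learning rate and epoch count are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated 2-D classes, labels in {-1, +1}
X = np.vstack([rng.normal([-2, -2], 1, (50, 2)), rng.normal([2, 2], 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

def train_linear_svm(X, y, C=1.0, lr=1e-3, epochs=500):
    # sub-gradient descent on 0.5*||w||^2 + C * sum_i max(0, 1 - y_i(w.x_i + b))
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                        # points violating the margin
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = train_linear_svm(X, y, C=1.0)
acc = (np.sign(X @ w + b) == y).mean()
print("training accuracy:", acc)
```

Larger C penalizes margin violations more heavily (a harder margin); smaller C tolerates violations in exchange for a larger margin.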

What sort of optimization problem would you be solving to train a support vector machine? Maximizing the margin (best answer); concretely, a quadratic program (quadratic objective with linear constraints), solved in either the primal or the dual form.

Hyperparameters: the regularization parameter $C$, which controls how far the restriction conditions may be relaxed in a soft-margin SVM, and the parameter σ, which controls the spread of the Gaussian kernel. When a new feature is added, SVM also requires support vectors to carry the classification in that dimension.

With a boosting algorithm such as AdaBoost, learning proceeds so that the weight vectors are expressed with the smallest possible number of features. As a result, classification is performed with few features, and it is possible to analyze which features have a high contribution. If the weak learners of AdaBoost are simple thresholds (decision stumps), then weighting and selecting the learners selects features at the same time.

On the other hand, SVM attempts to express the weight vectors using the smallest possible number of cases (support vectors), making it difficult to analyze the features from the learned model. If we have enough data from all the classes, random selection from the whole data set will not greatly affect the performance of SVM. Adding a new irrelevant feature, however, will affect the performance of SVM, because it changes the feature space of the data set.

In general terms, SVMs are very good when you have a huge number of features, but you have to be careful with how your features are scaled. They work well when the data are sparse, and less well when the data are imbalanced. Unfortunately, the magic of SVM is also its biggest drawback: the complex data transformations and the resulting boundary are very difficult to interpret. SVM is also not well suited to very large data sets, since finding the separating hyperplane requires solving a quadratic programming problem, which is computationally intensive.

Kernels: linear, Gaussian (RBF), sigmoid, polynomial (order, kernel scale)

Ensemble Learning

An ensemble method is a technique that combines the predictions from multiple machine learning algorithms together to make more accurate predictions than any individual model.

• Bootstrap: any test or metric that relies on random sampling with replacement. Use the new samples to estimate statistics of the population.

Advantage: simplicity; it lets you control and check the stability of the results.

Disadvantage: the apparent simplicity may conceal the fact that important assumptions (e.g. independence of samples) are being made when undertaking the bootstrap analysis, where these would be more formally stated in other approaches.
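A minimal bootstrap sketch (the exponential data and interval choice are just for illustration): resample with replacement many times and use the percentiles of the resampled statistics as a confidence interval.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500)    # an assumed observed sample

# each resample (with replacement) yields one estimate of the statistic
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(2000)
])

# percentile bootstrap 95% confidence interval for the mean
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean {data.mean():.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```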

Disadvantages of boosting:

1. Expensive in time and computation.
2. Hard to implement on real-time platforms.
3. The complexity of classification increases.

• Bagging (bootstrap aggregation): after bootstrapping the samples, train one hypothesis on each sample and let all hypotheses vote to make a prediction

advantage: reduces variance (use unstable/complex learners with high variance and low bias, e.g. decision trees, neural networks, nearest neighbors)
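The variance-reduction effect can be shown with an idealized numeric demo (assuming, unrealistically, fully independent sub-models): averaging B independent noisy predictions of the same target divides the variance by roughly B.

```python
import numpy as np

rng = np.random.default_rng(0)
# each "sub-model" predicts the true value plus independent noise;
# bagging averages B such predictions
truth = 1.0
single = rng.normal(truth, 1.0, size=10_000)                     # one model
bagged = rng.normal(truth, 1.0, size=(10_000, 25)).mean(axis=1)  # vote of 25
print(single.var(), bagged.var())
```

In practice trees trained on overlapping bootstrap samples are correlated, so the reduction is smaller; random forests decorrelate them further by also subsampling features.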

• Bagging decision trees

When bagging with decision trees, we are less concerned about individual trees over-fitting the training data. For this reason, and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf node of the tree) and the trees are not pruned. These trees will have high variance and low bias, which are important characteristics of sub-models when combining predictions using bagging.

• Random Forests
There are two key processes in RF. The first is the bootstrap, in which classification trees are constructed concurrently by sampling the data with replacement to form new, independent training sets. The second is bagging, which combines the trees into a classification forest whose vote decides the result.
Each decision tree in the random forest is grown unpruned to its full depth. The root of every tree receives a different training subset created by the bootstrap. Each node is split using the best split among all features of the dataset (or among a randomly selected subset of features). The elements of each leaf share the same class label, and the class labels of the final leaves are the predictions for new data.

An application of the Strong Law of Large Numbers shows that the generalization error of a random forest converges as trees are added, so overfitting from growing more trees is not a problem.

Extremely randomized trees (ExtraTrees)

Variable importance

We can calculate how much the error function drops for a variable at each split point (information gain or Gini impurity for classification, mean squared error for regression).

Gini importance
Every time a node is split on variable m, the Gini impurity of the two descendant nodes is less than that of the parent node. Adding up the Gini decreases for each individual variable over all trees in the forest gives a fast variable importance that is often very consistent with the permutation importance measure.

$G = \sum_{i=1}^{n_c} p_i(1-p_i)$
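A worked instance of that formula (the class counts are invented for the example): compute the Gini impurity of a parent node and the weighted impurity decrease contributed by one split.

```python
def gini(counts):
    # G = sum_i p_i (1 - p_i) over the n_c classes at a node
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

parent = gini([40, 40])                    # maximally impure 2-class node
left, right = gini([30, 5]), gini([10, 35])
# weighted impurity decrease contributed by this split; summing these
# per variable over all trees gives the Gini importance
decrease = parent - (35 / 80) * left - (45 / 80) * right
print(parent, decrease)
```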

Boosting is a very simple algorithm compared to either neural networks or SVMs, and as a result requires significantly less time and fewer resources to train, while often outperforming them as well. Another favourable characteristic of boosting is that it seems to be resistant to over-fitting. Boosting reduces the training error exponentially fast. Weak learners have high bias; by combining them, we get more expressive classifiers. Hence, boosting is a bias-reduction technique.

AdaBoost can be derived as coordinate descent minimizing the exponential loss with respect to the weights $\alpha$.

A widely acknowledged explanation is to view this process as additive logistic regression.
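A compact AdaBoost sketch with decision stumps as the thresholding weak learners (the 1-D toy labels are chosen so a single stump cannot separate them): each round picks the best-weighted stump, computes its weight α, and re-weights the examples so mistakes count more.

```python
import numpy as np

rng = np.random.default_rng(0)
# 1-D toy data whose labels are not separable by a single threshold
x = rng.uniform(-1, 1, 200)
y = np.where(np.sin(6 * x) > 0, 1, -1)

def best_stump(x, y, w):
    # exhaustively pick the threshold/sign with the smallest weighted error
    best = (np.inf, 0.0, 1)
    for t in x:
        for s in (1, -1):
            pred = s * np.where(x > t, 1, -1)
            err = w[pred != y].sum()
            if err < best[0]:
                best = (err, t, s)
    return best

w = np.full(len(x), 1 / len(x))
stumps, alphas = [], []
for _ in range(20):                              # boosting rounds
    err, t, s = best_stump(x, y, w)
    alpha = 0.5 * np.log((1 - err) / err)        # weight of this weak learner
    pred = s * np.where(x > t, 1, -1)
    w = w * np.exp(-alpha * y * pred)            # re-weight: mistakes go up
    w = w / w.sum()
    stumps.append((t, s))
    alphas.append(alpha)

agg = sum(a * s * np.where(x > t, 1, -1) for a, (t, s) in zip(alphas, stumps))
print("training accuracy:", (np.sign(agg) == y).mean())
```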

• XGBoost

#### METRICS

metrics of binary classification

Sensitivity = Recall = True positive rate

Specificity = True negative rate

ROC Curve: The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
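A small sketch of those definitions on made-up scores: compute TPR and FPR from a confusion matrix at each threshold; sweeping the threshold from high to low traces out the ROC curve from (0, 0) to (1, 1).

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.7, 0.6])

def rates(threshold):
    pred = (scores >= threshold).astype(int)
    tp = int(((pred == 1) & (y_true == 1)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    tn = int(((pred == 0) & (y_true == 0)).sum())
    fp = int(((pred == 1) & (y_true == 0)).sum())
    tpr = tp / (tp + fn)   # sensitivity = recall = TPR
    fpr = fp / (fp + tn)   # FPR = 1 - specificity
    return fpr, tpr

# one (FPR, TPR) point per threshold, from strictest to most lenient
roc = [rates(t) for t in (0.95, 0.65, 0.35, 0.05)]
print(roc)
```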

Multiclass classification

Bias and Variance

The bias term measures the degree to which your prediction is systematically wrong; since it only depends on the average value μ of your prediction, you won’t reduce it by gathering more data in the same way. The variance gives you a measure of how well T predicts its own average value; typically the more data you have supporting T, the smaller the variance.

• typically, bias comes from not having good hypotheses in the considered class
• Variance results from the hypothesis class containing “too many” hypotheses
• MLE estimation is typically unbiased, but has high variance
• Bayesian estimation is biased, but typically has lower variance
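The MLE-vs-shrinkage contrast can be simulated directly (the Gaussian setup and the 0.8 shrinkage factor are arbitrary choices for the demo): the sample mean is unbiased with variance 1/n, while shrinking it toward zero introduces bias but lowers variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# estimate the mean of N(mu, 1) from n points, repeated over many trials
mu, n, trials = 3.0, 20, 5000
samples = rng.normal(mu, 1.0, size=(trials, n))

mle = samples.mean(axis=1)      # sample mean: unbiased, variance ~1/n
shrunk = 0.8 * mle              # shrinkage toward 0: biased, lower variance

for name, est in (("MLE", mle), ("shrunk", shrunk)):
    print(name, "bias:", round(est.mean() - mu, 3), "var:", round(est.var(), 4))
```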

Optimization Methods

• Coordinate descent
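A minimal coordinate descent sketch on an assumed convex quadratic (the matrix and vector are invented for the example): minimize over one coordinate at a time, holding the others fixed, until the iterates match the closed-form minimizer.

```python
import numpy as np

# exact coordinate descent on the convex quadratic f(w) = 0.5 w'Aw - b'w
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite
b = np.array([1.0, 1.0])

w = np.zeros(2)
for _ in range(50):                       # full sweeps over the coordinates
    for i in range(len(w)):
        # minimizing f over w[i] alone has a closed form
        w[i] = (b[i] - A[i] @ w + A[i, i] * w[i]) / A[i, i]

print(w, np.linalg.solve(A, b))           # should agree with A^{-1} b
```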