Elements of a model
 Objective
 Model structure (e.g. variables, formula, equation, parameters)
 Model assumptions
 Parameter estimates and interpretation
 Model fit (e.g. goodness-of-fit tests and statistics)
 Model selection
LDA
generative model
model p(x|y) as a multivariate Gaussian; both classes share the same covariance matrix Σ
QDA
Each class has its own Σ
Naive Bayes
generative model
Assume the x_j are conditionally independent given y
Parameters estimated by MLE, with Laplace smoothing to avoid zero probabilities
Gaussian Naive Bayes
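As a rough sketch of Gaussian Naive Bayes (function names are my own, not from any library): fit per-class priors plus per-feature means and variances by MLE, then classify by maximizing the log-posterior under the conditional-independence assumption.

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate per-class priors, feature means, and variances by MLE."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, priors, means, variances

def predict_gnb(X, classes, priors, means, variances):
    """Pick the class maximizing log p(y) + sum_j log N(x_j; mu_jc, var_jc)."""
    log_post = np.log(priors) + np.array([
        -0.5 * np.sum(np.log(2 * np.pi * variances[k])
                      + (X - means[k]) ** 2 / variances[k], axis=1)
        for k in range(len(classes))
    ]).T
    return classes[np.argmax(log_post, axis=1)]
```

Because of the independence assumption, the joint density factorizes into one 1-D Gaussian per feature, which is why only per-feature means and variances are needed.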
Decision Tree
Logistic Regression (one of GLM)
model p(y = 1 | x) as the sigmoid of the log-odds, which are assumed linear in x
Variables:

 Y: a binary response variable. Y_{i} = 1 if the trait is present in observation (person, unit, etc…) i; Y_{i} = 0 if the trait is NOT present in observation i
 X = (X_{1}, X_{2}, …, X_{k}) be a set of explanatory variables which can be discrete, continuous, or a combination. x_{i} is the observed value of the explanatory variables for observation i. In this section of the notes, we focus on a single variable X.
Model: π_{i} = P(Y_{i} = 1 | x_{i}) = exp(β_{0} + β_{1}x_{i}) / (1 + exp(β_{0} + β_{1}x_{i})), equivalently logit(π_{i}) = β_{0} + β_{1}x_{i}
Assumptions:
 The data Y_{1}, Y_{2}, …, Y_{n} are independently distributed, i.e., cases are independent.
 Distribution of Y_{i} is Bin(n_{i}, π_{i}); i.e., the binary logistic regression model assumes a binomial distribution of the response. The dependent variable does NOT need to be normally distributed, but a distribution from an exponential family is assumed (e.g. binomial, Poisson, multinomial, normal, …)
 Does NOT assume a linear relationship between the dependent variable and the independent variables, but it does assume a linear relationship between the logit of the response and the explanatory variables; logit(π) = β_{0} + βX.
 Independent (explanatory) variables can even be power terms or other nonlinear transformations of the original independent variables.
 The homogeneity of variance does NOT need to be satisfied. In fact, it is not even possible in many cases given the model structure.
 Errors need to be independent but NOT normally distributed.
 Parameters are estimated by maximum likelihood (MLE) rather than ordinary least squares (OLS), so inference relies on large-sample approximations.
 Goodness-of-fit measures rely on sufficiently large samples; a heuristic rule is that no more than 20% of the expected cell counts are less than 5.
Model Fit:
 Overall goodnessoffit statistics of the model; we will consider:
 Pearson chi-square statistic, X^{2}
 Deviance, G^{2}, and the likelihood-ratio test and statistic, ΔG^{2}
 Hosmer-Lemeshow test and statistic
 Residual analysis: Pearson, deviance, adjusted residuals, etc…
 Overdispersion
Parameter Estimation:
The maximum likelihood estimator (MLE) for (β_{0}, β_{1}) is the pair of values that maximizes:
$latex L(\beta_0, \beta_1) = \prod_{i=1}^{N} \pi_i^{y_i}(1-\pi_i)^{n_i-y_i} = \prod_{i=1}^{N} \frac{\exp\{y_i(\beta_0+\beta_1 x_i)\}}{1+\exp(\beta_0+\beta_1 x_i)}$
In general, there are no closed-form solutions, so the ML estimates are obtained using iterative algorithms such as Newton-Raphson (NR) or iteratively reweighted least squares (IRWLS).
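A minimal Newton-Raphson sketch for the Bernoulli case (n_i = 1); the data and function name below are my own illustration, not a library API:

```python
import numpy as np

def fit_logistic_newton(x, y, n_iter=25):
    """Fit (beta0, beta1) by Newton-Raphson on the Bernoulli log-likelihood."""
    X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # pi_i at the current beta
        W = p * (1.0 - p)                          # weights on the diagonal
        grad = X.T @ (y - p)                       # score vector
        hess = X.T @ (X * W[:, None])              # observed information
        beta = beta + np.linalg.solve(hess, grad)  # Newton update
    return beta
```

Each iteration solves a weighted least-squares problem, which is why this procedure is also known as IRWLS. Note that with perfectly separated data the MLE does not exist and the iterations diverge.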
SVM
The decision boundary is chosen to maximize the margin, which is defined by the support vectors: the training points closest to the boundary and hardest to tell apart.
 Functional and geometric margins
Functional margin: γ̂^{(i)} = y^{(i)}(w^T x^{(i)} + b). If y^{(i)} = 1, then for the functional margin to be large (i.e., for our prediction to be confident and correct), we need w^T x + b to be a large positive number; if y^{(i)}(w^T x^{(i)} + b) > 0, then our prediction on this example is correct.
Geometric margin: the functional margin scaled by 1/‖w‖. The point x^{(i)} − γ^{(i)} w/‖w‖ lies on the decision boundary, and all points x on the decision boundary satisfy w^T x + b = 0.
 The optimal margin classifier
 Convex quadratic objective and only linear constraints (quadratic programming)
 Lagrange duality: primal form and dual form; Lagrange multipliers; dual optimization.
 Prediction: the α_i's will all be zero except for the support vectors. Thus, many of the terms in the sum will be zero, and we only need the inner products between x and the support vectors (of which there is often only a small number) in order to make our prediction.
 Kernel: we can get SVMs to learn in the high-dimensional feature space given by φ without ever having to explicitly compute or represent the vectors φ(x); valid kernels are symmetric and positive semi-definite.
For some classification problems it is not possible to find a hyperplane (linear decision boundary) in the original space. If we project the data into a higher-dimensional space, we may find a hyperplane there that separates the classes.
Why do we not need to compute the explicit feature vectors, but only the inner products? Because both the dual optimization problem and the prediction rule depend on the data only through inner products, which a kernel lets us evaluate directly.
 Non-separable case and regularization: the parameter C controls the relative weighting between the twin goals of making ‖w‖² small (which, as seen earlier, makes the margin large) and ensuring that most examples have functional margin at least 1. The only change to the dual problem is that the constraint 0 ≤ α_i becomes 0 ≤ α_i ≤ C.
 Multiclass SVMs
What sort of optimization problem would you solve to train a support vector machine? Maximize the margin (best answer): a quadratic program, i.e. a quadratic objective with linear constraints, solved in either primal or dual form.
The hyperparameters to tune are the regularization parameter C, which controls the relaxation of the constraints in the soft-margin SVM, and the parameter σ, which controls the spread of the Gaussian kernel. With SVM, when a new feature is added, support vectors are also needed to define the boundary in that dimension.
With a boosting algorithm such as AdaBoost, learning proceeds so that the weight vector is expressed with the smallest possible number of features. As a result, classification is performed with few features, and it is possible to identify the features with the highest contribution. If the weak learners of AdaBoost are simple thresholding rules (decision stumps), then weighting and selecting the learners selects features at the same time.
SVM, on the other hand, attempts to express the weight vector using the smallest possible number of cases (the support vectors), which makes it difficult to analyze feature importance. If we have enough data from all classes, random subsampling of the training set will not greatly affect SVM performance. Adding a new irrelevant feature, however, can affect performance, because it changes the feature space of the data set.
In general terms, SVMs are very good when you have a huge number of features, but you have to be careful with how your features are scaled. They work well when data are sparse, and poorly when data are imbalanced. Unfortunately, the magic of the SVM is also its biggest drawback: the complex data transformations and the resulting boundary are very difficult to interpret. SVM is also not well suited to very large data sets, since finding the separating hyperplane requires solving a quadratic programming problem, which is computationally intensive.
Kernels: linear, Gaussian (RBF), sigmoid, polynomial (with hyperparameters such as polynomial order and kernel scale)
Ensemble Learning
An ensemble method is a technique that combines the predictions from multiple machine learning algorithms together to make more accurate predictions than any individual model.
 Bootstrap: any test or metric that relies on random sampling with replacement. Use the new samples to estimate statistics of the population.
Advantage: simplicity; control and check the stability of the results
Disadvantage: The apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis (e.g. independence of samples) where these would be more formally stated in other approaches
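A minimal bootstrap sketch, assuming we want the standard error of the sample mean (the function name and data are illustrative):

```python
import random
import statistics

def bootstrap_se(data, stat=statistics.mean, n_boot=2000, seed=0):
    """Estimate the standard error of a statistic by resampling with replacement."""
    rng = random.Random(seed)
    reps = [stat([rng.choice(data) for _ in range(len(data))])
            for _ in range(n_boot)]  # one statistic per bootstrap resample
    return statistics.stdev(reps)
```

The spread of the resampled statistics approximates the sampling variability of the statistic itself; note the resamples are only valid if the original observations are independent, which is the assumption mentioned above.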
Disadvantages of boosting:
1. Time- and computation-expensive.
2. Hard to implement on real-time platforms.
3. Complexity of the classifier increases.
 Bagging (bootstrap aggregation): after bootstrapping the samples, train one hypothesis on each sample and let all hypotheses vote to make the prediction
advantage: reduces variance (use unstable/complex learners with high variance and low bias, e.g. decision trees, neural networks, nearest neighbours)
disadvantage: may slightly increase bias, and the aggregated model is harder to interpret than a single model
 Bagging decision trees
When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason, and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf node) and are not pruned. Such trees have high variance and low bias, which are exactly the characteristics wanted in sub-models whose predictions are combined by bagging.
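The bootstrap-then-vote procedure can be sketched generically over any base learner. To keep the sketch short I use 1-nearest-neighbour as the high-variance base learner instead of a deep decision tree; all names here are my own:

```python
import numpy as np

class OneNN:
    """1-nearest-neighbour: a high-variance, low-bias base learner."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self
    def predict(self, X):
        d = ((X[:, None, :] - self.X[None, :, :]) ** 2).sum(-1)
        return self.y[np.argmin(d, axis=1)]

def bagging_predict(X_train, y_train, X_test, make_learner, n_models=25, seed=0):
    """Train each learner on a bootstrap resample, then take a majority vote."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # with replacement
        model = make_learner().fit(X_train[idx], y_train[idx])
        votes.append(model.predict(X_test))
    # majority vote per test point (binary 0/1 labels assumed)
    return (np.stack(votes).mean(axis=0) >= 0.5).astype(int)
```

Averaging the votes of many learners trained on perturbed data sets is what reduces the variance; the base learner itself is unchanged.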
 Random Forests
Two processes are central to RF. The first is the bootstrap: classification trees are constructed concurrently, each from a new training set formed by sampling the data with replacement. The second is aggregation (bagging): the trees are combined into a forest and the final result is decided by vote.
Each tree in the random forest is grown unpruned to maximal depth. The root of every tree sees a different bootstrap training set. Each node is split using the best split among a randomly selected subset of the features. The samples in each leaf share the same class label, and the class labels of the leaves determine the prediction for new data.
By the Strong Law of Large Numbers, the generalization error of the forest converges as the number of trees grows, so adding more trees does not cause overfitting.
Extremely randomized forests
Variable importance
we can calculate how much the error criterion (information gain / Gini impurity for classification, mean squared error for regression) drops for a variable at each split point
Gini importance
Every time a node is split on variable m, the Gini impurity of the two descendant nodes is less than that of the parent node. Adding up the Gini decreases for each individual variable over all trees in the forest gives a fast variable-importance measure that is often very consistent with the permutation importance measure.
$latex G = \sum_{i=1}^{n_c} p_i(1-p_i)$
 Boosting (additive model)
Boosting is a very simple algorithm compared to neural networks or SVMs, requires significantly less time and fewer resources to train, and often outperforms them as well. Another favourable characteristic of boosting is that it seems resistant to overfitting, and it drives the training error down exponentially fast. Weak learners have high bias; by combining them, we get more expressive classifiers. Hence boosting is a bias-reduction technique.
AdaBoost can be viewed as coordinate descent minimizing the exponential loss with respect to the weights α.
A widely acknowledged explanation is to view this process as additive logistic regression.
In Gradient Boosting, "shortcomings" are identified by gradients.
Discrete AdaBoost, Real AdaBoost, LogitBoost and Gentle AdaBoost
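A minimal sketch of Discrete AdaBoost with threshold stumps on 1-D data; the brute-force O(n²) stump search is for clarity only, and all names are my own:

```python
import numpy as np

def adaboost_stumps(x, y, n_rounds=10):
    """Discrete AdaBoost with decision stumps on 1-D data; labels y in {-1, +1}."""
    n = len(x)
    w = np.full(n, 1.0 / n)          # uniform initial weights
    stumps = []                       # list of (threshold, polarity, alpha)
    for _ in range(n_rounds):
        best = None
        for t in np.sort(x):          # try every threshold and both polarities
            for pol in (+1, -1):
                pred = pol * np.where(x > t, 1, -1)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, t, pol, pred)
        err, t, pol, pred = best
        err = max(err, 1e-10)                     # guard a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)     # learner weight
        w = w * np.exp(-alpha * y * pred)         # upweight misclassified points
        w = w / w.sum()
        stumps.append((t, pol, alpha))
    return stumps

def adaboost_predict(x, stumps):
    score = sum(a * p * np.where(x > t, 1, -1) for t, p, a in stumps)
    return np.sign(score)
```

Each round selects the stump (i.e. the feature threshold) with lowest weighted error, which is exactly the feature-selection behaviour described above.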
 XGBoost
METRICS
metrics of binary classification
Sensitivity = Recall = True positive rate
Specificity = True negative rate
ROC Curve: The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
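The TPR/FPR pairs behind the ROC curve can be computed directly from scores and labels; a minimal sketch (names are my own, predicting positive when the score meets the threshold):

```python
def roc_points(scores, labels, thresholds):
    """Compute (FPR, TPR) at each threshold; predict positive when score >= t."""
    P = sum(labels)            # number of actual positives (labels are 0/1)
    N = len(labels) - P        # number of actual negatives
    pts = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / N, tp / P))
    return pts
```

Sweeping the threshold from high to low traces the curve from (0, 0) toward (1, 1); a classifier is better the closer its curve hugs the top-left corner.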
Multiclass classification
Bias and Variance
The bias term measures the degree to which your prediction is systematically wrong; since it only depends on the average value μ of your prediction, you won’t reduce it by gathering more data in the same way. The variance gives you a measure of how well T predicts its own average value; typically the more data you have supporting T, the smaller the variance.
 typically, bias comes from not having good hypotheses in the considered class
 Variance results from the hypothesis class containing “too many” hypotheses
 MLE estimation is typically unbiased, but has high variance
 Bayesian estimation is biased, but typically has lower variance
Optimization Methods
 Coordinate descent
 Gradient descent
 Newton's method: if the error function is quadratic, this finds the global optimum in one step
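The one-step claim is easy to verify on a 1-D quadratic f(x) = ax² + bx with a > 0, whose minimum is at x* = −b/(2a); the function name is my own:

```python
def newton_step_quadratic(a, b, x0):
    """One Newton step on f(x) = a*x^2 + b*x (a > 0).
    f'(x) = 2*a*x + b and f''(x) = 2*a, so
    x1 = x0 - f'(x0)/f''(x0) = -b/(2*a), the exact minimizer, from any x0."""
    grad = 2 * a * x0 + b
    hess = 2 * a
    return x0 - grad / hess
```

Gradient descent on the same function would need many steps with a tuned learning rate; Newton's method uses the curvature (the Hessian) to jump straight to the optimum.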
Evaluation
Hyperparameter search: grid search, random search (sampling from a distribution)
Cross-validation, bootstrapping, stratified sampling