Elements of a model
 Objective
 Model structure (e.g. variables, formula, equation, parameters)
 Model assumptions
 Parameter estimates and interpretation
 Model fit (e.g. goodness-of-fit tests and statistics)
 Model selection
LDA
Naive Bayes
Decision Tree
Logistic Regression (a GLM)
Variables:

 Y: a binary response variable. Y_{i} = 1 if the trait is present in observation (person, unit, etc…) i; Y_{i} = 0 if the trait is NOT present in observation i
 X = (X_{1}, X_{2}, …, X_{k}) is a set of explanatory variables which can be discrete, continuous, or a combination. x_{i} is the observed value of the explanatory variables for observation i. In this section of the notes, we focus on a single variable X.
Model: π_{i} = P(Y_{i} = 1 | x_{i}), with logit(π_{i}) = log(π_{i}/(1 − π_{i})) = β_{0} + β_{1}x_{i}
Assumptions:
 The data Y_{1}, Y_{2}, …, Y_{n} are independently distributed, i.e., cases are independent.
 Distribution of Y_{i} is Bin(n_{i}, π_{i}), i.e., binary logistic regression model assumes binomial distribution of the response. The dependent variable does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal,…)
 Does NOT assume a linear relationship between the dependent variable and the independent variables, but it does assume linear relationship between the logit of the response and the explanatory variables; logit(π) = β_{0} + βX.
 Independent (explanatory) variables can even be power terms or other nonlinear transformations of the original independent variables.
 The homogeneity of variance does NOT need to be satisfied. In fact, it is not even possible in many cases given the model structure.
 Errors need to be independent but NOT normally distributed.
 It uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus relies on large-sample approximations.
 Goodness-of-fit measures rely on sufficiently large samples; a heuristic rule is that no more than 20% of the expected cell counts should be less than 5.
Model Fit:
 Overall goodness-of-fit statistics of the model; we will consider:
 Pearson chi-square statistic, X^{2}
 Deviance, G^{2}, and the likelihood-ratio test and statistic, ΔG^{2}
 Hosmer-Lemeshow test and statistic
 Residual analysis: Pearson, deviance, adjusted residuals, etc…
 Overdispersion
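The Pearson statistic in the list above can be computed directly; a minimal sketch, assuming the observed and expected counts are given per cell (the function name is mine, not from the notes):

```python
def pearson_chi_square(observed, expected):
    """Pearson goodness-of-fit statistic: X^2 = sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

Under the null hypothesis, X^{2} is compared to a chi-square distribution with the appropriate degrees of freedom.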
Parameter Estimation:
The maximum likelihood estimator (MLE) for (β_{0}, β_{1}) is obtained by finding the (β̂_{0}, β̂_{1}) that maximizes:
$latex L(\beta_0, \beta_1) = \prod_{i=1}^{N} \pi_i^{y_i}(1-\pi_i)^{n_i-y_i} = \prod_{i=1}^{N} \frac{\exp\{y_i(\beta_0+\beta_1 x_i)\}}{1+\exp(\beta_0+\beta_1 x_i)}$
In general, there is no closed-form solution, so the ML estimates are obtained using iterative algorithms such as Newton-Raphson (NR) or iteratively reweighted least squares (IRWLS).
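As a concrete illustration of the iterative fitting just described, here is a minimal Newton-Raphson sketch for the single-predictor model with binary (n_{i} = 1) responses; the function name and convergence settings are my own choices:

```python
import numpy as np

def fit_logistic_nr(x, y, n_iter=25, tol=1e-10):
    """Fit logit(pi) = b0 + b1*x by Newton-Raphson (minimal sketch).

    x: 1-D array of observed predictor values; y: 0/1 responses.
    """
    X = np.column_stack([np.ones_like(x, dtype=float), x])  # design matrix
    beta = np.zeros(2)
    for _ in range(n_iter):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))       # fitted probabilities
        W = pi * (1.0 - pi)                   # IRWLS weights
        grad = X.T @ (y - pi)                 # score vector
        hess = X.T @ (X * W[:, None])         # information matrix
        step = np.linalg.solve(hess, grad)    # Newton step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

Each Newton step solves the weighted least-squares system (X^{T}WX)δ = X^{T}(y − π), which is why the procedure is also known as IRWLS.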
 Functional and geometric margins
Functional margin: γ̂^{(i)} = y^{(i)}(w^{T}x^{(i)} + b). If y^{(i)} = 1, then for the functional margin to be large (i.e., for our prediction to be confident and correct), we need w^{T}x^{(i)} + b to be a large positive number; whenever y^{(i)}(w^{T}x^{(i)} + b) > 0, our prediction on this example is correct.
If the functional margin is zero, the point lies on the decision boundary; all points x on the decision boundary satisfy w^{T}x + b = 0.
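The functional margin itself is one line of code; a sketch assuming the ±1 label convention used here (the function name is mine):

```python
import numpy as np

def functional_margin(w, b, x, y):
    """Functional margin y*(w^T x + b); positive iff the prediction is correct."""
    return y * (np.dot(w, x) + b)
```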
 The optimal margin classifier
 Convex quadratic objective and only linear constraints (quadratic programming)
 Lagrange duality: primal form and dual form; Lagrange multipliers; dual optimization
Prediction: the α_{i}'s will all be zero except for the support vectors. Thus many of the terms in the sum above will be zero, and we only need the inner products between x and the support vectors (of which there is often only a small number) in order to make our prediction.
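The prediction rule just described can be sketched as follows, assuming the dual solution (the nonzero α_{i}'s, their support vectors and labels, and the intercept b) is already available; the names are illustrative:

```python
import numpy as np

def svm_decision(alphas, sv_x, sv_y, b, x):
    """Decision value f(x) = sum_i alpha_i * y_i * <x_i, x> + b.

    Only the support vectors (alpha_i > 0) need to be stored; the sign
    of the returned value is the predicted class.
    """
    return sum(a * y * np.dot(xi, x)
               for a, xi, y in zip(alphas, sv_x, sv_y)) + b
```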
 Kernel: we can get SVMs to learn in the high-dimensional feature space given by φ without ever having to explicitly find or represent vectors φ(x); valid kernels are symmetric and positive semi-definite.
 Non-separable case and regularization: the parameter C controls the relative weighting between the twin goals of making ||w||^{2} small (which we saw earlier makes the margin large) and of ensuring that most examples have functional margin at least 1. The only change to the dual problem is that what was originally the constraint 0 ≤ α_{i} becomes 0 ≤ α_{i} ≤ C.
 Multiclass SVMs
What sort of optimization problem would you be solving to train a support vector machine? Maximizing the margin (best answer); a quadratic program, i.e., a quadratic objective with linear constraints; any reference to solving the primal or dual form.
The regularization parameter C indicates the relaxation of the constraints in the soft-margin SVM, and the parameter σ indicates the spread of the Gaussian kernel distribution. With SVM, when a new feature is added, classification in that dimension also requires support vectors (SVs).
With a boosting algorithm such as AdaBoost, learning proceeds so that weight vectors are expressed with the smallest possible number of features. As a result, classification is performed with few features, and it is possible to analyze which features have a high contribution rate. If we consider that the weak learner of AdaBoost is just thresholding, then when we weight and select the learners we are selecting the features at the same time.
SVM, on the other hand, attempts to express the weight vector using the smallest possible number of cases, making it difficult to analyze the features from the learned model. If we have enough data from all the classes, random selection from the whole data set will not greatly affect SVM's performance. Adding a new irrelevant feature, however, will affect performance, because it changes the feature space of the data set.
In general terms, SVMs are very good when you have a huge number of features, but you have to be careful with how your features are scaled. They work well when data are sparse and poorly when data are imbalanced. Unfortunately, the magic of the SVM is also its biggest drawback: the complex data transformations and resulting boundary are very difficult to interpret. SVMs are also not well suited to very large data sets, since finding the separating hyperplane requires solving a quadratic programming problem, which is computationally intensive.
Kernels: linear, Gaussian, sigmoid, polynomial (order, kernel scale)
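The kernels in that list can be written out directly; a minimal sketch with commonly used parameterizations (the hyperparameter names σ, degree, etc. follow convention, not the notes):

```python
import numpy as np

def linear_kernel(u, v):
    """K(u, v) = <u, v>."""
    return np.dot(u, v)

def gaussian_kernel(u, v, sigma=1.0):
    """RBF kernel; sigma is the spread of the Gaussian."""
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def polynomial_kernel(u, v, degree=3, c=1.0):
    """K(u, v) = (<u, v> + c)^degree."""
    return (np.dot(u, v) + c) ** degree

def sigmoid_kernel(u, v, alpha=1.0, c=0.0):
    """K(u, v) = tanh(alpha * <u, v> + c)."""
    return np.tanh(alpha * np.dot(u, v) + c)
```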
Ensemble Learning
An ensemble method is a technique that combines the predictions from multiple machine learning algorithms together to make more accurate predictions than any individual model.
 Bootstrap: any test or metric that relies on random sampling with replacement. Use the new samples to estimate statistics of the population.
Advantage: simplicity; control and check the stability of the results
Disadvantage: the apparent simplicity may conceal the fact that important assumptions are being made in the bootstrap analysis (e.g. independence of samples), whereas these would be stated more formally in other approaches.
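A minimal pure-Python bootstrap sketch for the mean (the choice of statistic and the function name are mine; any statistic can be substituted):

```python
import random

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Resample `data` with replacement and collect the mean of each resample.

    The returned list approximates the sampling distribution of the mean;
    standard errors or percentile intervals can be read off from it.
    """
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        sample = [data[rng.randrange(n)] for _ in range(n)]  # with replacement
        means.append(sum(sample) / n)
    return means
```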
 Bagging (bootstrap aggregation): after bootstrapping the samples, every hypothesis gets a vote in making the prediction
advantage: reduces variance (use unstable/complex learners with high variance and low bias, e.g. decision trees, neural networks, nearest neighbors)
disadvantage: can slightly increase bias
 Bagging decision trees
When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason, and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf node) and are not pruned. Such trees have both high variance and low bias, which are the important characteristics of sub-models when combining predictions by bagging.
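The bagging procedure can be sketched generically; a minimal pure-Python version in which `train` (a hypothetical interface, not from the notes) maps a list of (x, y) pairs to a predictor:

```python
import random
from collections import Counter

def bagging_fit(train, data, n_models=25, seed=0):
    """Fit `train` on bootstrap resamples of `data`; predict by majority vote.

    data: list of (x, y) pairs; train: callable mapping such a list
    to a predictor x -> label.
    """
    rng = random.Random(seed)
    n = len(data)
    models = [train([data[rng.randrange(n)] for _ in range(n)])
              for _ in range(n_models)]

    def predict(x):
        votes = Counter(m(x) for m in models)  # one vote per model
        return votes.most_common(1)[0][0]
    return predict
```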
 Random Forests
Two processes are central to RF. The first is the bootstrap: classification trees are constructed concurrently, each from a new training set formed independently by sampling the data with replacement. The second is bagging: the individual trees are combined into a classification forest and the result is decided by vote.
Each decision tree in the random forest is grown unpruned to its largest extent. The root of every tree consists of a different new training subset created by the bootstrap. Each node is split using the best split among all features of the data set (or among a randomly selected subset of the features). Every element of a leaf has the same class label, and the class labels of the final leaves give the prediction for new data.
Use of the Strong Law of Large Numbers shows that the generalization error always converges as trees are added, so overfitting is not a problem.
Extremely randomized forests
Variable importance
We can calculate how much the error function drops for a variable at each split point (information gain / Gini impurity for classification, or mean squared error for regression).
Gini importance
Every time a split of a node is made on variable m, the Gini impurity criterion for the two descendant nodes is less than that of the parent node. Adding up the Gini decreases for each individual variable over all trees in the forest gives a fast variable importance measure that is often very consistent with the permutation importance measure.
$latex G = \sum_{i=1}^{n_c} p_i(1-p_i)$
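The impurity formula above, computed from a list of class labels (the function name is mine):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity G = sum over classes of p_i * (1 - p_i)."""
    n = len(labels)
    counts = Counter(labels)
    return sum((c / n) * (1 - c / n) for c in counts.values())
```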
 Boosting (additive model)
Boosting is a very simple algorithm compared to either neural networks or SVMs, and as a result it requires significantly less time and fewer resources to train, while often outperforming them as well. Another favourable characteristic of boosting is that it seems resistant to overfitting. Boosting reduces the training error exponentially fast. Weak learners have high bias; by combining them we obtain more expressive classifiers. Hence, boosting is a bias-reduction technique.
AdaBoost can be viewed as coordinate descent minimizing the exponential loss with respect to the weights.
A widely acknowledged explanation is to view this process as additive logistic regression.
In Gradient Boosting, “shortcomings” are identified by gradients.
Discrete AdaBoost, Real AdaBoost, LogitBoost and Gentle AdaBoost
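A minimal discrete-AdaBoost sketch using the thresholding weak learners mentioned above (1-D decision stumps; all names and the defaults are illustrative):

```python
import math

def adaboost_fit(xs, ys, n_rounds=10):
    """Discrete AdaBoost with 1-D threshold stumps sign*(x > t).

    xs: list of floats; ys: list of +/-1 labels. Returns x -> +/-1.
    """
    n = len(xs)
    w = [1.0 / n] * n                       # example weights, start uniform
    ensemble = []                           # (alpha, threshold, polarity)
    for _ in range(n_rounds):
        # pick the stump minimizing the weighted error
        best = None
        for t in sorted(set(xs)):
            for pol in (1, -1):
                err = sum(wi for wi, x, y in zip(w, xs, ys)
                          if (pol if x > t else -pol) != y)
                if best is None or err < best[0]:
                    best = (err, t, pol)
        err, t, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)     # guard log endpoints
        alpha = 0.5 * math.log((1 - err) / err)   # learner weight
        ensemble.append((alpha, t, pol))
        # re-weight: misclassified examples get heavier
        w = [wi * math.exp(-alpha * y * (pol if x > t else -pol))
             for wi, x, y in zip(w, xs, ys)]
        s = sum(w)
        w = [wi / s for wi in w]

    def predict(x):
        score = sum(a * (p if x > t else -p) for a, t, p in ensemble)
        return 1 if score >= 0 else -1
    return predict
```

Each round the weights of misclassified examples grow, so later stumps focus on the hard cases; the final classifier is the α-weighted vote.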
Deep Learning
METRICS
metrics of binary classification
Sensitivity = Recall = True positive rate
Specificity = True negative rate
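These two metrics computed from 0/1-coded labels (a sketch; the function name is mine):

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity (recall / true positive rate) and specificity (true negative rate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)   # fraction of actual positives found
    specificity = tn / (tn + fp)   # fraction of actual negatives found
    return sensitivity, specificity
```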
Multiclass classification
Regression (multiple and multivariate)
Clustering and other dimension reduction
Optimization Methods
 Coordinate descent
 Gradient descent