
Classification

Elements of a model

  • Objective
  • Model structure (e.g. variables, formula, equation, parameters)
  • Model assumptions
  • Parameter estimates and interpretation
  • Model fit (e.g. goodness-of-fit tests and statistics)
  • Model selection

LDA

Naive Bayes

Decision Tree

Logistic Regression (a GLM)

Variables:

  • Y: a binary response variable. Yi = 1 if the trait is present in observation (person, unit, etc.) i; Yi = 0 if the trait is NOT present in observation i.
  • X = (X1, X2, …, Xk): a set of explanatory variables, which can be discrete, continuous, or a combination. xi is the observed value of the explanatory variables for observation i. In this section of the notes, we focus on a single variable X.

Model:

$latex \pi_i = \Pr(Y_i = 1 \mid x_i), \qquad \text{logit}(\pi_i) = \log\frac{\pi_i}{1-\pi_i} = \beta_0 + \beta_1 x_i$

Assumptions:

  • The data Y1, Y2, …, Yn are independently distributed, i.e., cases are independent.
  • Distribution of Yi is Bin(ni, πi), i.e., the binary logistic regression model assumes a binomial distribution of the response. The dependent variable does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal, …).
  • Does NOT assume a linear relationship between the dependent variable and the independent variables, but it does assume a linear relationship between the logit of the response and the explanatory variables: logit(π) = β0 + βX.
  • Independent (explanatory) variables can even be power terms or other nonlinear transformations of the original independent variables.
  • The homogeneity of variance does NOT need to be satisfied. In fact, it is not even possible in many cases given the model structure.
  • Errors need to be independent but NOT normally distributed.
  • It uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus relies on large-sample approximations.
  • Goodness-of-fit measures rely on sufficiently large samples; a heuristic rule is that no more than 20% of the expected cell counts are less than 5.

Model Fit:

  • Overall goodness-of-fit statistics of the model (see the sketch after this list); we will consider:
    1. Pearson chi-square statistic, X²
    2. Deviance, G², and the likelihood-ratio test statistic, ΔG²
    3. Hosmer-Lemeshow test and statistic
  • Residual analysis: Pearson, deviance, adjusted residuals, etc…
  • Overdispersion
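
As a concrete illustration of the first two statistics, here is a minimal NumPy sketch (my own example; the group sizes, counts, and fitted probabilities are made up, and pi_hat would normally come from a fitted logistic model):

```python
import numpy as np

# Hypothetical grouped data (assumed for illustration): n_i trials, y_i successes
# per covariate pattern, and fitted probabilities pi_hat from some logistic fit.
n = np.array([20.0, 25.0, 30.0, 22.0])
y = np.array([3.0, 8.0, 17.0, 15.0])
pi_hat = np.array([0.18, 0.35, 0.55, 0.70])

fitted = n * pi_hat                      # expected number of successes per group

# Pearson chi-square: sum of squared Pearson residuals
X2 = np.sum((y - fitted) ** 2 / (n * pi_hat * (1 - pi_hat)))

# Deviance G^2 = 2 * sum[ y*log(y/fitted) + (n-y)*log((n-y)/(n-fitted)) ]
# (a term contributes 0 whenever its observed count is 0)
def term(obs, exp):
    obs = np.asarray(obs, dtype=float)
    out = np.zeros_like(obs)
    nz = obs > 0
    out[nz] = obs[nz] * np.log(obs[nz] / exp[nz])
    return out

G2 = 2 * np.sum(term(y, fitted) + term(n - y, n - fitted))

print(f"Pearson X^2 = {X2:.3f}, deviance G^2 = {G2:.3f}")
```

Both statistics would then be compared against a chi-square distribution with the appropriate residual degrees of freedom.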

Parameter Estimation:

The maximum likelihood estimator (MLE) for (β0, β1) is obtained by finding the values $latex (\hat\beta_0, \hat\beta_1)$ that maximize:

$latex L(\beta_0,\beta_1)=\prod_{i=1}^{N}\pi_i^{y_i}(1-\pi_i)^{n_i-y_i}=\prod_{i=1}^{N}\frac{\exp\{y_i(\beta_0+\beta_1 x_i)\}}{1+\exp(\beta_0+\beta_1 x_i)}$

In general, there are no closed-form solutions, so the ML estimates are obtained by using iterative algorithms such as Newton-Raphson (NR) or iteratively re-weighted least squares (IRWLS), as sketched below.
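
A rough sketch of that iteration on simulated data (my own illustration, not taken from a library): for ungrouped data (ni = 1), each Newton-Raphson step solves (XᵀWX) step = Xᵀ(y − π), which is exactly the IRWLS update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (assumed for illustration): one predictor plus an intercept
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])        # design matrix rows [1, x_i]
true_beta = np.array([-0.5, 1.2])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))

beta = np.zeros(2)
for _ in range(25):
    pi = 1 / (1 + np.exp(-X @ beta))             # current fitted probabilities
    W = pi * (1 - pi)                            # binomial variance weights
    grad = X.T @ (y - pi)                        # score vector
    hess = X.T @ (X * W[:, None])                # information matrix X'WX
    step = np.linalg.solve(hess, grad)
    beta = beta + step                           # Newton-Raphson update
    if np.max(np.abs(step)) < 1e-8:              # stop when the update is tiny
        break

print("MLE (beta0_hat, beta1_hat):", beta)
```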

Support Vector Machines

  • Functional and geometric margins

Functional margin: if y(i) = 1, then for the functional margin to be large (i.e., for our prediction to be confident and correct), we need w^T x + b to be a large positive number; if y(i)(w^T x + b) > 0, then our prediction on this example is correct.


Geometric margin: the projection of a training point onto the separating hyperplane lies on the decision boundary, and all points x on the decision boundary satisfy the equation w^T x + b = 0.


  • The optimal margin classifier
  1. Convex quadratic objective with only linear constraints (a quadratic program)
  2. Lagrange duality: primal form and dual form; Lagrange multipliers; the dual optimization problem

    Prediction: the αi's will all be zero except for the support vectors. Thus, many of the terms in the sum above will be zero, and we really need to find only the inner products between x and the support vectors (of which there is often only a small number) in order to make our prediction.

  • Kernel: we can get SVMs to learn in the high-dimensional feature space given by φ, but without ever having to explicitly find or represent vectors φ(x); valid kernels are symmetric and positive semi-definite.
  • Non-separable case and regularization: the parameter C controls the relative weighting between the twin goals of making ||w||² small (which we saw earlier makes the margin large) and of ensuring that most examples have functional margin at least 1. The only change to the dual problem is that what was originally the constraint 0 ≤ αi has become 0 ≤ αi ≤ C.
  • Multi-class SVMs

What sort of optimization problem would you be solving to train a support vector machine? Maximizing the margin (best answer); a quadratic program; a quadratic objective with linear constraints; a reference to solving the primal or dual form.

The main hyperparameters are the regularization parameter C, which indicates how much the restriction conditions are relaxed in the soft-margin SVM, and the parameter σ, which indicates the spread of the Gaussian kernel. With an SVM, when a new feature is added, a support vector (SV) is also required for classification in that dimension.

With a boosting algorithm such as AdaBoost, learning proceeds so that the weight vectors are expressed with the smallest possible number of features. As a result, classification is performed with few features, and it is possible to analyze which features have a high contribution rate. If we consider that the weak learner of AdaBoost is just a thresholding rule, then when we weight and select the learners, we are selecting features at the same time.

On the other hand, an SVM attempts to express the weight vector using the smallest possible number of cases, making it difficult to perform analysis of the features from the learning space. If we have enough data from all the classes, then randomly subsampling the whole data set will not greatly affect the performance of the SVM. Adding a new irrelevant feature, however, will affect the performance of the SVM, because it changes the feature space of the data set.

In general terms, SVMs work very well when you have a huge number of features, although you have to be careful about how your features are scaled. They are good when data are sparse, and not good when data are imbalanced. Unfortunately, the magic of the SVM is also its biggest drawback: the complex data transformations and the resulting boundary are very difficult to interpret. SVMs are also not well suited to very large data sets, since finding the separating hyperplane requires solving a quadratic programming problem, which is computationally intensive.

Kernels: linear, Gaussian (RBF), sigmoid, polynomial (order, kernel scale)
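
A minimal scikit-learn sketch (my own example; assumes scikit-learn is installed) showing the C and kernel choices above, with feature scaling included since, as noted, SVMs are sensitive to how features are scaled:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (for illustration only)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Soft-margin SVM with a Gaussian (RBF) kernel: C relaxes the margin constraints,
# gamma plays the role of 1 / (2 * sigma^2), i.e. the spread of the kernel.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print("support vectors per class:", clf.named_steps["svc"].n_support_)
```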

Ensemble Learning

An ensemble method is a technique that combines the predictions from multiple machine learning algorithms together to make more accurate predictions than any individual model.

  • Bootstrap: any test or metric that relies on random sampling with replacement. Use the new samples to estimate statistics of the population.

Advantage: simplicity; it lets you control and check the stability of the results.

Disadvantage: the apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis (e.g. independence of samples), whereas these would be stated more formally in other approaches.
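
A quick NumPy sketch of the idea (my own illustration; the data are simulated and assumed independent): resample with replacement many times and use the resamples to estimate the variability of a statistic, here the mean:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=100)    # an observed sample (simulated here)

B = 2000                                       # number of bootstrap resamples
boot_means = np.empty(B)
for b in range(B):
    # sample WITH replacement, same size as the original sample
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means[b] = resample.mean()

# Bootstrap standard error and a simple 95% percentile interval for the mean
print("estimate:", data.mean())
print("bootstrap SE:", boot_means.std(ddof=1))
print("95% percentile CI:", np.percentile(boot_means, [2.5, 97.5]))
```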


  • Bagging (bootstrap aggregation): after bootstrapping the samples, fit a model to each bootstrap sample and let all of the hypotheses have a vote to make a prediction.

   Advantage: reduces variance (use unstable/complex learners with high variance and low bias, e.g. decision trees, ANNs, nearest neighbors).

   Disadvantage: may increase bias.

  • Bagging decision trees

When bagging with decision trees, we are less concerned about individual trees over-fitting the training data. For this reason, and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf node of the tree) and the trees are not pruned. These trees will have both high variance and low bias. These are important characteristics of sub-models when combining predictions using bagging.
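
A short scikit-learn sketch of this (my own example): deep, unpruned trees as the high-variance, low-bias sub-models, combined by bagging:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Deep, unpruned trees (max_depth=None): high variance, low bias.
# Bagging trains each tree on a bootstrap sample and averages their votes,
# which reduces the variance of the combined prediction.
bagged = BaggingClassifier(
    DecisionTreeClassifier(max_depth=None),   # base tree, passed positionally
    n_estimators=100,
    bootstrap=True,
    random_state=0,
)

single = DecisionTreeClassifier(max_depth=None, random_state=0)
print("single tree  CV accuracy:", cross_val_score(single, X, y, cv=5).mean())
print("bagged trees CV accuracy:", cross_val_score(bagged, X, y, cv=5).mean())
```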

  • Random Forests
    There are two main processes in RF. The first step is the bootstrap, where the classification trees are constructed concurrently, with the data randomly sampled from the dataset with replacement to form new training sets independently. The next step is bagging (aggregation), which combines the trees into a classification forest whose result is decided by vote.
    Each decision tree in the random forest is grown unpruned to its largest size. The root of every tree is a different new training subset created by the bootstrap. Each node on the branches is split using the best split among all features of the dataset (or a randomly selected subset of features). The elements of each leaf have the same class label, and the class labels of the final leaves give the prediction for new data.

Use of the Strong Law of Large Numbers shows that they always converge so that
overfitting is not a problem.
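
A minimal scikit-learn sketch of those two steps (my own example): a bootstrap sample per tree, unpruned trees, and a random subset of features tried at each split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,        # many unpruned trees; adding more does not overfit
    max_depth=None,          # grow each tree to its largest size
    max_features="sqrt",     # random subset of features tried at each split
    bootstrap=True,          # each tree sees its own bootstrap sample
    random_state=0,
)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```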

Extremely randomized forests


Variable importance

We can calculate how much the error function drops for a variable at each split point (information gain / Gini impurity for classification, or mean squared error for regression).

Gini importance
Every time a split of a node is made on variable m, the Gini impurity criterion for the two descendent nodes is less than that of the parent node. Adding up the Gini decreases for each individual variable over all trees in the forest gives a fast variable importance measure that is often very consistent with the permutation importance measure.

$latex G = \sum_{i=1}^{n_c} p_i(1-p_i)$
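
A small sketch of the formula above and of the resulting importance scores (my own example; scikit-learn's feature_importances_ is the impurity-based, mean-decrease-in-Gini measure):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def gini_impurity(labels):
    """G = sum_i p_i * (1 - p_i) over the class proportions p_i at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

print(gini_impurity([0, 0, 1, 1]))   # 0.5: maximally impure two-class node
print(gini_impurity([1, 1, 1, 1]))   # 0.0: pure node

# The mean decrease in Gini impurity, averaged over all trees and weighted by
# node size, is what a fitted forest reports as feature_importances_.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(rf.feature_importances_)
```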


Boosting is a very simple algorithm when compared to either neural networks or SVMs, and as a result it requires significantly fewer resources and less time to train, and often outperforms them as well. Another favourable characteristic of boosting is that it seems to be resistant to over-fitting. Boosting reduces the training error exponentially fast. Weak learners have high bias; by combining them, we get more expressive classifiers. Hence, boosting is a bias-reduction technique.

AdaBoost can be viewed as coordinate descent on the exponential loss with respect to the weights $latex \alpha$.

A widely acknowledged explanation is to view this process as additive logistic regression.

In Gradient Boosting, “shortcomings” are identified by gradients.

Variants: Discrete AdaBoost, Real AdaBoost, LogitBoost, and Gentle AdaBoost.
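
A short scikit-learn sketch of AdaBoost with decision stumps, i.e. the simple thresholding weak learners mentioned above (my own example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Decision stumps (depth-1 trees) are simple thresholding rules: high bias, low variance.
# AdaBoost reweights the training examples and combines the stumps to reduce that bias.
stump = DecisionTreeClassifier(max_depth=1)
boosted = AdaBoostClassifier(stump, n_estimators=200, learning_rate=1.0,
                             random_state=0)
boosted.fit(X_train, y_train)

print("single stump accuracy:", stump.fit(X_train, y_train).score(X_test, y_test))
print("boosted stumps accuracy:", boosted.score(X_test, y_test))
```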

Deep Learning

METRICS

Metrics of binary classification


Sensitivity = Recall = True positive rate = TP / (TP + FN)

Specificity = True negative rate = TN / (TN + FP)
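
A minimal sketch computing these two rates from a confusion matrix (my own example with made-up labels):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for a binary classifier
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels ordered (0, 1)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # recall / true positive rate
specificity = tn / (tn + fp)   # true negative rate
print(f"sensitivity (recall/TPR) = {sensitivity:.2f}")
print(f"specificity (TNR)        = {specificity:.2f}")
```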

Multiclass classification

Regression (multiple and multivariate)

Clustering and other dimension reduction

Optimization Methods

  • Coordinate descent
  • Gradient descent
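
A tiny NumPy sketch of both (my own illustration) on a least-squares objective: gradient descent moves every coordinate along the negative gradient, while coordinate descent exactly minimizes over one coordinate at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
b = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

def loss(w):
    return 0.5 * np.sum((A @ w - b) ** 2)

# Gradient descent: w <- w - eta * grad f(w)
w = np.zeros(3)
eta = 1.0 / np.linalg.norm(A, 2) ** 2           # safe step size for this quadratic
for _ in range(500):
    w = w - eta * (A.T @ (A @ w - b))
print("gradient descent:  ", w, "loss:", loss(w))

# Coordinate descent: exactly minimize over one coordinate j at a time
w = np.zeros(3)
for _ in range(100):
    for j in range(3):
        r = b - A @ w + A[:, j] * w[j]          # residual ignoring coordinate j
        w[j] = (A[:, j] @ r) / (A[:, j] @ A[:, j])
print("coordinate descent:", w, "loss:", loss(w))
```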
