# Feature Engineering

### Data Pre-processing (Transformation)

Normalization

sigmoid normalization

0-1 normalization ((x - min(x)) / (max(x) - min(x)))

z-score

Gaussian normalization (Gaussian kernel)

Box-Cox transformation

log transformation
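A minimal sketch of these transformations on a toy vector (the data and variable names are illustrative; NumPy assumed):

```python
import numpy as np

x = np.array([1.0, 5.0, 10.0, 20.0, 100.0])

# 0-1 (min-max) normalization: rescale into the [0, 1] interval
minmax = (x - x.min()) / (x.max() - x.min())

# z-score: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

# sigmoid normalization: squash the z-scores into (0, 1)
sig = 1.0 / (1.0 + np.exp(-z))

# log transformation: compresses right-skewed positive data
# (Box-Cox generalizes this with a tunable power parameter lambda)
logged = np.log1p(x)
```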

### Feature Engineering

image

speech

text

time series: entropy, approximate entropy, sample entropy
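A hedged sketch of sample entropy for a 1-D series (the defaults m=2, r=0.2 are common illustrative choices, not canonical; the helper name is an assumption):

```python
import numpy as np

def sample_entropy(x, m=2, r=0.2):
    """Sample entropy: -log of the ratio of (m+1)-length to m-length
    template matches within tolerance r * std(x) (Chebyshev distance)."""
    x = np.asarray(x, float)
    tol = r * x.std()

    def count(mm):
        # all overlapping templates of length mm
        templ = np.array([x[i:i + mm] for i in range(len(x) - mm)])
        d = np.abs(templ[:, None] - templ[None]).max(axis=2)
        return (d <= tol).sum() - len(templ)   # exclude self-matches

    return -np.log(count(m + 1) / count(m))

t = np.linspace(0, 8 * np.pi, 300)
regular = np.sin(t)                                  # predictable signal
noise = np.random.default_rng(5).normal(size=300)    # unpredictable signal
```

A regular signal yields low sample entropy; white noise yields a much higher value, which is why these quantities work as time-series features.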

### Data Visualization

1. Statistics
2. Histogram
3. Density estimation: kernel density estimation (Parzen–Rosenblatt window)
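A hedged sketch of a Gaussian Parzen-window density estimate (the `parzen_kde` helper and the bandwidth are illustrative, not a library API):

```python
import numpy as np

def parzen_kde(samples, query, bandwidth=0.5):
    """Gaussian Parzen-window density estimate at each query point:
    (1 / (n * h)) * sum_i K((q - x_i) / h)."""
    samples = np.asarray(samples)[None, :]   # shape (1, n)
    query = np.asarray(query)[:, None]       # shape (m, 1)
    z = (query - samples) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

data = np.random.default_rng(0).normal(size=500)
grid = np.linspace(-3, 3, 7)
density = parzen_kde(data, grid)   # highest near the mode at 0
```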

### Feature Selection

How to choose a proper feature selection method for your data? Start with simple methods before complicated ones, and with linear methods before non-linear ones.
Combinations of individually good features do not necessarily lead to good classification performance: "the m best features are not the best m features."

Similarity Measure

Euclidean distance

Cosine distance

Gaussian distance

Mahalanobis distance

The Mahalanobis distance is a multi-dimensional generalization of measuring how many standard deviations away a point P is from the mean of a distribution D. It transforms the random vector into a zero-mean vector with an identity covariance matrix; in that space, the Euclidean distance can be safely applied. It can be used to identify outliers, i.e., data points far from the bulk of the distribution. If we consider one feature at a time (multivariate reduced to univariate), the covariance matrix reduces to a diagonal matrix. We can then rank the features by the distance, and delete one feature at a time to identify the best combination of features by investigating how the metric changes.
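A minimal sketch of Mahalanobis-based outlier scoring (the 3.0 cut-off and the toy data are illustrative assumptions):

```python
import numpy as np

def mahalanobis(X, mu, cov):
    """Mahalanobis distance of each row of X from mean mu under covariance cov."""
    diff = X - mu
    inv = np.linalg.inv(cov)
    # quadratic form diff @ inv @ diff^T, row by row
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, inv, diff))

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=200)
d = mahalanobis(X, X.mean(axis=0), np.cov(X, rowvar=False))
outliers = np.where(d > 3.0)[0]   # points more than 3 "standard deviations" out
```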

Statistical Tests

Hypothesis testing to check whether a feature differs significantly between classes: t-test
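A small sketch of the test statistic (Welch's unequal-variance form; the toy feature values are made up). The p-value would then come from the t distribution, e.g. via `scipy.stats.ttest_ind`:

```python
import numpy as np

def welch_t(a, b):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(va + vb)

# one feature's values in each class (illustrative numbers)
class0 = [2.1, 2.4, 1.9, 2.2, 2.0]
class1 = [3.0, 3.3, 2.9, 3.2, 3.1]
t = welch_t(class0, class1)   # large |t| -> the feature separates the classes
```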

Filter Methods

• Correlation
• F-statistics
• Mutual information: MI is more general; it measures how similar the joint distribution p(X,Y) is to the product of the factored marginal distributions p(X)p(Y). I(i) is a measure of dependency between the density of variable xi and the density of the target y. Intuitively, mutual information measures the information that X and Y share: how much knowing one of these variables reduces uncertainty about the other. I(X;Y) = 0 if and only if X and Y are independent random variables; moreover, mutual information is non-negative (I(X;Y) ≥ 0) and symmetric (I(X;Y) = I(Y;X)). In the usual Venn diagram, the area covered by both circles is the joint entropy H(X,Y); the left circle is the individual entropy H(X), with the conditional entropy H(X|Y) outside the overlap; the right circle is H(Y), with H(Y|X) outside the overlap; the overlap itself is the mutual information I(X;Y), the amount of uncertainty in Y removed by knowing X. To estimate MI of continuous variables: data discretization, or a density estimation method (e.g., Parzen windows with a Gaussian window).
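The discretization estimate can be sketched as follows (the bin count of 8 is an arbitrary assumption):

```python
import numpy as np

def mutual_info(x, y, bins=8):
    """Estimate I(X;Y) in nats from a 2-D histogram of the samples."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                        # joint distribution p(X,Y)
    px = pxy.sum(axis=1, keepdims=True)     # marginal p(X)
    py = pxy.sum(axis=0, keepdims=True)     # marginal p(Y)
    nz = pxy > 0                            # avoid log(0)
    # KL divergence between the joint and the product of marginals
    return (pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum()

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
dep = x + 0.1 * rng.normal(size=2000)   # strongly dependent on x
indep = rng.normal(size=2000)           # independent of x
```

`mutual_info(x, dep)` comes out far larger than `mutual_info(x, indep)`, matching the independence property above.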

• MRMR (Minimum Redundancy Maximum Relevance)

Maximal relevance: select the features with the highest relevance to the target class c; relevance is usually characterized in terms of correlation or mutual information. Minimal redundancy: the selected features should be mutually as dissimilar as possible, since redundant features add little new information. Combining the two (mRMR) yields a ranking list of all the features, to which a wrapper method can then be applied.
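A hedged sketch of the greedy mRMR ranking, using absolute Pearson correlation as a stand-in for both relevance and redundancy (the helper name and toy data are illustrative):

```python
import numpy as np

def mrmr_rank(X, y, k):
    """Greedy mRMR: repeatedly pick the feature maximizing
    relevance-to-target minus mean redundancy with already-chosen features."""
    n_feat = X.shape[1]
    rel = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_feat)])
    selected = [int(rel.argmax())]
    remaining = set(range(n_feat)) - {selected[0]}
    while remaining and len(selected) < k:
        def score(j):
            red = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                           for s in selected])
            return rel[j] - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
y = rng.normal(size=500)
X = np.column_stack([y + 0.1 * rng.normal(size=500),   # relevant
                     y + 0.1 * rng.normal(size=500),   # relevant but redundant
                     rng.normal(size=500)])            # irrelevant
order = mrmr_rank(X, y, k=3)   # a full ranking for a downstream wrapper
```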

Linear Methods

FDA (Fisher’s discriminant analysis)

Tree-Based Methods (embedded)

AdaBoost with a tree stump (variable importance is measured by how much error the variable reduces each time it is used in a tree split/branch), CART, BART (tree model), random forest
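A minimal sketch of stump-based importance: score each feature by the accuracy of its best single-threshold stump, the weak learner AdaBoost with tree stumps would use (thresholds and toy data are illustrative assumptions):

```python
import numpy as np

def stump_importance(X, y):
    """Score each feature by the best single-threshold split accuracy."""
    scores = []
    for j in range(X.shape[1]):
        col = X[:, j]
        best = 0.0
        for t in np.quantile(col, np.linspace(0.1, 0.9, 9)):
            pred = (col > t).astype(int)
            # a stump may predict either class on either side of the split
            acc = max((pred == y).mean(), (pred != y).mean())
            best = max(best, acc)
        scores.append(best)
    return np.array(scores)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
X = np.column_stack([y + 0.3 * rng.normal(size=400),  # informative feature
                     rng.normal(size=400)])           # pure noise
imp = stump_importance(X, y)   # informative feature scores higher
```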

Univariate methods, linear models with regularization, random forests for feature selection, stability selection, recursive feature elimination

Greedy Selection (wrapper)

Greedily select features using the performance of a reliable classifier, combined with data partitioning (subsampling).
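A small sketch of greedy forward selection with a held-out split, using a nearest-centroid classifier as the cheap "reliable classifier" stand-in (helper names and data are illustrative):

```python
import numpy as np

def centroid_acc(Xtr, ytr, Xva, yva):
    """Held-out accuracy of a nearest-centroid classifier."""
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xva - c1, axis=1)
            < np.linalg.norm(Xva - c0, axis=1)).astype(int)
    return (pred == yva).mean()

def forward_select(X, y, n_keep):
    """Greedy wrapper: add the feature that most improves held-out accuracy."""
    idx = np.random.default_rng(0).permutation(len(y))
    tr, va = idx[: len(y) // 2], idx[len(y) // 2:]
    chosen = []
    for _ in range(n_keep):
        pool = [j for j in range(X.shape[1]) if j not in chosen]
        best = max(pool, key=lambda j: centroid_acc(
            X[np.ix_(tr, chosen + [j])], y[tr],
            X[np.ix_(va, chosen + [j])], y[va]))
        chosen.append(best)
    return chosen

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 200)
X = np.column_stack([y + 0.3 * rng.normal(size=400),  # informative
                     rng.normal(size=400)])           # pure noise
chosen = forward_select(X, y, n_keep=1)   # picks the informative feature
```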

Dimension Reduction Methods (covered in representation)

Dimension reduction method, which will change the feature space

Regularization/sparsity (embedded)

Ensemble Feature Selection

• Harmony search
• Data Reliability Based Feature Selection: a feature is considered reliable (or relevant) if its values are tightly grouped together.
• Stability selection:  apply a feature selection algorithm on different subsets of data and with different subsets of features (bootstrap)
• Boosting
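The stability selection bullet above can be sketched as follows, scoring each feature by how often it ranks in the top k by |correlation| across bootstrap resamples (all parameters are illustrative assumptions):

```python
import numpy as np

def stability_scores(X, y, n_boot=50, top_k=2, seed=0):
    """Fraction of bootstrap resamples in which each feature ranks
    in the top_k by |correlation| with the target."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap resample
        rel = np.array([abs(np.corrcoef(X[idx, j], y[idx])[0, 1])
                        for j in range(X.shape[1])])
        counts[np.argsort(rel)[-top_k:]] += 1
    return counts / n_boot

rng = np.random.default_rng(3)
y = rng.normal(size=300)
X = np.column_stack([y + 0.2 * rng.normal(size=300),  # stable, relevant
                     y + 0.2 * rng.normal(size=300),  # stable, relevant
                     rng.normal(size=300),            # noise
                     rng.normal(size=300)])           # noise
scores = stability_scores(X, y)   # near 1 for relevant, near 0 for noise
```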

Categorical features

Random Permutation
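A hedged sketch of random-permutation importance: shuffle one feature column at a time and record the drop in the model's score (the fixed linear `score_fn` here stands in for a trained model and is purely illustrative):

```python
import numpy as np

def permutation_importance(score_fn, X, y, n_rounds=10, seed=0):
    """Mean drop in score when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    base = score_fn(X, y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_rounds):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])            # break the feature-target link
            drops[j] += base - score_fn(Xp, y)
    return drops / n_rounds

# toy "model": a fixed linear score, standing in for a fitted classifier
def score_fn(X, y):
    return np.corrcoef(X[:, 0] + 0.1 * X[:, 1], y)[0, 1]

rng = np.random.default_rng(4)
y = rng.normal(size=500)
X = np.column_stack([y + 0.1 * rng.normal(size=500),  # drives the score
                     rng.normal(size=500)])           # nearly irrelevant
drops = permutation_importance(score_fn, X, y)
```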

## Imbalanced Data

Under-Sampling

• Condensed Nearest-Neighbor
• One-sided Selection

Over-Sampling

• SMOTE: first find the k nearest neighbours within the minority class for each sample in that class, then draw a line between the sample and a neighbour and generate random synthetic points on the line.
• ADASYN: after creating such samples, add small random values to the points, making them more realistic. In other words, instead of all samples being linearly correlated with the parent, they have a little more variance, i.e., they are a bit scattered.
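A minimal sketch of the SMOTE interpolation step described above (the `smote` helper is illustrative, not the imbalanced-learn API):

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points, each interpolated between
    a random minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]           # skip the point itself
        j = rng.choice(nbrs)
        lam = rng.random()                      # random position on the line
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

minority = np.random.default_rng(1).normal(size=(20, 2))
synth = smote(minority, n_new=40)   # synthetic points lie on minority segments
```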

Cross Validation for imbalanced data

Ensemble different resampled datasets: when dealing with an imbalanced data set, create multiple balanced data sets from the original via sampling, then evaluate feature subsets with an ensemble of base classifiers, each trained on one balanced data set. The sampling ratio can also be varied.

Cluster the majority class: instead of relying on random samples to cover the variety of the training samples, cluster the abundant class into r groups, with r being the number of cases in the rare class. For each group, keep only the medoid (centre of cluster). The model is then trained on the rare class and the medoids only.
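The medoid-based under-sampling above can be sketched with a tiny k-means (the helper name, r, and iteration count are illustrative assumptions):

```python
import numpy as np

def cluster_undersample(X_maj, r, n_iter=20, seed=0):
    """Replace the majority class by r cluster medoids: run k-means,
    then keep, per cluster, the member closest to the centroid."""
    rng = np.random.default_rng(seed)
    centers = X_maj[rng.choice(len(X_maj), size=r, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X_maj[:, None] - centers[None], axis=2)
        labels = np.argmin(dists, axis=1)
        for c in range(r):
            if (labels == c).any():
                centers[c] = X_maj[labels == c].mean(axis=0)
    medoids = []
    for c in range(r):
        members = X_maj[labels == c]
        if len(members):
            medoids.append(
                members[np.argmin(np.linalg.norm(members - centers[c], axis=1))])
    return np.array(medoids)

majority = np.random.default_rng(2).normal(size=(300, 2))
reduced = cluster_undersample(majority, r=30)   # 300 points -> ~30 medoids
```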

Under-sampling, over-sampling, increasing minority samples and decreasing majority samples simultaneously, synthesizing "new" samples from the minority class, bootstrap

GAN for data augmentation

## Automated Feature Engineering

DFS (Deep Feature Synthesis)