data science · machine learning

Deep Learning – NLP

https://zhuanlan.zhihu.com/p/49271699 Home https://jalammar.github.io/ Neural Language Model: predict the next word; replaces HMM. RNN, LSTM: different architectures. Stateful LSTM: memorizes the states from the last batch, so batches are dependent. Stateless LSTM: parameters are updated on batch one; for batch two, the hidden states and cell states are re-initialized to zero, so different batches are independent. Word2Vec: CBOW, skip-grams | GloVe (cannot solve the… Continue reading Deep Learning – NLP
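
The stateful versus stateless distinction in the excerpt above can be illustrated with a toy recurrence. This is only a sketch: a single scalar `tanh` update stands in for a real LSTM cell, and the names `run_batch`, `run_batches`, `w_in`, and `w_rec` are hypothetical, chosen for illustration.

```python
import math


def run_batch(inputs, h0, w_in=0.5, w_rec=0.9):
    """Run a toy scalar recurrence over one batch and return the final state.

    A simplified stand-in for an LSTM cell: h_t = tanh(w_in * x_t + w_rec * h_{t-1}).
    The weights are arbitrary illustrative values.
    """
    h = h0
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
    return h


def run_batches(batches, stateful):
    """Return the final hidden state after each batch.

    stateful=True  : the state at the end of one batch seeds the next batch.
    stateful=False : the state is reset to zero before every batch.
    """
    h = 0.0
    finals = []
    for batch in batches:
        if not stateful:
            h = 0.0  # stateless: forget everything from the previous batch
        h = run_batch(batch, h)
        finals.append(h)
    return finals


batches = [[1.0, 1.0], [0.0, 0.0]]
stateful_states = run_batches(batches, stateful=True)
stateless_states = run_batches(batches, stateful=False)
```

With these inputs, both modes agree on the first batch. On the second batch (all zeros) the stateless run ends at exactly zero, while the stateful run still carries a nonzero trace of the first batch.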

data science · machine learning

Interpretability of ML

https://github.com/jphall663/awesome-machine-learning-interpretability https://christophm.github.io/interpretable-ml-book/index.html Global Interpretability: Partial Dependence and Partial Dependence Plot (PDP); Individual Conditional Expectation (ICE); total and two-way H statistics; global feature importance using permutation; global surrogate model. Local Interpretability: Local Interpretable Model-agnostic Explanations (LIME); Shapley additive explanation. An intuitive way to understand the Shapley value is the following illustration: The feature values enter a… Continue reading Interpretability of ML

data science · machine learning

Classification

Elements of a model: objective; model structure (e.g. variables, formula, equation, parameters); model assumptions; parameter estimates and interpretation; model fit (e.g. goodness-of-fit tests and statistics); model selection. LDA: generative model; models p(x|y) as a multivariate Gaussian; both classes have the same covariance matrix Σ. QDA: each class has its own Σ. Naive Bayes: generative model. Assume the xj… Continue reading Classification

Big Data · data science · machine learning · programming

Apache Hadoop (projects)

QUESTIONS: setInputFormat; comparator; top k frequent words. HADOOP SYSTEM: Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. HDFS (Hadoop Distributed File System): data storage (data split and data replication). MapReduce (data processing): how to leverage job; how do nodes communicate; how to deal with node… Continue reading Apache Hadoop (projects)

data science · machine learning

Time Series Analysis

TIME SERIES BASICS Difference between regression and time series: observations in a time series are not necessarily independent and not necessarily identically distributed. They are lists of observations where the ordering matters. Ordering is very important because there is dependency, and changing the order could change the meaning of the data. Characteristics: Is there a trend, on average, the… Continue reading Time Series Analysis
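
One way to see why ordering matters is to compare the lag-1 autocorrelation of a series with that of the same values in a different order. The sketch below is illustrative only: the helper name `autocorrelation`, the toy trending series, and the particular reordering are assumptions, not taken from the original post.

```python
def autocorrelation(xs, lag=1):
    """Sample autocorrelation of xs at the given lag.

    Uses the common estimator: lagged cross-products of deviations from the
    mean, divided by the total sum of squared deviations.
    """
    n = len(xs)
    mean = sum(xs) / n
    denom = sum((x - mean) ** 2 for x in xs)
    num = sum((xs[t] - mean) * (xs[t + lag] - mean) for t in range(n - lag))
    return num / denom


# A toy series with a clear upward trend.
trend = list(range(20))

# The same 20 values, reordered to alternate low and high: 0, 19, 1, 18, ...
reordered = []
for i in range(10):
    reordered += [trend[i], trend[-1 - i]]

acf_trend = autocorrelation(trend)          # strongly positive
acf_reordered = autocorrelation(reordered)  # negative
```

Both lists have the same mean and variance, so any summary that ignores order treats them as identical. Only the dependence between neighbouring observations differs, which is exactly what time series methods model.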

data science · machine learning

Stats and Probability Theory

How to choose a statistical model? Are my data normally distributed? Problems: Excess kurtosis (fourth moment, very big tails, due to extreme values away from the mean); excess skewness (third moment, lopsided). Others: lognormal (a RV whose logarithm is normally distributed), uniform, Weibull, exponential… Routine: Histogram (largely depends on the bin size); stem and leaf plots… Continue reading Stats and Probability Theory
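
The skewness and excess kurtosis named above can be computed directly from central moments. The following is a minimal sketch in plain Python, assuming the population (biased) moment estimators; the function names are illustrative and do not come from the original post.

```python
def central_moment(xs, k):
    """k-th central moment of xs (population form, divides by n)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** k for x in xs) / n


def skewness(xs):
    """Third standardized moment: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def excess_kurtosis(xs):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 10]  # one extreme value far above the mean

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_tailed)
kurt_symmetric = excess_kurtosis(symmetric)
```

A symmetric sample gives skewness zero, while a single extreme value on the right pushes it positive. Values near zero for both statistics are consistent with normality but do not prove it, so these numbers are best read alongside the plots listed in the routine.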

data science · machine learning

When we talk about data science, what we talk about

One and a half years ago, I did not know what logistic regression was. Now, I love machine learning and data science, and have decided to delve into it for my future career. I can still remember the first time I audited a machine learning class at Harvard. Without any basic knowledge in algorithms, I still found… Continue reading When we talk about data science, what we talk about