QUESTIONS: setInputFormat; comparator; top k frequent words. HADOOP SYSTEM: Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. HDFS (Hadoop Distributed File System): data storage (data splitting and data replication). MapReduce (data processing): how to leverage job; how do nodes communicate; how to deal with node… Continue reading Apache Hadoop (projects)

# Time Series Analysis

TIME SERIES BASICS Difference between regression and time series: the observations in a time series are not necessarily independent and not necessarily identically distributed. They are lists of observations where the ordering matters. Ordering is important because the observations depend on one another, so changing the order could change the meaning of the data. Characteristics: Is there a trend, on average, the… Continue reading Time Series Analysis
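To illustrate why ordering matters, here is a minimal sketch that compares the lag-1 autocorrelation of a trending series with the same values in a different order. The function name `lag1_autocorrelation` and the sample data are assumptions made for illustration and do not come from the original post.

```python
from statistics import mean


def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation: how strongly each value relates to the previous one."""
    m = mean(series)
    # Sum of products of consecutive deviations from the mean.
    num = sum((series[t] - m) * (series[t - 1] - m) for t in range(1, len(series)))
    # Total sum of squared deviations.
    den = sum((v - m) ** 2 for v in series)
    return num / den


# Hypothetical data: the same eight values in two different orders.
trending = [1, 2, 3, 4, 5, 6, 7, 8]
reordered = [5, 1, 8, 3, 6, 2, 7, 4]

print(lag1_autocorrelation(trending))   # positive: neighbours move together
print(lag1_autocorrelation(reordered))  # negative: the dependence structure changed
```

Both lists have the same mean and variance, so any summary that ignores ordering cannot tell them apart; only an order-aware statistic does.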

machine learning

# Regression

Simple Linear Regression Objective: model the expected value of a continuous variable, Y, as a linear function of the continuous predictor, X: E(Yi) = β0 + β1xi. Model structure: Yi = β0 + β1xi + εi. Model assumptions: The mean of the response, E(Yi), at each value of the predictor, xi, is a linear function… Continue reading Regression
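As a rough sketch of the model structure Yi = β0 + β1xi + εi, the snippet below estimates β0 and β1 by ordinary least squares. The use of least squares, the function name `fit_simple_linear_regression`, and the sample data are assumptions for illustration; the excerpt itself does not say how the coefficients are estimated.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Return least-squares estimates (b0, b1) for y = b0 + b1 * x + error."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squared deviations of the predictor.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    # Sum of cross-deviations of predictor and response.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx            # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1


# Hypothetical noise-free data generated from y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
b0, b1 = fit_simple_linear_regression(x, y)
print(b0, b1)
```

With real data the errors εi are not zero, so the estimates would only approximate the underlying coefficients.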

# Stats and Probability Theory

How to choose a statistical model? Are my data normally distributed? Problems: Excess kurtosis (fourth moment, very big tails, due to extreme values away from the mean) Excess skewness (third moment, lopsided) Others: lognormal (a RV whose logarithm is normally distributed), uniform, Weibull, exponential… Routine: Histogram (largely depends on the bin size) Stem and leaf plots… Continue reading Stats and Probability Theory
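The skewness and excess kurtosis mentioned above can be computed directly from the central moments. The following is a minimal sketch using the population (biased) moment formulas; the function names and sample data are assumptions for illustration, and statistical packages often apply small-sample corrections that this sketch omits.

```python
from statistics import mean


def central_moment(data, k):
    """k-th central moment, dividing by n (population form)."""
    m = mean(data)
    return sum((v - m) ** k for v in data) / len(data)


def skewness(data):
    """Third standardized moment; 0 for a symmetric sample."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0


# Hypothetical samples: one symmetric, one with a single large value on the right.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

Values far from zero on either measure suggest the normality assumption deserves a closer look, for example with the histogram mentioned in the routine.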

machine learning

# Representation Learning

From the perspective of statistics, many of the methods discussed below can be considered multivariate analysis methods. From a machine learning standpoint, I would refer to them as dimension reduction/unsupervised learning methods, even though not all of them are typically used as dimension reduction techniques. PRINCIPAL COMPONENT ANALYSIS (PCA) https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues project the original data… Continue reading Representation Learning