
Apache Hadoop (projects)


  • setInputFormat
  • comparator
  • top k frequent words


Apache Hadoop is an open-source software framework for distributed storage and large-scale processing of data sets on clusters of commodity hardware.

  • HDFS (Hadoop Distributed File System): data storage (data splitting and data replication)
  • Map Reduce (data processing): how to distribute the job, how nodes communicate, and how to handle node failure are all managed by the framework. All we care about is input and output.

· Mapper: processes the small chunks (input splits) in parallel

· Reducer: combines the intermediate data from the Mappers
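The Mapper/shuffle/Reducer flow can be sketched as a toy in-memory simulation (plain Python, not the Hadoop Java API; the `shuffle` function stands in for Hadoop's sort-and-group phase, and the sample lines are made up):

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word in one input line.
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    # Group values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Combine the mapper outputs for one key.
    return key, sum(values)

lines = ["hadoop stores data", "hadoop processes data"]
pairs = [kv for line in lines for kv in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```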



  • HBase: an open source, non-relational, distributed database



Docker is a container platform: it packages an application into a standardized unit (a container) that shares the host operating system's kernel. A container is not the same as a virtual machine, which ships an entire guest operating system.




hdfs dfs -ls /

hdfs dfs -mkdir /input

hdfs dfs -put ….txt /input

hdfs dfs -ls /input

hdfs dfs -rm -r /output


  • Text: a Writable class that wraps a Java String (a serializable object that is automatically serialized and deserialized when transferred, which is fast and compact)
  • LongWritable: a Writable class that wraps a Java long
  • IntWritable: a Writable class that wraps a Java int (supports read and write)
  • Context: lets a Mapper/Reducer interact with the rest of the system (emit output, read configuration, update counters)
  • setup method: called once per map/reduce task, before any input is processed
  • Configuration: all the arguments passed from the command line are stored in the Configuration (analogous to system properties in Java)
  • DBWritable: objects that are read from/written to a database
  • DoubleWritable: a Writable class that wraps a Java double
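Why Writables are fast and small can be illustrated outside of Hadoop: a long is written as a fixed 8-byte binary value rather than a variable-length decimal string. A Python sketch of the idea (Hadoop's own LongWritable does the equivalent via DataOutput.writeLong):

```python
import struct

def write_long(value):
    # LongWritable-style encoding: a fixed 8-byte big-endian long,
    # cheaper to ship and parse than a decimal string.
    return struct.pack(">q", value)

def read_long(data):
    return struct.unpack(">q", data)[0]

encoded = write_long(123456789)
print(len(encoded), read_long(encoded))  # 8 123456789
```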


  • Language Model: the N-Gram model predicts the following words based on the preceding N-Gram
  • N-Gram: a contiguous sequence of n items from a given sequence of text or speech
  • Work flow: pre-process the raw data (read each document line by line, remove non-alphabetical elements)
  • First Map Reduce job: build the N-Gram model library

· Mapper:

input: data line by line as (offset of the line in the whole document, the line text); pre-process each line, then output all existing i-Gram phrases (1 < i <= N) with repetition.

· Reducer: count the occurrences of each unique phrase

  • Second Map Reduce job:

· Mapper: emits <key, value> pairs; key: the starting phrase; value: the following word with its count

· Reducer: for each starting phrase, choose the top K following words by probability/frequency
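The two jobs can be simulated in memory in a few lines (a toy Python sketch; the corpus, n and k below are invented for illustration):

```python
from collections import Counter, defaultdict
import heapq
import re

def preprocess(line):
    # Keep alphabetical characters only, lower-case, split into words.
    return re.sub(r"[^a-z ]", " ", line.lower()).split()

def build_ngrams(lines, n):
    # "Job 1": count every i-gram phrase for 2 <= i <= n.
    counts = Counter()
    for line in lines:
        words = preprocess(line)
        for i in range(2, n + 1):
            for j in range(len(words) - i + 1):
                counts[tuple(words[j:j + i])] += 1
    return counts

def top_k_following(counts, k):
    # "Job 2": key = starting phrase, value = (count, following word);
    # keep the k most frequent followers per phrase.
    followers = defaultdict(list)
    for phrase, count in counts.items():
        followers[phrase[:-1]].append((count, phrase[-1]))
    return {p: [w for _, w in heapq.nlargest(k, f)]
            for p, f in followers.items()}

corpus = ["I love big data", "I love hadoop", "I like hadoop"]
model = top_k_following(build_ngrams(corpus, 2), k=1)
print(model[("i",)])  # ['love'] -- the most frequent word after "i"
```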


  • Use an “enum” counter when we know the counter names in advance: context.getCounter(SomeEnum.VALUE).increment(1);
  • Use a dynamic (group, name) counter otherwise: context.getCounter(“wordcount”, word).increment(1);

HDFS (Hadoop Distributed File System)

A file system stores raw data, which makes it different from a database (a database stores processed data and supports queries). HDFS is a fault-tolerant file system designed to run on inexpensive hardware.

  • Split and replication (easier and faster to process data on multiple machines)


HDFS uses a master-slave model. The master does not transport data, because it would become a bottleneck. The master (the brain) decides which slave node to read from or write to; the client then talks to that slave node directly. Slave nodes store the file blocks.

The HDFS client divides a data file into blocks.

The master stores where to read and write, i.e. the metadata (which blocks make up each file, and where the blocks are).

If one slave node fails, data replication helps to avoid data loss (fault tolerance).

We have a checkpoint node that copies all the metadata from the master node every hour, so if the master node fails, we restart the system and read the metadata from the checkpoint node.

  • Write (internal data queue)
  • Read: read from the closest slave node, even though the data blocks on each slave node are the same.
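The split-and-replicate idea can be sketched as a toy simulation (the node names and round-robin placement below are invented for illustration; real HDFS placement is rack-aware):

```python
def split_into_blocks(data, block_size):
    # The HDFS client splits a file into fixed-size blocks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=3):
    # Toy round-robin placement across slave nodes.
    metadata = {}  # what the master stores: block index -> node list
    for b, _ in enumerate(blocks):
        metadata[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return metadata

blocks = split_into_blocks(b"x" * 300, block_size=128)
meta = place_replicas(blocks, nodes=["s1", "s2", "s3", "s4"])
print(len(blocks), meta[0])  # 3 ['s1', 's2', 's3']
```

If any one of s1, s2, s3 fails, block 0 can still be read from the surviving replicas, which is the fault tolerance described above.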


PageRank

A website with a higher PageRank passes a higher weight to the pages it links to.

  • Quantitative assumption: transition matrix
  • Qualitative assumption: page rank (initialization)
  • Edge Case
  • Work flow

two Map Reduce jobs:

  1. First map reduce job
  • Mapper 1

input: transition relations between web pages (from … to …,…,…)

output: from … to … key-value pairs, where the value is the relation (transition probability)

  • Mapper 2

input: page rank.txt

output: initial page rank values

  • Reducer: unit multiplication

input: output from the two mappers (a column in the transition matrix and the previous page rank)

output: the tag of a “to” website, part of its updated page rank

2. The second map reduce job

  • Mapper: Read file generated from last MR
  • Reducer: Summation
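The whole two-job iteration collapses to a few lines when simulated in memory (the three-page link graph below is hypothetical, and no damping factor is applied, matching the workflow above):

```python
from collections import defaultdict

# Hypothetical input: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A", "B"]}
pagerank = {p: 1 / len(links) for p in links}  # uniform initialization

for _ in range(20):
    partial = defaultdict(float)
    # Job 1 reducer: unit multiplication of transition prob * old rank.
    for src, outs in links.items():
        for dst in outs:
            partial[dst] += pagerank[src] / len(outs)
    # Job 2 reducer: the summation is folded into the defaultdict above.
    pagerank = dict(partial)

print(max(pagerank, key=pagerank.get))  # C collects the most rank here
```

Total rank is conserved each iteration because every page distributes its full score across its out-links.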



Top K Frequent Words

Use a hash map to pre-process the data: get each word and its frequency.

Use a Priority Queue (a min-heap of size K) to get the Top K.
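A minimal single-node sketch of the hash map + min-heap approach (pure Python; the word list is made up):

```python
import heapq
from collections import Counter

def top_k_frequent(words, k):
    # Hash map for frequencies, then a size-k min-heap:
    # O(n) to count, O(n log k) to keep the k largest.
    freq = Counter(words)
    heap = []
    for word, count in freq.items():
        heapq.heappush(heap, (count, word))
        if len(heap) > k:
            heapq.heappop(heap)  # evict the current minimum
    return [w for _, w in sorted(heap, reverse=True)]

words = ["a", "b", "a", "c", "b", "a"]
print(top_k_frequent(words, 2))  # ['a', 'b']
```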

  • If the data set is small, we can process it on a single node: Pair(String key, int frequency), private Comparator<Pair> pairComparator
    Time complexity: O(n + n log k) with a size-k min-heap
    Space complexity: O(n + k)
  • With 10T of data, we cannot process it on a single node, so we process it on multiple nodes:
    ● Cut the data into small files, dividing by hash value
    ● Dispatch the small files to different machines (nodes)
    ● Get the top k from each machine

    ● Calculate the overall top k by combining the per-machine top k results

  • If there are multiple machines and each machine already holds its own data files, then we need to rehash (so that every word lands on exactly one machine)
  • Realtime Top K with low QPS (queries per second): use a TreeMap to replace the PQ
    ● ordered by value
    ● supports all the functions of a PQ
    ● supports find and delete by key
  • Realtime Top K with high QPS: if one key is too hot, the write QPS is very heavy on one node, so we use a cache to balance accuracy against latency: buffer writes for several seconds instead of updating the TreeMap directly, then read the data in from the cache and delete the low-frequency words.
  • Approximate Top K algorithms

key = hash value (flexible)

Bloom-filter-style counting (essentially a Count-Min Sketch): use three hash functions, and for each key report the lowest of its counts across the hash maps

  • Use Map Reduce
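The "several hash functions, take the lowest count" structure can be sketched as a Count-Min Sketch (the width, depth, and md5-based hashing below are illustrative choices, not a reference implementation):

```python
import hashlib

class CountMinSketch:
    # Approximate counting: depth hash rows of fixed width; the
    # reported count is the minimum over the rows, so collisions
    # can only cause over-estimates, never under-estimates.
    def __init__(self, width=1024, depth=3):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        h = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, key):
        for row in range(self.depth):
            self.rows[row][self._index(key, row)] += 1

    def count(self, key):
        return min(self.rows[row][self._index(key, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for w in ["spark"] * 5 + ["hive"] * 2:
    cms.add(w)
print(cms.count("spark"))  # at least 5; exact unless rows collide
```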


Recommender System (Collaborative Filtering)

User Collaborative Filtering: a form of collaborative filtering based on the similarity between users, calculated using those users' ratings of items.

Item Collaborative Filtering: a form of collaborative filtering based on the similarity between items, calculated using people's ratings of those items. (# items << # users; items change less frequently, which lowers the amount of calculation; it uses the user's own historical data, so the results are more convincing.)

  • Data processing (divide data by user): user ID; movie ID; ratings
  • Co-occurrence matrix: a matrix whose entry (i, j) counts how often items i and j occur together; here, how many people watched both movie i and movie j.
  1. First map reduce job: divide data by user
  • Mapper 1:

input: one line of raw data (user ID, movie ID, rating)

output: key = user ID, value = movie ID: rating

  • Reducer:

input: key = user ID, value = iterable of movie ID: rating

output: user ID with its merged list of movie ID: rating pairs

2. The second map reduce job: build co-occurrence matrix

  • Mapper:

input: user ID, movie ID, ratings

output: for each user, movie 1: movie 2, value = 1 (user watches both movie 1 and movie 2)

  • Reducer:

input: movie1:movie2, value = iterable<1, 1, 1, …>

output: for each pair of movies, how many people watched both (summation): movie1:movie2, value = relation

3. The third MR job: normalize the co-occurrence matrix

row summation, division, and transposition

  • mapper

input: movieA:movieB \t relation

output: key = movieA, value = movieB = relation

  • reducer

input: key = movieA, value = movieB = relation

output: normalized

4. The fourth MR job: matrix multiplication (co-occurrence matrix and rating matrix)


5. The fifth MR job: Summation
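The five jobs above can be simulated end-to-end in memory (a toy sketch: the ratings table and movie IDs are invented, and each commented step stands in for one MR job):

```python
from collections import defaultdict

# Job 1 output (hypothetical): user ID -> {movie ID: rating}
ratings = {
    "u1": {"m1": 5.0, "m2": 4.0},
    "u2": {"m1": 4.0, "m2": 5.0, "m3": 3.0},
    "u3": {"m2": 4.0, "m3": 4.0},
}

# Job 2: co-occurrence matrix (how many users watched both movies).
cooc = defaultdict(lambda: defaultdict(int))
for user_movies in ratings.values():
    for a in user_movies:
        for b in user_movies:
            cooc[a][b] += 1

# Job 3: normalize each row so the relations sum to 1.
norm = {a: {b: c / sum(row.values()) for b, c in row.items()}
        for a, row in cooc.items()}

# Jobs 4 + 5: multiply the relations by the user's rating vector and
# sum the partial scores into one recommendation score per movie.
def scores(user):
    s = defaultdict(float)
    for movie, rating in ratings[user].items():
        for other, rel in norm[movie].items():
            s[other] += rel * rating
    return dict(s)

u1 = scores("u1")
unseen = {m: v for m, v in u1.items() if m not in ratings["u1"]}
print(max(unseen, key=unseen.get))  # m3 is the only unseen movie for u1
```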
