Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.
- HDFS (Hadoop Distributed File System): data storage (data splitting and data replication)
- MapReduce (data processing): handles how jobs are distributed, how nodes communicate, and how node failures are dealt with. All we care about is the input and output.
· Mapper: splits the input into small chunks and processes each in parallel
· Reducer: combines the intermediate data emitted by the Mappers
- HBase: an open source, non-relational, distributed database
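The Mapper/Reducer flow above can be sketched in plain Java with the classic word-count job (a minimal sketch with no Hadoop dependency; the class and method names are illustrative, not Hadoop's API): the map phase emits (word, 1) pairs, the framework groups them by key (shuffle), and the reduce phase sums the counts per word.

```java
import java.util.*;

// Plain-Java sketch of the MapReduce word-count flow (no Hadoop dependency).
public class WordCountSketch {
    // Map phase: split each line into words and emit (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Shuffle + reduce phase: group the emitted values by key and sum them.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : List.of("the quick brown fox", "the lazy dog")) {
            emitted.addAll(map(line));
        }
        // prints {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
        System.out.println(reduce(emitted));
    }
}
```

In real Hadoop the shuffle is done by the framework between the map and reduce tasks; here it is folded into `reduce` for brevity.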
WHAT IS DOCKER?
Docker packages our application and its dependencies into a standardized unit (a container) that shares the host operating system kernel. A container is not the same as a virtual machine, which ships a full guest operating system.
HADOOP LINUX COMMANDS
hdfs dfs -ls /
hdfs dfs -mkdir /input
hdfs dfs -put ….txt /input
hdfs dfs -ls /input
hdfs dfs -rm -r /output
HADOOP VARIABLE TYPES
- Text: a Writable class that wraps a java String (a serializable object that is automatically serialized and deserialized when transferred between nodes; fast and small in size)
- LongWritable: a Writable class that wraps a java long
- IntWritable: a Writable class that wraps a java int (supports read and write)
- Setup method: called only once per task, when the mapper/reducer is initialized
- Configuration: all the arguments passed from the command line are stored in the Configuration (similar to system properties in java)
- DBWritable: objects that are read from/written to a database
- DoubleWritable: a Writable class that wraps a java double
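What a Writable wrapper actually does can be sketched with plain java.io (this uses DataOutputStream/DataInputStream rather than Hadoop's org.apache.hadoop.io interfaces, and the class and method names are illustrative): the value writes itself to a binary stream and reads itself back, which is why these types are fast and compact on the wire.

```java
import java.io.*;

// Sketch of the IntWritable/Text read-write idea using only java.io.
public class WritableSketch {
    // Serialize an (int, String) pair, like IntWritable.write + Text.write.
    static byte[] writePair(int count, String word) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(count);   // fixed 4 bytes, like IntWritable
        out.writeUTF(word);    // length-prefixed bytes, like Text
        return buf.toByteArray();
    }

    // Deserialize the pair back, like readFields(in).
    static Object[] readPair(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        int count = in.readInt();
        String word = in.readUTF();
        return new Object[] { count, word };
    }

    public static void main(String[] args) throws IOException {
        Object[] pair = readPair(writePair(42, "hadoop"));
        System.out.println(pair[0] + " " + pair[1]); // prints 42 hadoop
    }
}
```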
- Language Model: the N-Gram model predicts the next word or phrase based on the preceding (n-1) words
- N-Gram: an n-gram is a contiguous sequence of n items from a given sequence of text or speech
- Work flow:
· Pre-process the raw data: read each document line by line and remove non-alphabetical elements.
- First MapReduce job: build the N-Gram model library
· Mapper: input is (offset of a line in the whole document, the text of the line); pre-process the line and output all existing i-gram phrases (1 < i <= N) with repetition.
· Reducer: count the occurrences of each phrase
- Second MapReduce job:
· Mapper: <key, value>, key: starting phrase; value: following word with its count
· Reducer: choose the top K following words by probability/frequency
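The two jobs above can be condensed into a single-node sketch (a plain-Java illustration under the assumption of a small corpus; the names are hypothetical, not the author's code): first count every i-gram phrase with repetition, then pick the most frequent word following a starting phrase.

```java
import java.util.*;

// Single-node sketch of the two N-Gram MapReduce jobs.
public class NGramSketch {
    // Job 1: emit every i-gram (1 < i <= n) with repetition and count it.
    static Map<String, Integer> buildLibrary(String text, int n) {
        String[] words = text.toLowerCase().split("\\W+");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < words.length; i++) {
            StringBuilder phrase = new StringBuilder(words[i]);
            for (int j = i + 1; j < words.length && j - i < n; j++) {
                phrase.append(' ').append(words[j]);
                counts.merge(phrase.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Job 2: for a starting phrase, return the top-k single following words,
    // using a min-heap of size k ordered by count.
    static List<String> topKFollowing(Map<String, Integer> library, String start, int k) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Integer> e : library.entrySet()) {
            // keep only phrases "start + exactly one more word"
            if (e.getKey().startsWith(start + " ")
                    && e.getKey().indexOf(' ', start.length() + 1) < 0) {
                heap.offer(e);
                if (heap.size() > k) heap.poll(); // evict the least frequent
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty())
            result.add(0, heap.poll().getKey().substring(start.length() + 1));
        return result; // most frequent first
    }

    public static void main(String[] args) {
        Map<String, Integer> lib =
            buildLibrary("i like data i like data i like hadoop", 3);
        System.out.println(topKFollowing(lib, "like", 1)); // prints [data]
    }
}
```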
HOW TO DEBUG
- Use an "enum" counter when we know the categories in the input data: context.getCounter(SomeEnum.CATEGORY).increment(1);
- Otherwise use a dynamic counter: context.getCounter("wordcount", word).increment(1);
HDFS (Hadoop Distributed File System)
A file system stores raw data, which is different from a database (a database stores processed data and supports queries). HDFS is a fault-tolerant file system designed to run on inexpensive hardware.
- Split and replication (easier and faster to process data on multiple machines)
HDFS uses a master-slave model. The master does not transport data itself, because that would make it a bottleneck. The master (the brain) decides which slave node to read from or write to, and then the client talks to that slave node directly. Slave nodes store the file blocks.
The HDFS client divides a data file into blocks.
The master stores the metadata needed for reads and writes: which blocks each file consists of and where each block is located.
If one slave node fails, data replication helps avoid data loss (fault tolerance).
A checkpoint node copies the metadata from the master node periodically (e.g. every hour), so if the master node fails, we restart the system and read the metadata back from the checkpoint node.
- Write: the client buffers data in an internal data queue, and blocks are replicated to the slave nodes
- Read: read from the closest slave node, even though the data blocks on each slave node are the same
PAGERANK
A website with a higher PageRank passes a higher weight to the pages it links to.
- Quantitative assumption: transition matrix
- Qualitative assumption: PageRank (initialization)
- Edge cases: e.g. dead ends and spider traps
- Work flow
Two MapReduce jobs:
1. The first MapReduce job
- Mapper 1
input: transition relations between web pages (from … to …, …, …)
output: "from → to" key-value pairs, where the value is the relation (transition probability)
- Mapper 2
input: pagerank.txt
output: initial PageRank values
- Reducer: unit multiplication (multiply one transition-matrix cell by the corresponding previous PageRank value)
input: the output of the two mappers (a column of the transition matrix and the previous PageRank)
output: the tag of a "to" website and part of its updated PageRank
2. The second MapReduce job
- Mapper: read the file generated by the last MR job
- Reducer: summation (add up the partial PageRank values for each page)
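Together, the two chained jobs compute one iteration of transition-matrix × PageRank-vector. A single-node sketch of that iteration (class names and the example graph are illustrative; no damping factor is applied, matching the basic workflow above):

```java
import java.util.Arrays;

// One PageRank iteration: job 1 emits the cell products
// transition[to][from] * rank[from], job 2 sums them per "to" page.
public class PageRankSketch {
    static double[] iterate(double[][] transition, double[] rank) {
        double[] next = new double[rank.length];
        for (int to = 0; to < transition.length; to++) {
            for (int from = 0; from < rank.length; from++) {
                next[to] += transition[to][from] * rank[from];
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // Example graph: page 0 links to 1 and 2; page 1 links to 2;
        // page 2 links to 0. Columns sum to 1 (each "from" page splits
        // its rank evenly among its outgoing links).
        double[][] transition = {
            {0.0, 0.0, 1.0},   // rank flowing into page 0
            {0.5, 0.0, 0.0},   // into page 1
            {0.5, 1.0, 0.0},   // into page 2
        };
        double[] rank = {1.0 / 3, 1.0 / 3, 1.0 / 3};
        for (int i = 0; i < 50; i++) rank = iterate(transition, rank);
        System.out.println(Arrays.toString(rank)); // converges toward [0.4, 0.2, 0.4]
    }
}
```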
TOP K FREQUENT WORDS
Use a HashMap to preprocess the data: get each word and its frequency.
Use a Priority Queue to get the top K.
- If the data set is small, we can process it on a single node: Pair(String key, int frequency), private Comparator<Pair> pairComparator. Time complexity O(n + n log k) with a size-k min-heap; space complexity O(n + k).
- With 10T of data we cannot process it on a single node, so we process it on multiple nodes:
● Cut the data into small files, dividing by hash value
● Dispatch the small files to different machines (nodes)
● Get the top k from each machine
● Calculate the final top k from the combination of those top-k lists
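The single-node approach can be sketched as follows (a minimal plain-Java illustration; the class name is hypothetical): count frequencies in a HashMap in O(n), then scan the entries through a min-heap of size k, so the least frequent of the current top k is always evicted first, giving O(n log k) for the selection.

```java
import java.util.*;

// Single-node top-k: HashMap frequency count + size-k min-heap.
public class TopKSketch {
    static List<String> topK(List<String> words, int k) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) freq.merge(w, 1, Integer::sum); // O(n)

        PriorityQueue<Map.Entry<String, Integer>> minHeap =
            new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Integer> e : freq.entrySet()) { // O(n log k)
            minHeap.offer(e);
            if (minHeap.size() > k) minHeap.poll(); // evict the smallest
        }

        List<String> result = new ArrayList<>();
        while (!minHeap.isEmpty()) result.add(0, minHeap.poll().getKey());
        return result; // most frequent first
    }

    public static void main(String[] args) {
        System.out.println(topK(List.of("a", "b", "a", "c", "a", "b"), 2)); // prints [a, b]
    }
}
```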
- If there are multiple machines and each machine already holds a data file, then we need to rehash (repartition the data by key across the machines)
- Realtime TopK with low QPS (queries per second): use a TreeMap to replace the PQ
● Ordered by value
● Supports all the functions of a PQ
● Supports find and delete by key
- Realtime TopK with high QPS: if one key is too hot, the write QPS on a single node becomes very heavy, so we use a cache to balance the conflict between accuracy and latency: buffer writes for several seconds instead of updating the TreeMap on every write, then merge the cached data into the TreeMap. Delete low-frequency words to bound memory.
- Approx TopK Algorithm
key = hash value (flexible, bounded memory)
Use several hash functions over arrays of counters and report the lowest count among them (this structure is a Count-Min Sketch rather than a Bloom filter); hash collisions can only inflate a count, so the minimum is the tightest estimate.
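The approximate counting structure just described can be sketched like this (an illustrative implementation with hypothetical names; the depth, width, re-hash constant, and fixed seed are arbitrary choices for the example):

```java
import java.util.Random;

// Count-Min-Sketch-style counter: several hash rows, report the minimum.
public class CountMinSketch {
    private final int[][] table; // depth rows of width counters
    private final int[] seeds;   // one hash seed per row
    private final int width;

    CountMinSketch(int depth, int width) {
        this.table = new int[depth][width];
        this.width = width;
        this.seeds = new int[depth];
        Random rnd = new Random(42); // fixed seed: illustrative only
        for (int i = 0; i < depth; i++) seeds[i] = rnd.nextInt();
    }

    // Cheap per-row re-hash of the key into a bucket index.
    private int bucket(String key, int row) {
        int h = (key.hashCode() ^ seeds[row]) * 0x9E3779B9;
        return Math.floorMod(h, width);
    }

    void add(String key) {
        for (int row = 0; row < table.length; row++)
            table[row][bucket(key, row)]++;
    }

    // Collisions only over-count, so the minimum across rows is the estimate.
    int estimate(String key) {
        int min = Integer.MAX_VALUE;
        for (int row = 0; row < table.length; row++)
            min = Math.min(min, table[row][bucket(key, row)]);
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch cms = new CountMinSketch(3, 1024);
        for (int i = 0; i < 5; i++) cms.add("hadoop");
        cms.add("spark");
        System.out.println(cms.estimate("hadoop")); // >= 5, usually exactly 5
    }
}
```

Memory is fixed at depth × width counters no matter how many distinct keys arrive, which is the point of the approximation.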
- Use map reduce
COLLABORATIVE FILTERING
User Collaborative Filtering: a form of collaborative filtering based on the similarity between users, calculated from people's ratings of items.
Item Collaborative Filtering: a form of collaborative filtering based on the similarity between items, calculated from people's ratings of those items. (# items << # users; items do not change frequently, which lowers the calculation cost; it uses the user's own historical data, so the results are more convincing)
- Data processing (divide data by user): user ID; movie ID; ratings
- Co-occurrence matrix: here, a matrix whose cell (i, j) records how many people watched both movie i and movie j
1. The first MapReduce job: divide data by user
- Mapper:
input: user ID, movie ID, rating
output: key = user ID, value = movie ID : rating
2. The second MapReduce job: build the co-occurrence matrix
- Mapper:
input: key = user ID, value = the movies the user watched with ratings
output: for each user, key = movie1 : movie2, value = 1 (the user watched both movie1 and movie2)
- Reducer:
input: key = movie1 : movie2, value = Iterable<1, 1, 1, …>
output: key = movie1 : movie2, value = relation (summation: how many people watched both movies)
3. The third MR job: normalize the co-occurrence matrix (summation, division, and transpose)
- Mapper:
input: movieA : movieB \t relation
output: key = movieA, value = movieB = relation
- Reducer:
input: key = movieA, value = movieB = relation
output: key = movieB, value = movieA = relation / row sum (transposed)
4. The fourth MR job: matrix multiplication (co-occurrence matrix × rating matrix)
5. The fifth MR job: summation
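The five jobs can be condensed into a single-node sketch (an illustrative plain-Java version under the assumption of small data; class, method, and movie names are hypothetical): build the co-occurrence matrix from users' watch histories, normalize each row into a probability distribution, then multiply by a user's rating vector to score candidate movies.

```java
import java.util.*;

// Single-node sketch of the item-CF MapReduce pipeline.
public class ItemCFSketch {
    // Jobs 1-2: count how many users watched both movie a and movie b.
    static Map<String, Map<String, Double>> coOccurrence(List<List<String>> histories) {
        Map<String, Map<String, Double>> matrix = new TreeMap<>();
        for (List<String> watched : histories)
            for (String a : watched)
                for (String b : watched)
                    matrix.computeIfAbsent(a, x -> new TreeMap<>())
                          .merge(b, 1.0, Double::sum);
        return matrix;
    }

    // Job 3: divide each row by its sum so rows become probabilities.
    static void normalize(Map<String, Map<String, Double>> matrix) {
        for (Map<String, Double> row : matrix.values()) {
            double sum = row.values().stream().mapToDouble(Double::doubleValue).sum();
            row.replaceAll((movie, v) -> v / sum);
        }
    }

    // Jobs 4-5: multiply the normalized matrix by a user's rating vector
    // (cell products, then summation per movie).
    static Map<String, Double> recommend(Map<String, Map<String, Double>> matrix,
                                         Map<String, Double> ratings) {
        Map<String, Double> scores = new TreeMap<>();
        for (Map.Entry<String, Map<String, Double>> row : matrix.entrySet())
            for (Map.Entry<String, Double> cell : row.getValue().entrySet()) {
                double rating = ratings.getOrDefault(cell.getKey(), 0.0);
                scores.merge(row.getKey(), cell.getValue() * rating, Double::sum);
            }
        return scores;
    }

    public static void main(String[] args) {
        List<List<String>> histories = List.of(
            List.of("m1", "m2"), List.of("m1", "m2", "m3"), List.of("m2", "m3"));
        Map<String, Map<String, Double>> matrix = coOccurrence(histories);
        normalize(matrix);
        // Score movies for a user who rated m1=5.0 and m2=3.0.
        System.out.println(recommend(matrix, Map.of("m1", 5.0, "m2", 3.0)));
    }
}
```

Movies the user has not rated (m3 here) still receive a score through their co-occurrence with rated movies, which is what makes the result a recommendation.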