different cache layers
? actions: foreach(println), take
? when to use DataFrame, when to use RDD, when to use Dataset
MapReduce vs. Spark:
MapReduce – between MapReduce jobs, intermediate data must always be written back to HDFS (disk; needs chunking and distribution), which is the bottleneck
Spark – memory based: caches data in memory, spilling part to disk if it does not fit (10x – 100x faster)
RDD: distributed data parallelism
latency cannot be masked; it grows at each level: memory – disk – network
Hadoop fault tolerance: saves intermediate results to disk to recover from potential failures. Spark takes advantage of Scala's functional-programming nature: it creates immutable in-memory data and recovers from failures by replaying functional transformations over the original datasets.
Transformations are lazy; Actions are eager.
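A minimal Scala sketch of the distinction, assuming an existing SparkContext `sc`:

```scala
val nums    = sc.parallelize(1 to 1000000)
val doubled = nums.map(_ * 2)            // transformation: lazy, nothing runs yet
val evens   = doubled.filter(_ % 4 == 0) // still lazy, only the lineage graph grows
val count   = evens.count()              // action: eager, triggers the whole job
```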
Cache in memory: persist() (customizable storage level); cache() (keeps the RDD in memory for reuse)
Spark traverses the RDD lineage once per action, so without caching we might be unknowingly re-evaluating several transformations.
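A sketch of avoiding re-evaluation with cache(), assuming an existing SparkContext `sc`; the path and `expensiveParse` function are made up for illustration:

```scala
// Without cache(), each action below would re-run the expensive map.
val parsed = sc.textFile("hdfs:///logs").map(expensiveParse) // hypothetical parse fn
parsed.cache() // MEMORY_ONLY; use persist(StorageLevel.MEMORY_AND_DISK) to allow spilling

val errors = parsed.filter(_.contains("ERROR")).count() // first action: computes and caches
val warns  = parsed.filter(_.contains("WARN")).count()  // reuses the cached partitions
```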
Master (driver program / main program): we interact with the cluster manager (Mesos, YARN); it is our handle, and it allocates resources. The driver keeps track of the lineage and retries a failed task up to 4 times by default.
Worker nodes: actions execute on the worker nodes, and we cannot see their output from the master.
For example, foreach prints on each individual worker node, so we cannot see the output from the master node; take involves communication between the worker nodes and the master node, so we can see the results.
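A short sketch of where the output lands, assuming an existing SparkContext `sc`:

```scala
val rdd = sc.parallelize(Seq(1, 2, 3), 3)
rdd.foreach(println)         // runs on executors; output goes to worker logs, not the driver console
rdd.take(2).foreach(println) // take() ships results back to the driver, so we see them there
```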
Pair RDD: act on each key in parallel
- groupByKey: results in a single key–value pair per key; needs to move pairs across nodes (full shuffle)
- reduceByKey: groupByKey + reduce (reduces values by key); more efficient, because it reduces on the map side first and reduces again after the shuffle
- mapValues: applies a function only to the values of a pair RDD
data may be shuffled over the network
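The three pair-RDD operations above can be sketched as follows, assuming an existing SparkContext `sc`:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey shuffles every pair across the network, then we sum on the reduce side:
val sums1 = pairs.groupByKey().mapValues(_.sum)

// reduceByKey pre-aggregates within each partition before shuffling (map-side combine):
val sums2 = pairs.reduceByKey(_ + _)

// mapValues touches only the values, so the keys (and any partitioner) are untouched:
val inc = pairs.mapValues(_ + 1)
```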
Partition (on pair RDDs)
By default, the number of partitions equals the total number of cores across all executor nodes
- hash partitioning: spread data evenly across partitions based on the key
partition = key.hashCode() % numPartitions
- range partitioning: tuples with keys in the same range appear on the same machine
keys are partitioned based on an ordering of the keys and a set of sorted ranges of keys
a range-partitioned RDD must be PERSISTED, otherwise the partitioning (which involves shuffling) is repeatedly applied each time the partitioned RDD is used
Other transformations (e.g. map) discard the partitioner, because the keys might be changed by those transformations, and partitioning depends on the key value.
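A sketch of range partitioning plus persist, assuming an existing SparkContext `sc`:

```scala
import org.apache.spark.RangePartitioner

val pairs       = sc.parallelize(Seq((1, "a"), (5, "b"), (9, "c")))
val partitioner = new RangePartitioner(4, pairs)       // samples the keys to pick range bounds
val partitioned = pairs.partitionBy(partitioner).persist() // persist, or the shuffle repeats on every use

// mapValues preserves the partitioner; map does not, since it may rewrite the key:
val upper = partitioned.mapValues(_.toUpperCase)       // still range-partitioned
```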
Optimizing with partitioners
DataFrame: keeps the data lineage, with RDDs at the back end, plus a logical understanding/plan of the computation
Spark DataFrame internals (Catalyst optimizer): (catalog) – logical plan – optimized logical plan – physical plans – cost model
push filters down to the database side, so less data is read in (predicate pushdown)
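A sketch of filter pushdown, assuming an existing SparkSession `spark`; the JDBC URL and table name are made up:

```scala
import org.apache.spark.sql.functions.col

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host/db") // hypothetical source
  .option("dbtable", "events")
  .load()

// Catalyst pushes this filter into the generated SQL (a WHERE clause),
// so the database returns only matching rows instead of the full table:
val recent = df.filter(col("year") >= 2020)
recent.explain(true) // shows parsed -> analyzed -> optimized logical plan -> physical plan
```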