different cache layers

? actions: foreach(println), take

? when to use DataFrames, when to use RDDs, when to use Datasets

MapReduce vs. Spark:

MapReduce: between MapReduce jobs, data always has to be written back to HDFS (on disk, which means chunking and distributing it); this is the bottleneck.

Spark: memory based. It caches data in memory and keeps part of it on disk if it does not fit in memory (10x to 100x faster).



RDD: distributed data parallelism

latency cannot be masked: it grows from memory to disk to network

Hadoop fault tolerance: save intermediate results to disk to recover from potential failures. Spark takes advantage of Scala's functional programming nature: it keeps immutable data in memory and recovers from failures by replaying the functional transformations over the original dataset.
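A rough way to picture recovery by replay, as a plain Python sketch rather than Spark code. The names source_partitions, lineage and compute_partition are made up for illustration; the only point is that a lost partition can be rebuilt from the source data plus the recorded functions, with no saved intermediate results.

```python
from functools import reduce

# Hypothetical stand-in for an immutable source dataset, split into partitions.
source_partitions = ((1, 2, 3), (4, 5, 6), (7, 8, 9))

# The lineage: an ordered record of the pure functions applied to the source.
lineage = [
    lambda part: tuple(x * 10 for x in part),        # plays the role of map
    lambda part: tuple(x for x in part if x > 20),   # plays the role of filter
]

def compute_partition(index):
    """Rebuild one partition by replaying the lineage over its source data."""
    return reduce(lambda data, step: step(data), lineage, source_partitions[index])

# Materialize every partition once.
materialized = {i: compute_partition(i) for i in range(len(source_partitions))}

# Simulate a failure: partition 1 is lost.
lost = materialized.pop(1)

# Recovery needs only the source and the lineage.
materialized[1] = compute_partition(1)
recovered_matches = materialized[1] == lost
```

Because the steps are pure functions over immutable data, replaying them always gives the same partition back.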

Transformations are lazy; Actions are eager.

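Cache in memory: persist() (customizable storage level); cache() (keeps the data in memory so it can be used again; same as persist() with the default storage level)

Spark traverses the RDD once.

Without cache() or persist(), we might be unknowingly re-evaluating several transformations every time an action runs.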
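To make the re-evaluation concrete, here is a toy model in plain Python. LazyDataset is a hypothetical class, not Spark's API, and its take is simplified to evaluate everything. It counts how often an expensive function runs with and without caching.

```python
class LazyDataset:
    """Hypothetical toy model of a lazily evaluated dataset (not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing the data
        self._cached = None
        self._use_cache = False

    def map(self, fn):
        # Transformation: only records what to do, runs nothing.
        return LazyDataset(lambda: [fn(x) for x in self._materialize()])

    def cache(self):
        # Ask for the result to be kept after the first evaluation.
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data

    def count(self):
        # Action: forces evaluation.
        return len(self._materialize())

    def take(self, n):
        # Action: simplified here to evaluate the whole dataset.
        return self._materialize()[:n]


calls = {"n": 0}

def expensive(x):
    calls["n"] += 1
    return x * x

base = LazyDataset(lambda: [1, 2, 3, 4])

squared = base.map(expensive)
calls_after_transformation = calls["n"]   # nothing has run yet

squared.count()
squared.take(2)
calls_without_cache = calls["n"]          # evaluated once per action

calls["n"] = 0
cached = base.map(expensive).cache()
cached.count()
cached.take(2)
calls_with_cache = calls["n"]             # evaluated once, then reused
```

In this sketch the uncached dataset runs expensive once per element for each action, while the cached one runs it only during the first action.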

Master (driver program / main program): this is what we interact with; it is our handle to the cluster. The cluster manager (Mesos, YARN) allocates resources. The driver keeps track of the lineage. A failing task is attempted up to 4 times by default.

Worker nodes: we cannot see their output from the master. The work of an action executes on the worker nodes.

For example, foreach(println) prints on each individual worker node, so we cannot see the output from the master node; take sends results from the worker nodes back to the master node, so we can see them.

Pair RDD: act on each key in parallel


  • groupByKey: results in a single key-value pair per key; pairs have to be moved across nodes
  • reduceByKey: groupByKey followed by a reduce of the values for each key, but more efficient (reduces on the map side first, then reduces again after the shuffle)
  • mapValues: applies a function only to the values of a pair RDD
  • keys
  • join
  • leftOuterJoin


  • countByKey (an action; the operations listed before it are transformations)
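A small plain-Python simulation of why reduceByKey tends to beat groupByKey followed by a reduce. The two lists stand in for partitions on two worker nodes, and "shuffled" simply counts the pairs that would have to cross the network. The data and function names are invented for the sketch.

```python
from collections import defaultdict

# Hypothetical data: two "worker" partitions holding (key, value) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

def group_by_key_then_sum(parts):
    """Shuffle every pair, then reduce per key."""
    shuffled = [pair for part in parts for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)

def reduce_by_key(parts, fn):
    """Reduce inside each partition first, shuffle the partial results, reduce again."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = fn(local[key], value) if key in local else value
        shuffled.extend(local.items())
    totals = {}
    for key, value in shuffled:
        totals[key] = fn(totals[key], value) if key in totals else value
    return totals, len(shuffled)

grouped_totals, grouped_shuffled = group_by_key_then_sum(partitions)
reduced_totals, reduced_shuffled = reduce_by_key(partitions, lambda x, y: x + y)
```

Both approaches give the same totals, but here the map-side reduce moves 4 pairs instead of 7. The gap widens as the number of values per key grows.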

joins: transformations


data has to be shuffled over the network

Partition (on pair RDDs)

By default, the number of partitions equals the total number of cores on all executor nodes

  • hash partitioning: spread data evenly across partitions based on the key

partition = key.hashCode() % numPartitions

  • range partitioning: tuples with keys in the same range appear on the same machine

keys are partitioned based on (1) an ordering of the keys and (2) a set of sorted ranges of keys

The result of range partitioning must be PERSISTED; otherwise the partitioning is repeatedly applied (involving shuffling) each time the partitioned RDD is used
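The two schemes can be compared with a few lines of plain Python. This is only a sketch: the keys, the number of partitions and the range bounds are assumed values, and each integer key is used as its own hash. (In Python, % with a positive divisor never returns a negative number, so no extra correction is needed here.)

```python
from bisect import bisect_left

def hash_partition(key_hash, num_partitions):
    """partition = hash % numPartitions, for an integer hash."""
    return key_hash % num_partitions

def range_partition(key, upper_bounds):
    """Index of the first range whose upper bound is >= key.

    upper_bounds is a sorted list; a key above the last bound would fall
    into one extra, final partition.
    """
    return bisect_left(upper_bounds, key)

keys = [8, 96, 240, 400, 401, 800]

# Hash partitioning into 4 partitions, using the integer key as its own hash.
by_hash = [hash_partition(k, 4) for k in keys]

# Range partitioning into 4 partitions with assumed ranges
# [1, 200], [201, 400], [401, 600], [601, 800].
by_range = [range_partition(k, [200, 400, 600, 800]) for k in keys]
```

With these particular keys, hashing puts five of the six keys into partition 0, while the range bounds spread them over all four partitions. Which scheme balances better depends entirely on the key distribution.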


Other transformations discard the partitioner (the result has none), because the keys might be changed in those transformations, and partitioning depends on the key value.

Optimizing with partitioners



DataFrame: data lineage, RDD at the backend, logical understanding/plan

Spark DataFrame Internals: (catalog) – logical plan – optimized logical plan – physical plans – cost model

filters are pushed down to the database side, so less data is read in
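The effect of pushing a filter down can be seen without Spark, using Python's built-in sqlite3 module as a stand-in database. The events table and its contents are made up for the sketch. The comparison is between reading every row and filtering afterwards, and letting the database apply the filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, "CH" if i % 10 == 0 else "US") for i in range(100)],
)

# Without pushdown: read every row, then filter on our side.
all_rows = conn.execute("SELECT id, country FROM events").fetchall()
filtered_locally = [row for row in all_rows if row[1] == "CH"]

# With pushdown: the database applies the filter, so only matches are read in.
pushed_down = conn.execute(
    "SELECT id, country FROM events WHERE country = ?", ("CH",)
).fetchall()

rows_read_without_pushdown = len(all_rows)
rows_read_with_pushdown = len(pushed_down)
conn.close()
```

Both paths end with the same 10 rows, but the first one reads all 100 rows to get there.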


