- Stages – pipelined narrow transformations (RDD -> RDD -> RDD) run together as a single stage
- Shuffle – the transfer of data between stages (wide dependencies)
- Debug – to visualise how an RDD is built, use input.toDebugString (where input is an RDD; sketch after this list)
- Cache expensive RDDs after a shuffle so they are not recomputed (sketch below)
- Use accumulators (counters incremented inside executors) to debug RDDs – values are visible in the Spark UI (sketch below)
- Pipeline as much as possible – a narrow chain like rdd -> map -> filter runs in one stage
- Stage boundaries appear where the data must be reorganised (i.e. at shuffles)
- Avoid shuffling large amounts of data
- Partitions – aim for roughly 2x the number of cores in the cluster
- Max – a single task should take no longer than ~100 ms
- Memory problems – check dmesg for oom-killer messages (the kernel may have killed an executor)
- Use the built-in aggregateByKey, not your own aggregation built on groupByKey (sketch below)
- Filter as early as you can
- Use KryoSerializer (sketch below)
- Put the YARN local dirs on SSD disks (shuffle is faster)
- Use high-level APIs (DataFrames for core processing)
- rdd.reduceByKey(func) is better than rdd.groupByKey() followed by a reduce (see the aggregateByKey sketch below)
- Use explain() on joins – e.g. data.join(other).explain() – to see the chosen join strategy (sketch below)
- RDD.distinct – shuffles!
- Learning Spark (e-book)
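Sketches for the tips above. First, toDebugString – a minimal example assuming a spark-shell session (so sc is the SparkContext) and a hypothetical input file data.txt:

val input = sc.textFile("data.txt")             // hypothetical path
val counts = input
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                           // wide dependency: stage boundary
println(counts.toDebugString)                   // prints the lineage, indented per stage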
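Caching a post-shuffle RDD (same spark-shell assumptions) so later actions reuse it instead of re-shuffling:

import org.apache.spark.storage.StorageLevel

val pairs    = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val shuffled = pairs.reduceByKey(_ + _)         // expensive: involves a shuffle
shuffled.persist(StorageLevel.MEMORY_AND_DISK)  // or .cache() for memory-only
shuffled.count()                                // first action materialises the cache
shuffled.collect()                              // served from the cache, no re-shuffle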
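A debug counter with a named long accumulator (Spark 2.x API); the running value shows up under the accumulator's name in the Stages tab of the Spark UI:

val badRecords = sc.longAccumulator("badRecords")   // named, so it appears in the UI
val parsed = sc.parallelize(Seq("1", "2", "x", "4")).flatMap { s =>
  try Some(s.toInt)
  catch { case _: NumberFormatException => badRecords.add(1); None }
}
parsed.count()                                      // accumulators only update when an action runs
println(badRecords.value)                           // read the total on the driver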
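groupByKey vs aggregateByKey on a per-key average (the same map-side combining is why reduceByKey beats groupByKey-then-reduce):

val kv = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 2)))

// groupByKey ships every raw value across the network:
val avgSlow = kv.groupByKey().mapValues(vs => vs.sum.toDouble / vs.size)

// aggregateByKey combines (sum, count) map-side before the shuffle:
val avgFast = kv
  .aggregateByKey((0, 0))(
    (acc, v) => (acc._1 + v, acc._2 + 1),       // fold a value into the accumulator
    (a, b)   => (a._1 + b._1, a._2 + b._2))     // merge two accumulators
  .mapValues { case (sum, n) => sum.toDouble / n }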
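Enabling Kryo – a sketch; MyRecord and the app name are hypothetical:

import org.apache.spark.SparkConf

case class MyRecord(id: Long, name: String)     // hypothetical payload type

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))  // registering avoids writing full class names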
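explain() on a DataFrame join, assuming spark-shell (where spark is the SparkSession):

import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
val right = Seq((1, "x"), (3, "y")).toDF("id", "r")
left.join(right, "id").explain()                // look for BroadcastHashJoin vs SortMergeJoin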
reduce vs fold on a plain Scala collection (REPL):
scala> List( 1, 2, 4, 3 ).reduce( (x,y) => x + y )
res22: Int = 10
scala> List( 1, 2, 4, 3 ).fold(0)((x,y) => x+y)
res24: Int = 10
scala> List( 1, 2, 4, 3 ).fold(0)((x,y) => { if (x > y) x else y } )
res25: Int = 4
scala> List( 5, 2, 4, 3 ).reduce( (a,b) => { if (a > b) a else b } )
res29: Int = 5
Note: fold needs a zero element – using 0 as the start for max only works because every value here is >= 0; reduce needs no zero element but throws on an empty list.
- Avoid duplicate columns during joins (see the sketch below):
https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
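The fix from that page, sketched: join on a Seq of column names so the key column appears once, instead of an expression join that keeps both copies:

import spark.implicits._

val left  = Seq((1, "a")).toDF("id", "l")
val right = Seq((1, "x")).toDF("id", "r")

left.join(right, left("id") === right("id")).printSchema()  // two `id` columns (ambiguous)
left.join(right, Seq("id")).printSchema()                   // a single `id` column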