Apache Spark: some hints

  • Stages – pipelined transformations RDD -> RDD -> RDD (narrow dependencies)
  • Shuffle – the transfer of data between stages (wide dependencies)
  • Debug – to visualise how an RDD is built, use input.toDebugString (where input is an RDD); see the debugging sketch after this list
  • Cache expensive RDDs after a shuffle
  • Use accumulators (counters inside executors) to debug RDDs – values are visible in the UI; see the debugging sketch after this list
  • Pipeline as much as possible (rdd -> map -> filter) into one stage
  • Split into stages to reorganise RDDs
  • Avoid shuffling large amounts of data
  • Partitions: 2x the number of cores in the cluster
  • Max – a task should not take longer than 100 ms
  • Memory problems – check dmesg for oom-killer messages
  • Use the built-in aggregateByKey, not your own aggregation, and not groupBy; see the aggregation sketch after this list
  • Filter as early as you can
  • Use KryoSerializer; see the config sketch after this list
  • Use SSD disks for the YARN local dir (shuffle is faster)
  • Use high-level APIs (DataFrame for core processing)
  • rdd.reduceByKey(func) is better than rdd.groupByKey() followed by a reduce
  • Use data.join(...).explain() to inspect the join plan

    RDD.distinct – Shuffles!

  • Learning Spark (e-book)
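
Debugging sketch – a minimal example, assuming a spark-shell session (sc is the predefined SparkContext) and a hypothetical data.txt input:

// toDebugString prints the RDD lineage; indentation marks the stage boundaries
val input = sc.textFile("data.txt")
val words = input.flatMap(_.split("\\s+"))
println(words.toDebugString)

// An accumulator counts events inside the executors; the total is read on the driver
// and also shows up in the web UI
val emptyLines = sc.longAccumulator("emptyLines")
input.foreach(line => if (line.trim.isEmpty) emptyLines.add(1))
println(s"empty lines: ${emptyLines.value}")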
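
Aggregation sketch – reduceByKey / aggregateByKey combine values map-side before the shuffle, while groupByKey ships every record across the network. A minimal example, assuming the same spark-shell session and hypothetical data:

// Word count: reduceByKey instead of groupByKey + reduce
val words = sc.parallelize(Seq("spark", "rdd", "spark", "shuffle"))
val counts = words.map(w => (w, 1L)).reduceByKey(_ + _)

// aggregateByKey when the result type differs from the value type,
// e.g. (sum, count) per key to compute an average
val pairs = sc.parallelize(Seq(("a", 3.0), ("b", 5.0), ("a", 7.0)))
val sumCount = pairs.aggregateByKey((0.0, 0L))(
  (acc, v) => (acc._1 + v, acc._2 + 1),    // fold one value into the per-partition accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)     // merge accumulators across partitions
)
val avgByKey = sumCount.mapValues { case (sum, n) => sum / n }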
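
Kryo config sketch – assumed SparkConf settings for switching to Kryo serialization (MyRecord is a hypothetical case class that ends up in shuffles or caches):

import org.apache.spark.SparkConf

case class MyRecord(id: Long, name: String)

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))   // registering classes keeps the serialized form compact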


scala> List( 1, 2, 4, 3 ).reduce( (x,y) => x + y )
res22: Int = 10

scala> List( 1, 2, 4, 3 ).fold(0)((x,y) => x+y)
res24: Int = 10

scala> List( 1, 2, 4, 3 ).fold(0)((x,y) => { if (x > y) x else y } )
res25: Int = 4

scala> List( 5, 2, 4, 3 ).reduce( (a,b) => { if (a > b) a else b } )
res29: Int = 5
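
The same reduce / fold signatures exist on RDDs; a minimal sketch, assuming a spark-shell session (sc is the predefined SparkContext). Note that on an RDD the fold zero is applied once per partition and once when merging, so it must be a neutral element:

val nums = sc.parallelize(List(1, 2, 4, 3))

nums.reduce((x, y) => x + y)                 // 10, same as the local List above
nums.fold(0)((x, y) => x + y)                // 10; zero must be neutral (0 for +)
nums.fold(0)((x, y) => if (x > y) x else y)  // 4; works here only because all values are >= 0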


Avoid duplicated columns during joins

https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
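
A minimal DataFrame sketch of the pattern from that page, assuming a spark-shell session (spark is the SparkSession) and hypothetical data:

import spark.implicits._

val left  = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
val right = Seq((1, 100.0), (2, 250.0)).toDF("id", "amount")

// Joining on a Seq of column names keeps a single "id" column in the result;
// joining on left("id") === right("id") keeps both and makes "id" ambiguous
val joined = left.join(right, Seq("id"))

joined.explain()   // inspect the join strategy (broadcast vs sort-merge) and shuffles
joined.show()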