Apache Spark: some hints

  • Stages – pipelined transformations RDD -> RDD -> RDD (narrow dependencies)
  • Shuffle – the transfer of data between stages (wide dependencies)
  • Debug – to visualise how an RDD is built, use input.toDebugString (where input is an RDD); see the debugging sketch after this list
  • Cache expensive RDDs after a shuffle
  • Use accumulators (counters inside executors) to debug RDDs – values are visible in the UI; see the debugging sketch after this list
  • Pipeline as much as possible (rdd -> map -> filter) into one stage
  • Split into stages to reorganise RDDs
  • Avoid shuffling large amounts of data
  • Partitions: 2x the number of cores in the cluster
  • Max – a task should not take longer than 100 ms
  • Memory problems – check dmesg for oom-killer messages
  • Use the built-in aggregateByKey, not your own aggregation, and not groupBy; see the aggregation sketch after this list
  • Filter as early as you can
  • Use KryoSerializer; see the config sketch after this list
  • Use SSD disks for the YARN local dir (shuffle is faster)
  • Use high-level APIs (DataFrame for core processing)
  • rdd.reduceByKey(func) is better than rdd.groupByKey() followed by a reduce
  • Use data.join(...).explain() to inspect the join plan

    RDD.distinct – Shuffles!

  • Learning Spark (e-book)
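
Debugging sketch – a minimal example, assuming a spark-shell session (sc is the predefined SparkContext) and a hypothetical data.txt input:

// toDebugString prints the RDD lineage; indentation marks the stage boundaries
val input = sc.textFile("data.txt")
val words = input.flatMap(_.split("\\s+"))
println(words.toDebugString)

// An accumulator counts events inside the executors; the total is read on the driver
// and also shows up in the web UI
val emptyLines = sc.longAccumulator("emptyLines")
input.foreach(line => if (line.trim.isEmpty) emptyLines.add(1))
println(s"empty lines: ${emptyLines.value}")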
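
Aggregation sketch – reduceByKey / aggregateByKey combine values map-side before the shuffle, while groupByKey ships every record across the network. A minimal example, assuming the same spark-shell session and hypothetical data:

// Word count: reduceByKey instead of groupByKey + reduce
val words = sc.parallelize(Seq("spark", "rdd", "spark", "shuffle"))
val counts = words.map(w => (w, 1L)).reduceByKey(_ + _)

// aggregateByKey when the result type differs from the value type,
// e.g. (sum, count) per key to compute an average
val pairs = sc.parallelize(Seq(("a", 3.0), ("b", 5.0), ("a", 7.0)))
val sumCount = pairs.aggregateByKey((0.0, 0L))(
  (acc, v) => (acc._1 + v, acc._2 + 1),    // fold one value into the per-partition accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)     // merge accumulators across partitions
)
val avgByKey = sumCount.mapValues { case (sum, n) => sum / n }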
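
Kryo config sketch – assumed SparkConf settings for switching to Kryo serialization (MyRecord is a hypothetical case class that ends up in shuffles or caches):

import org.apache.spark.SparkConf

case class MyRecord(id: Long, name: String)

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))   // registering classes keeps the serialized form compact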


scala> List( 1, 2, 4, 3 ).reduce( (x,y) => x + y )
res22: Int = 10

scala> List( 1, 2, 4, 3 ).fold(0)((x,y) => x+y)
res24: Int = 10

scala> List( 1, 2, 4, 3 ).fold(0)((x,y) => { if (x > y) x else y } )
res25: Int = 4

scala> List( 5, 2, 4, 3 ).reduce( (a,b) => { if (a > b) a else b } )
res29: Int = 5
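
The same reduce / fold signatures exist on RDDs; a minimal sketch, assuming a spark-shell session (sc is the predefined SparkContext). Note that on an RDD the fold zero is applied once per partition and once when merging, so it must be a neutral element:

val nums = sc.parallelize(List(1, 2, 4, 3))

nums.reduce((x, y) => x + y)                 // 10, same as the local List above
nums.fold(0)((x, y) => x + y)                // 10; zero must be neutral (0 for +)
nums.fold(0)((x, y) => if (x > y) x else y)  // 4; works here only because all values are >= 0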


Avoid duplicated columns during joins

https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
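
A minimal DataFrame sketch of the pattern from that page, assuming a spark-shell session (spark is the SparkSession) and hypothetical data:

import spark.implicits._

val left  = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
val right = Seq((1, 100.0), (2, 250.0)).toDF("id", "amount")

// Joining on a Seq of column names keeps a single "id" column in the result;
// joining on left("id") === right("id") keeps both and makes "id" ambiguous
val joined = left.join(right, Seq("id"))

joined.explain()   // inspect the join strategy (broadcast vs sort-merge) and shuffles
joined.show()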