Skip to content

Margus Roo –

If you're inventing and pioneering, you have to be willing to be misunderstood for long periods of time

  • Cloudbreak Autoscale fix
  • Endast

Apache Spark some hints

Posted on April 2, 2017 - February 28, 2018 by margusja
  • Stages – pipelined jobs RDD -> RDD -> Rdd (narrow)
  • Suffle – The transfer of data between stages  (wide)
  • Debug – to visualise how do we build RDD – input.toDebugString (input is RDD)
  • Cache expensive RDDs after shuffle
  • Use Accumulators (counters inside executors) to debug RDD’s – Values via UI
  • Pipeline as much as possible (rdd->map->filter) one stage
  • split into stages to reorganise RDDs
  • Avoid shuffle large amount of RDDs
  • Parditioneid 2xCores in cluster
  • Max – task should not take no longer than 100ms
  • Memory problem – dmesg oom-killer
  • Use build in aggregateByKey noy your own aggregation not groupBy
  • Filter as early you can
  • Use KyroSerializer
  • SSD disks YARN local dir (shuffle is faster)
  • USE High level API’s (DataFrame for core porcessing)
  • rdd.reduceByKey(func) is better than rdd.groupByKey() and reduce
  • Use data.join().explain()

    RDD.distinct – Shuffles!

  • Learning Spark (e-book)

 

scala> List( 1, 2, 4, 3 ).reduce( (x,y) => x + y )
res22: Int = 10

scala> List( 1, 2, 4, 3 ).fold(0)((x,y) => x+y)
res24: Int = 10

scala> List( 1, 2, 4, 3 ).fold(0)((x,y) => { if (x > y) x else y } )
res25: Int = 4

scala> List( 5, 2, 4, 3 ).reduce( (a,b) => { if (a > b) a else b } )
res29: Int = 5

 

Avoid duplicates during joins

https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html

Posted in Linux

Post navigation

Apache-Spark 2.x + Yarn – some errors and solutions
Just a pic

The Master

Categories

  • Apache
  • Apple
  • Assembler
  • Audi
  • BigData
  • BMW
  • C
  • Elektroonika
  • Fun
  • Hadoop
  • help
  • Infotehnoloogia koolis
  • IOT
  • IT
  • IT eetilised
  • Java
  • Langevarjundus
  • Lapsed
  • lastekodu
  • Linux
  • M-401
  • Mac
  • Machine Learning
  • Matemaatika
  • Math
  • MSP430
  • Muusika
  • neo4j
  • openCL
  • Õpetaja identiteet ja tegevusvõimekus
  • oracle
  • PHP
  • PostgreSql
  • ProM
  • R
  • Turvalisus
  • Varia
  • Windows
Proudly powered by WordPress | Theme: micro, developed by DevriX.