Apache Spark 2.x + YARN – some errors and solutions

Problem:
2017-03-24 09:15:55,235 ERROR [dispatcher-event-loop-2] cluster.YarnScheduler: Lost executor 2 on bigdata38.webmedia.int: Container marked as failed: container_e50_1490337980512_0004_01_000003 on host: bigdata38.webmedia.int. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_e50_1490337980512_0004_01_000003
Exit code: 52
Container exited with a non-zero exit code 52
Exit code 52 comes from org.apache.spark.util.SparkExitCode, where it is defined as val OOM = 52, i.e. the executor died with an OutOfMemoryError.

Problem:

2017-03-24 09:33:49,251 WARN [dispatcher-event-loop-4] cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e50_1490337980512_0006_01_000002 on host: bigdata33.webmedia.int. Exit status: -100. Diagnostics: Container released on a *lost* node

2017-03-24 09:33:46,427 WARN nodemanager.DirectoryCollection (DirectoryCollection.java:checkDirs(311)) – Directory /hadoop/yarn/local error, used space above threshold of 90.0%, removing from list of valid directories

2017-03-24 09:33:46,427 WARN nodemanager.DirectoryCollection (DirectoryCollection.java:checkDirs(311)) – Directory /hadoop/yarn/log error, used space above threshold of 90.0%, removing from list of valid directories

2017-03-24 09:33:46,427 INFO nodemanager.LocalDirsHandlerService (LocalDirsHandlerService.java:logDiskStatus(373)) – Disk(s) failed: 1/1 local-dirs are bad: /hadoop/yarn/local; 1/1 log-dirs are bad: /hadoop/yarn/log

2017-03-24 09:33:46,428 ERROR nodemanager.LocalDirsHandlerService (LocalDirsHandlerService.java:updateDirsAfterTest(366)) – Most of the disks failed. 1/1 local-dirs are bad: /hadoop/yarn/local; 1/1 log-dirs are bad: /hadoop/yarn/log
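
The NodeManager log lines above show both directories being dropped because their disk usage went above 90.0%, after which the node was treated as lost. A minimal Python sketch of the same kind of check is given below, so that a disk can be inspected before it fails. The function names are hypothetical, and the calculation only approximates what the NodeManager does. The 90.0 default is taken from the log; to my knowledge it corresponds to the property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage, but verify that against your Hadoop version.

```python
import shutil

# Threshold taken from the log lines above
# ("used space above threshold of 90.0%").
DEFAULT_THRESHOLD_PERCENT = 90.0


def used_percent(total_bytes, free_bytes):
    """Percentage of the disk that is not available for new data."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * (total_bytes - free_bytes) / total_bytes


def is_over_threshold(total_bytes, free_bytes,
                      threshold=DEFAULT_THRESHOLD_PERCENT):
    """True when usage is strictly above the threshold."""
    return used_percent(total_bytes, free_bytes) > threshold


def check_dirs(paths, threshold=DEFAULT_THRESHOLD_PERCENT):
    """Return {path: used percent} for every path above the threshold."""
    bad = {}
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        pct = used_percent(usage.total, usage.free)
        if pct > threshold:
            bad[path] = round(pct, 1)
    return bad
```

On a worker node, check_dirs(["/hadoop/yarn/local", "/hadoop/yarn/log"]) would list whichever of the two directories from the log is currently above the threshold.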

Problem:

2017-03-24 09:40:45,618 WARN [dispatcher-event-loop-9] scheduler.TaskSetManager: Lost task 53.0 in stage 2.2 (TID 440, bigdata38.webmedia.int): ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Container marked as failed: container_e50_1490337980512_0006_01_000010 on host: bigdata38.webmedia.int. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143

Container exited with a non-zero exit code 143

Exit code 143 is 128 + 15, i.e. the container was stopped with SIGTERM ("Container killed on request"). If the executor log also shows "GC overhead limit exceeded", the garbage collector has been running almost continuously while reclaiming very little memory. That leaves two explanations: either the code holds on to a lot of object references (doubtful if you are only doing a simple join), or the executor has reached its memory capacity.
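
The exit statuses in the three problems above (52, -100 and 143) can be decoded mechanically. The sketch below is a hypothetical helper, not part of Spark or YARN. It pulls the status out of a log line and describes it, using the SparkExitCode meaning of 52 from above, the YARN convention that -100 marks an aborted container, and the shell convention that statuses above 128 mean "killed by signal (status - 128)".

```python
import re
import signal

# Meanings of the non-signal exit statuses that appear in the logs above.
KNOWN_EXIT_STATUS = {
    52: "Spark executor hit an OutOfMemoryError (SparkExitCode.OOM)",
    -100: "YARN aborted the container, e.g. released on a lost node",
}

EXIT_STATUS_RE = re.compile(r"Exit status: (-?\d+)")


def extract_exit_status(log_line):
    """Return the integer after 'Exit status:' or None if absent."""
    match = EXIT_STATUS_RE.search(log_line)
    return int(match.group(1)) if match else None


def describe(status):
    """Human-readable description of a container exit status."""
    if status in KNOWN_EXIT_STATUS:
        return KNOWN_EXIT_STATUS[status]
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return "killed by signal %d (%s)" % (signum, name)
    return "unknown exit status %d" % status
```

For the line from the third problem, extract_exit_status returns 143, and describe(143) reports signal 15 (SIGTERM).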

Possible problem (if the compaction takes a long time; it should usually finish in less than 50 ms):

2017-03-24 11:46:41,488 INFO recovery.NMLeveldbStateStoreService$LeveldbLogger (NMLeveldbStateStoreService.java:log(1032)) – Manual compaction at level-0 from (begin) .. (end); will stop at (end)

2017-03-24 11:46:41,489 INFO recovery.NMLeveldbStateStoreService$LeveldbLogger (NMLeveldbStateStoreService.java:log(1032)) – Manual compaction at level-1 from (begin) .. (end); will stop at ‘NMTokens/appattempt_1490337980512_0011_000001’ @ 10303 : 1

2017-03-24 11:46:41,499 INFO recovery.NMLeveldbStateStoreService$LeveldbLogger (NMLeveldbStateStoreService.java:log(1032)) – Manual compaction at level-1 from ‘NMTokens/appattempt_1490337980512_0011_000001’ @ 10303 : 1 .. (end); will stop at (end)

2017-03-24 11:46:41,500 INFO recovery.NMLeveldbStateStoreService (NMLeveldbStateStoreService.java:run(1023)) – Full compaction cycle completed in 20 msec
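
To spot slow compactions without reading the NodeManager log by hand, the "Full compaction cycle completed in N msec" lines can be filtered. This is a minimal sketch with hypothetical function names. The 50 ms limit is only the rule of thumb mentioned above, not a YARN setting.

```python
import re

COMPACTION_RE = re.compile(r"Full compaction cycle completed in (\d+) msec")
SLOW_MS = 50  # rule of thumb from the text above, not a YARN setting


def compaction_ms(line):
    """Return the compaction duration in ms, or None for other lines."""
    match = COMPACTION_RE.search(line)
    return int(match.group(1)) if match else None


def slow_compactions(lines, limit_ms=SLOW_MS):
    """Return (duration_ms, line) pairs for compactions above the limit."""
    result = []
    for line in lines:
        ms = compaction_ms(line)
        if ms is not None and ms > limit_ms:
            result.append((ms, line))
    return result
```

The 20 msec cycle in the log above stays under the limit, so it would not be reported.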

Related YARN properties:

yarn.resourcemanager.leveldb-state-store.compaction-interval-secs

yarn.timeline-service.leveldb-timeline-store.path

Problem:

ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

Out-of-memory error:

17/03/31 15:31:12 ERROR SparkUncaughtExceptionHandler: [Container in shutdown] Uncaught exception in thread Thread[Executor task launch worker-26,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
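
A common response to these out-of-memory failures is to give each executor more memory, so it helps to know how large a container Spark asks YARN for. The sketch below assumes the Spark 2.x defaults on YARN as I understand them: the request is spark.executor.memory plus spark.yarn.executor.memoryOverhead, and the overhead defaults to 10% of the executor memory with a 384 MB minimum. The function names are hypothetical, and YARN may round the request up further depending on the scheduler settings.

```python
def default_overhead_mb(executor_memory_mb, factor=0.10, minimum_mb=384):
    """Assumed Spark 2.x default for spark.yarn.executor.memoryOverhead."""
    return max(int(executor_memory_mb * factor), minimum_mb)


def container_request_mb(executor_memory_mb, overhead_mb=None):
    """Memory requested from YARN for one executor, before any rounding."""
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(executor_memory_mb)
    return executor_memory_mb + overhead_mb


# Example: a 4 GB heap (spark.executor.memory=4g) with default overhead.
heap_mb = 4096
print(container_request_mb(heap_mb))  # 4096 + 409 = 4505
```

"GC overhead limit exceeded" points at the JVM heap (spark.executor.memory), while a container killed by YARN for exceeding its limit points at the total of heap and overhead, so it is worth checking which of the two is actually too small.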