Problem:
2017-03-24 09:15:55,235 ERROR [dispatcher-event-loop-2] cluster.YarnScheduler: Lost executor 2 on bigdata38.webmedia.int: Container marked as failed: container_e50_1490337980512_0004_01_000003 on host: bigdata38.webmedia.int. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_e50_1490337980512_0004_01_000003
Exit code: 52
Container exited with a non-zero exit code 52
Exit code 52 comes from org.apache.spark.util.SparkExitCode, where val OOM = 52, i.e. the executor died with an OutOfMemoryError.
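For reference, the constant is defined in Spark's source roughly as follows (a paraphrase of org.apache.spark.util.SparkExitCode, not a verbatim copy):

// Paraphrased from org.apache.spark.util.SparkExitCode (Spark 2.x).
object SparkExitCode {
  /** The default uncaught exception handler was reached. */
  val UNCAUGHT_EXCEPTION = 50
  /** The uncaught exception was an OutOfMemoryError. */
  val OOM = 52
}

So an exit status of 52 on the container simply means the executor JVM itself ran out of memory.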
Problem:
2017-03-24 09:33:49,251 WARN [dispatcher-event-loop-4] cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e50_1490337980512_0006_01_000002 on host: bigdata33.webmedia.int. Exit status: -100. Diagnostics: Container released on a *lost* node
2017-03-24 09:33:46,427 WARN nodemanager.DirectoryCollection (DirectoryCollection.java:checkDirs(311)) - Directory /hadoop/yarn/local error, used space above threshold of 90.0%, removing from list of valid directories
2017-03-24 09:33:46,427 WARN nodemanager.DirectoryCollection (DirectoryCollection.java:checkDirs(311)) - Directory /hadoop/yarn/log error, used space above threshold of 90.0%, removing from list of valid directories
2017-03-24 09:33:46,427 INFO nodemanager.LocalDirsHandlerService (LocalDirsHandlerService.java:logDiskStatus(373)) - Disk(s) failed: 1/1 local-dirs are bad: /hadoop/yarn/local; 1/1 log-dirs are bad: /hadoop/yarn/log
2017-03-24 09:33:46,428 ERROR nodemanager.LocalDirsHandlerService (LocalDirsHandlerService.java:updateDirsAfterTest(366)) - Most of the disks failed. 1/1 local-dirs are bad: /hadoop/yarn/local; 1/1 log-dirs are bad: /hadoop/yarn/log
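In other words: the NodeManager's disk health checker saw that the local and log directories were above the 90% utilization threshold, marked them bad, and with 1/1 directories bad the whole node went unhealthy, so YARN released its containers (exit status -100, container released on a lost node). Conceptually the check is just used-space percentage versus a threshold; here is a minimal Scala sketch of that idea (the paths and the 90% threshold come from the log above, the code is only illustrative and not YARN's actual DirectoryCollection logic):

import java.io.File

// Illustrative only: mimic the NodeManager's per-directory utilization check.
val threshold = 90.0
val dirs = Seq("/hadoop/yarn/local", "/hadoop/yarn/log")

dirs.foreach { path =>
  val d = new File(path)
  val total = d.getTotalSpace.toDouble
  val usedPct = if (total > 0) (total - d.getUsableSpace) / total * 100 else 0.0
  val status = if (usedPct > threshold) "BAD (above threshold)" else "ok"
  println(f"$path%s: $usedPct%.1f%% used, $status")
}

Freeing space under /hadoop/yarn (or raising yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage, if the 90% default is too aggressive for the cluster) brings the directories, and the node, back into service.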
Problem:
2017-03-24 09:40:45,618 WARN [dispatcher-event-loop-9] scheduler.TaskSetManager: Lost task 53.0 in stage 2.2 (TID 440, bigdata38.webmedia.int): ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Container marked as failed: container_e50_1490337980512_0006_01_000010 on host: bigdata38.webmedia.int. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
The GC overhead limit means the GC has been running almost non-stop, in quick succession, without being able to recover much memory. The only reasons for that are either poorly written code holding a lot of back references (doubtful here, since this is a simple join) or the available memory capacity has genuinely been reached.
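If memory pressure really is the cause, the usual knobs are more executor heap, more YARN memory overhead, and a smaller per-task working set (more shuffle partitions for the join). A minimal sketch with illustrative values only; the app name is made up, and spark.yarn.executor.memoryOverhead is the Spark 2.x-on-YARN property name (value in MB):

import org.apache.spark.sql.SparkSession

// Illustrative values only -- size these against the actual YARN container limits.
val spark = SparkSession.builder()
  .appName("join-job")                                    // hypothetical app name
  .config("spark.executor.memory", "6g")                  // executor heap
  .config("spark.yarn.executor.memoryOverhead", "1024")   // off-heap headroom on YARN, MB
  .config("spark.sql.shuffle.partitions", "400")          // smaller per-task working set
  .getOrCreate()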
Possible problem (if the compaction takes a long time; it should usually be less than 50 ms):
2017-03-24 11:46:41,488 INFO recovery.NMLeveldbStateStoreService$LeveldbLogger (NMLeveldbStateStoreService.java:log(1032)) - Manual compaction at level-0 from (begin) .. (end); will stop at (end)
2017-03-24 11:46:41,489 INFO recovery.NMLeveldbStateStoreService$LeveldbLogger (NMLeveldbStateStoreService.java:log(1032)) - Manual compaction at level-1 from (begin) .. (end); will stop at 'NMTokens/appattempt_1490337980512_0011_000001' @ 10303 : 1
2017-03-24 11:46:41,499 INFO recovery.NMLeveldbStateStoreService$LeveldbLogger (NMLeveldbStateStoreService.java:log(1032)) - Manual compaction at level-1 from 'NMTokens/appattempt_1490337980512_0011_000001' @ 10303 : 1 .. (end); will stop at (end)
2017-03-24 11:46:41,500 INFO recovery.NMLeveldbStateStoreService (NMLeveldbStateStoreService.java:run(1023)) - Full compaction cycle completed in 20 msec
Related settings:
yarn.resourcemanager.leveldb-state-store.compaction-interval-secs (interval, in seconds, between full compactions of the ResourceManager's leveldb state store)
yarn.timeline-service.leveldb-timeline-store.path (filesystem path of the timeline service's leveldb store)
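A quick way to check whether these compaction cycles ever become a real problem is to scan the NodeManager log for cycles that take much longer than the ~50 ms mentioned above. A minimal Scala sketch; the log path and the 50 ms cut-off are assumptions:

import scala.io.Source

// Flag leveldb full-compaction cycles slower than ~50 ms in a NodeManager log.
// Log path and threshold are illustrative.
val logFile = "/var/log/hadoop-yarn/yarn-nodemanager.log"
val pattern = """Full compaction cycle completed in (\d+) msec""".r

Source.fromFile(logFile).getLines().foreach { line =>
  pattern.findFirstMatchIn(line).foreach { m =>
    val msec = m.group(1).toLong
    if (msec > 50) println(s"slow compaction ($msec ms): $line")
  }
}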
Problem:
ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
Cause: an out-of-memory error (GC overhead limit exceeded):
17/03/31 15:31:12 ERROR SparkUncaughtExceptionHandler: [Container in shutdown] Uncaught exception in thread Thread[Executor task launch worker-26,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
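To confirm from the executor side that this really is GC thrashing rather than an external kill, GC logging can be turned on for the executors; the container's stdout/stderr will then show back-to-back full GCs right before the OutOfMemoryError. A minimal sketch using spark.executor.extraJavaOptions with Java 8-era GC flags (app name and flag values are illustrative):

import org.apache.spark.SparkConf

// Illustrative: verbose GC logging on executors to confirm GC thrashing
// before the "GC overhead limit exceeded" OutOfMemoryError appears.
val conf = new SparkConf()
  .setAppName("join-job") // hypothetical app name
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")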