Apache Spark some hints

  • Stages – pipelined jobs RDD -> RDD -> Rdd (narrow)
  • Suffle – The transfer of data between stages  (wide)
  • Debug – to visualise how do we build RDD – input.toDebugString (input is RDD)
  • Cache expensive RDDs after shuffle
  • Use Accumulators (counters inside executors) to debug RDD’s – Values via UI
  • Pipeline as much as possible (rdd->map->filter) one stage
  • split into stages to reorganise RDDs
  • Avoid shuffle large amount of RDDs
  • Parditioneid 2xCores in cluster
  • Max – task should not take no longer than 100ms
  • Memory problem – dmesg oom-killer
  • Use build in aggregateByKey noy your own aggregation not groupBy
  • Filter as early you can
  • Use KyroSerializer
  • SSD disks YARN local dir (shuffle is faster)
  • USE High level API’s (DataFrame for core porcessing)
  • rdd.reduceByKey(func) is better than rdd.groupByKey() and reduce
  • Use data.join().explain()

    RDD.distinct – Shuffles!

  • Learning Spark (e-book)

Apache-Spark 2.x + Yarn – some errors and solutions

Problem:
2017-03-24 09:15:55,235 ERROR [dispatcher-event-loop-2] cluster.YarnScheduler: Lost executor 2 on bigdata38.webmedia.int: Container marked as failed: container_e50_1490337980512_0004_01_000003 on host: bigdata38.webmedia.int. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_e50_1490337980512_0004_01_000003
Exit code: 52
Container exited with a non-zero exit code 52
The exit code 52 comes from org.apache.spark.util.SparkExitCode, and it is val OOM=52 – i.e. an OutOfMemoryError

Problem:

2017-03-24 09:33:49,251 WARN  [dispatcher-event-loop-4] cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e50_1490337980512_0006_01_000002 on host: bigdata33.webmedia.int. Exit status: -100. Diagnostics: Container released on a *lost* node

2017-03-24 09:33:46,427 WARN  nodemanager.DirectoryCollection (DirectoryCollection.java:checkDirs(311)) – Directory /hadoop/yarn/local error, used space above threshold of 90.0%, removing from list of valid directories

2017-03-24 09:33:46,427 WARN  nodemanager.DirectoryCollection (DirectoryCollection.java:checkDirs(311)) – Directory /hadoop/yarn/log error, used space above threshold of 90.0%, removing from list of valid directories

2017-03-24 09:33:46,427 INFO  nodemanager.LocalDirsHandlerService (LocalDirsHandlerService.java:logDiskStatus(373)) – Disk(s) failed: 1/1 local-dirs are bad: /hadoop/yarn/local; 1/1 log-dirs are bad: /hadoop/yarn/log

2017-03-24 09:33:46,428 ERROR nodemanager.LocalDirsHandlerService (LocalDirsHandlerService.java:updateDirsAfterTest(366)) – Most of the disks failed. 1/1 local-dirs are bad: /hadoop/yarn/local; 1/1 log-dirs are bad: /hadoop/yarn/log

 

Problem:

2017-03-24 09:40:45,618 WARN  [dispatcher-event-loop-9] scheduler.TaskSetManager: Lost task 53.0 in stage 2.2 (TID 440, bigdata38.webmedia.int): ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Container marked as failed: container_e50_1490337980512_0006_01_000010 on host: bigdata38.webmedia.int. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143

Container exited with a non-zero exit code 143

The GC overhead limit means, GC has been running non-stop in quick succession but it was not able to recover much memory. Only reason for that is, either code has been poorly written and have alot of back reference(which is doubtful, as you are doing simple join), or memory capacity has reached.

May-be problem (if it takes long time – usually should be less than 50ms):

2017-03-24 11:46:41,488 INFO  recovery.NMLeveldbStateStoreService$LeveldbLogger (NMLeveldbStateStoreService.java:log(1032)) – Manual compaction at level-0 from (begin) .. (end); will stop at (end)

2017-03-24 11:46:41,489 INFO  recovery.NMLeveldbStateStoreService$LeveldbLogger (NMLeveldbStateStoreService.java:log(1032)) – Manual compaction at level-1 from (begin) .. (end); will stop at ‘NMTokens/appattempt_1490337980512_0011_000001’ @ 10303 : 1

2017-03-24 11:46:41,499 INFO  recovery.NMLeveldbStateStoreService$LeveldbLogger (NMLeveldbStateStoreService.java:log(1032)) – Manual compaction at level-1 from ‘NMTokens/appattempt_1490337980512_0011_000001’ @ 10303 : 1 .. (end); will stop at (end)

2017-03-24 11:46:41,500 INFO  recovery.NMLeveldbStateStoreService (NMLeveldbStateStoreService.java:run(1023)) – Full compaction cycle completed in 20 msec

yarn.resourcemanager.leveldb-state-store.compaction-interval-secs

yarn.timeline-service.leveldb-timeline-store.path

Problem:

ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

Out-Of-Memory error

17/03/31 15:31:12 ERROR SparkUncaughtExceptionHandler: [Container in shutdown] Uncaught exception in thread Thread[Executor task launch worker-26,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded

Ilus pilt Krissuga

Harva saab endast nii kena pildi lasta teha. Patt oleks seda siis ainult endale hoida.

Smart Home GUI

Now I can operate with some devices on my home via GUI – Awesome!

Posted in IOT

Hadoop Object Storage – Ozone

https://wiki.apache.org/hadoop/Ozone

Downloaded last hadoop development source (hadoop-3.0.0-alpha2) switched to HDFS-7240 branch where ozone development is taking place. Build it – success.

 

[ozone@bigdata24 hadoop-3.0.0-alpha2-SNAPSHOT]$ ./bin/hdfs
Usage: hdfs [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]

OPTIONS is none or any of:

–buildpaths attempt to add class files from build tree
–config dir Hadoop config directory
–daemon (start|status|stop) operate on a daemon
–debug turn on shell script debug mode
–help usage information
–hostnames list[,of,host,names] hosts to use in worker mode
–hosts filename list of hosts to use in worker mode
–loglevel level set the log4j level for this command
–workers turn on worker mode

SUBCOMMAND is one of:

balancer run a cluster balancing utility
cacheadmin configure the HDFS cache
classpath prints the class path needed to get the hadoop jar and the required libraries
crypto configure HDFS encryption zones
datanode run a DFS datanode
debug run a Debug Admin to execute HDFS debug commands
dfsadmin run a DFS admin client
dfs run a filesystem command on the file system
diskbalancer Distributes data evenly among disks on a given node
envvars display computed Hadoop environment variables
erasurecode run a HDFS ErasureCoding CLI
fetchdt fetch a delegation token from the NameNode
fsck run a DFS filesystem checking utility
getconf get config values from configuration
groups get the groups which users belong to
haadmin run a DFS HA admin client
jmxget get JMX exported values from NameNode or DataNode.
journalnode run the DFS journalnode
lsSnapshottableDir list all snapshottable dirs owned by the current user
mover run a utility to move block replicas across storage types
namenode run the DFS namenode
nfs3 run an NFS version 3 gateway
oev apply the offline edits viewer to an edits file
oiv apply the offline fsimage viewer to an fsimage
oiv_legacy apply the offline fsimage viewer to a legacy fsimage
oz command line interface for ozone
portmap run a portmap service
scm run the Storage Container Manager service
secondarynamenode run the DFS secondary namenode
snapshotDiff diff two snapshots of a directory or diff the current directory contents with a snapshot
storagepolicies list/get/set block storage policies
version print the version
zkfc run the ZK Failover Controller daemon

 

As you can see new fancy attributes like oz and scm are there.

[ozone@bigdata24 hadoop-3.0.0-alpha2-SNAPSHOT]$ bin/hdfs oz
ERROR: oz is not COMMAND nor fully qualified CLASSNAME.

[ozone@bigdata24 hadoop-3.0.0-alpha2-SNAPSHOT]$ bin/hdfs scm
Error: Could not find or load main class

No luck. I was out of ideas so wrote to hadoop users list. No answers. After it I tried hadoop developers list and got help:

Hi Margus,

It looks like there might have been some error when merging trunk into HDFS-7240, which mistakenly
changed some entries in hdfs script. Thanks for the catch!

We will update the branch to fix it. In the meantime, as a quick fix, you can apply the attached
patch file and re-compile, OR do the following manually:

1. open hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs
2. between
oiv_legacy)
       HADOOP_CLASSNAME=org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewer
     ;;
 and
portmap)
       HADOOP_SUBCMD_SUPPORTDAEMONIZATION="true"
       HADOOP_CLASSNAME=org.apache.hadoop.portmap.Portmap
     ;;
add
oz) 
    HADOOP_CLASSNAME=org.apache.hadoop.ozone.web.ozShell.Shell 
;;
3. change this line
CLASS='org.apache.hadoop.ozone.storage.StorageContainerManager'
to
HADOOP_CLASSNAME='org.apache.hadoop.ozone.storage.StorageContainerManager'
4. re-compile.


rebuild it and it helped.

Lets try to play whit a new toy.

[ozone@bigdata24 hadoop-3.0.0-alpha2-SNAPSHOT]$ ./bin/hdfs oz -v -createVolume http://127.0.0.1:9864/margusja -user ozone -quota 10GB -root
Volume name : margusja
{
 "owner" : {
 "name" : "ozone"
 },
 "quota" : {
 "unit" : "GB",
 "size" : 10
 },
 "volumeName" : "margusja",
 "createdOn" : "Fri, 03 Feb 2017 10:13:39 GMT",
 "createdBy" : "hdfs"
}

[ozone@bigdata24 hadoop-3.0.0-alpha2-SNAPSHOT]$ ./bin/hdfs oz -createBucket http://127.0.0.1:9864/margusja/demo -user ozone -v
Volume Name : margusja
Bucket Name : demo
{
 "volumeName" : "margusja",
 "bucketName" : "demo",
 "acls" : null,
 "versioning" : "DISABLED",
 "storageType" : "DISK"
}

[ozone@bigdata24 hadoop-3.0.0-alpha2-SNAPSHOT]$ ./bin/hdfs oz -v -putKey http://127.0.0.1:9864/margusja/demo/key001 -file margusja.txt
Volume Name : margusja
Bucket Name : demo
Key Name : key001
File Hash : 4273b3664fcf8bd89fd2b6d25cdf64ae


[ozone@bigdata24 hadoop-3.0.0-alpha2-SNAPSHOT]$ ./bin/hdfs oz -v -putKey http://127.0.0.1:9864/margusja/demo/key002 -file margusja2.txt
Volume Name : margusja
Bucket Name : demo
Key Name : key002

[ozone@bigdata24 hadoop-3.0.0-alpha2-SNAPSHOT]$ ./bin/hdfs oz -v -listKey http://127.0.0.1:9864/margusja/demo/
Volume Name : margusja
bucket Name : demo
{
 "version" : 0,
 "md5hash" : "4273b3664fcf8bd89fd2b6d25cdf64ae",
 "createdOn" : "Fri, 03 Feb 2017 12:25:43 +0200",
 "size" : 21,
 "keyName" : "key001"
}
{
 "version" : 0,
 "md5hash" : "4273b3664fcf8bd89fd2b6d25cdf64ae",
 "createdOn" : "Fri, 03 Feb 2017 12:26:14 +0200",
 "size" : 21,
 "keyName" : "key002"
}
[ozone@bigdata24 hadoop-3.0.0-alpha2-SNAPSHOT]$


To compare with filesystem we created directory /margusja after it created subdirectory margusja/demo and finally added two files to margusja/demo/. 
So the picture is smth like

/margusja (volume)
/margusja/demo (bucket)
/margusja/demo/margusja.txt (key001)
/margusja/demo/margusja2.txt (key002)

sonoff pow to Sonoff-MQTT-OTA-Arduino

Hiinlased on tulnud välja päris taskukohase tükiga – https://www.itead.cc/sonoff-pow.html. Tegemis on wifi kaudu lülitatava releega (230v/16A) piisav enamus kodumajapidamises ühefaasiliste jubinate kontrollimiseks.

2016-11-27-10-24-33

Kui nüüd jubin lahti võtta (küsimusele: “Miks peaks?” otsige vastust raamatust “Hackers: Heroes of the Computer Revolution” by S. Levy), siis leiame sealt huvipakkuva pordi:

2016-11-27-11-05-31

GND ja VDD vahele lähevad veel serial RX ja TX.

Loodus tühja kohta ei salli. Github’ist leiab projekti https://github.com/arendst/Sonoff-MQTT-OTA-Arduino. Tänud Ull’le (alias Märt Maiste), kes need kaks asja mul kokku aitas panna.

Edasi on lihtne. Tuleb github projekt alla laadida. Kokku lasta ja jubina sisse lasta. Kuna mul parajasti ühtegi töökorras FTDI plaati ei olnud, siis aitas arduino plaat hädast välja.

2016-11-27-10-34-08

screen-shot-2016-11-27-at-10-38-44

Kui nüüd jubin kenasti vooluvõrku panna ja muud seadistused teha, siis peaks kodusest DHCP serverist saama ta IP ja avades selle IP veebilehitsejas peaks avanema pilt:

screen-shot-2016-11-27-at-11-21-27

Kõnealune jubin toetab MQTT protokolli, mis annab väga vajaliku kihi raud- ja tarkvara vahele.

Mina paigaldasin raspberry pi peale mosquitto MQTT serveri (tnx Ull vihje eest). Nüüd on võimalik MQTT sub käsuga kuulata jubina staatust. Näiteks kas ta on sisse lülitatud, pinget, voolu tarbimist ja palju muud veel. Kõike seda saab ka veebiliidese kaudu

screen-shot-2016-11-27-at-11-26-29

screen-shot-2016-11-27-at-11-29-59

 

Kui nüüd WAN port suunata raspberry 22 porti, saab (juhul kui internet on olemas ja kodus ka LAN peal kõik toimib) kontrollida eemalt oma jubinaid

screen-shot-2016-11-27-at-11-36-25

Lisaks peaks kogu see kompott kokku istuma OpenHub projektiga.

Basys2 with external clock

Recently discovered that internal build in clock is quite unstable. Making a simple stopper was quite unsharp.

Added external clock (25Mz) and the picture is much better.

20161023_115610