how to start hadoop MRv2

yarn_architecture

Since hadoop MRv1 and MRv2 are different I found good starting point from http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1-latest/bk_using-apache-hadoop/content/running_mapreduce_examples_on_yarn.html.

[root@vm37 hadoop-yarn]# yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.2.0.2.0.6.0-101.jar 

An example program must be given as the first argument.

Valid program names are:

  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.

  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.

  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.

  dbcount: An example job that count the pageview counts from a database.

  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.

  grep: A map/reduce program that counts the matches of a regex in the input.

  join: A job that effects a join over sorted, equally partitioned datasets

  multifilewc: A job that counts words from several files.

  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.

  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.

  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.

  randomwriter: A map/reduce program that writes 10GB of random data per node.

  secondarysort: An example defining a secondary sort to the reduce.

  sort: A map/reduce program that sorts the data written by the random writer.

  sudoku: A sudoku solver.

  teragen: Generate data for the terasort

  terasort: Run the terasort

  teravalidate: Checking results of terasort

  wordcount: A map/reduce program that counts the words in the input files.

  wordmean: A map/reduce program that counts the average length of the words in the input files.

  wordmedian: A map/reduce program that counts the median length of the words in the input files.

  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

Now lets take wordcount example. I have downloaded example dataset and put it into hadoop fs

[root@vm37 hadoop-mapreduce]# hfds dfs -put wc.txt /user/margusja/wc/input/

Now execute mapreduce job

[root@vm37 hadoop-mapreduce]# yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.2.0.2.0.6.0-101.jar wordcount /user/margusja/wc/input /user/margusja/wc/output

14/08/13 12:34:59 INFO client.RMProxy: Connecting to ResourceManager at vm38.dbweb.ee/192.168.1.72:8032

14/08/13 12:35:00 INFO input.FileInputFormat: Total input paths to process : 1

14/08/13 12:35:01 INFO mapreduce.JobSubmitter: number of splits:1

14/08/13 12:35:01 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name

14/08/13 12:35:01 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar

14/08/13 12:35:01 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class

14/08/13 12:35:01 INFO Configuration.deprecation: mapreduce.combine.class is deprecated. Instead, use mapreduce.job.combine.class

14/08/13 12:35:01 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class

14/08/13 12:35:01 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name

14/08/13 12:35:01 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class

14/08/13 12:35:01 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir

14/08/13 12:35:01 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir

14/08/13 12:35:01 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps

14/08/13 12:35:01 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class

14/08/13 12:35:01 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir

14/08/13 12:35:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1407837551751_0037

14/08/13 12:35:01 INFO impl.YarnClientImpl: Submitted application application_1407837551751_0037 to ResourceManager at vm38.dbweb.ee/192.168.1.72:8032

14/08/13 12:35:02 INFO mapreduce.Job: The url to track the job: http://vm38.dbweb.ee:8088/proxy/application_1407837551751_0037/

14/08/13 12:35:02 INFO mapreduce.Job: Running job: job_1407837551751_0037

14/08/13 12:35:11 INFO mapreduce.Job: Job job_1407837551751_0037 running in uber mode : false

14/08/13 12:35:11 INFO mapreduce.Job:  map 0% reduce 0%

14/08/13 12:35:21 INFO mapreduce.Job:  map 100% reduce 0%

14/08/13 12:35:31 INFO mapreduce.Job:  map 100% reduce 100%

14/08/13 12:35:31 INFO mapreduce.Job: Job job_1407837551751_0037 completed successfully

14/08/13 12:35:31 INFO mapreduce.Job: Counters: 43

        File System Counters

                FILE: Number of bytes read=167524

                FILE: Number of bytes written=493257

                FILE: Number of read operations=0

                FILE: Number of large read operations=0

                FILE: Number of write operations=0

                HDFS: Number of bytes read=384341

                HDFS: Number of bytes written=120766

                HDFS: Number of read operations=6

                HDFS: Number of large read operations=0

                HDFS: Number of write operations=2

        Job Counters 

                Launched map tasks=1

                Launched reduce tasks=1

                Data-local map tasks=1

                Total time spent by all maps in occupied slots (ms)=8033

                Total time spent by all reduces in occupied slots (ms)=7119

        Map-Reduce Framework

                Map input records=9488

                Map output records=67825

                Map output bytes=643386

                Map output materialized bytes=167524

                Input split bytes=134

                Combine input records=67825

                Combine output records=11900

                Reduce input groups=11900

                Reduce shuffle bytes=167524

                Reduce input records=11900

                Reduce output records=11900

                Spilled Records=23800

                Shuffled Maps =1

                Failed Shuffles=0

                Merged Map outputs=1

                GC time elapsed (ms)=172

                CPU time spent (ms)=5880

                Physical memory (bytes) snapshot=443211776

                Virtual memory (bytes) snapshot=1953267712

                Total committed heap usage (bytes)=317194240

        Shuffle Errors

                BAD_ID=0

                CONNECTION=0

                IO_ERROR=0

                WRONG_LENGTH=0

                WRONG_MAP=0

                WRONG_REDUCE=0

        File Input Format Counters 

                Bytes Read=384207

        File Output Format Counters 

                Bytes Written=120766

And in my cluster UI I can see my submitted job

Screen Shot 2014-08-13 at 12.35.16

After job is finished you can explore result via HDFS UI

Screen Shot 2014-08-13 at 12.41.04 Screen Shot 2014-08-13 at 12.40.18

or you can move it to your local dir using hdfs command line command

[root@vm37 hadoop-mapreduce]# hdfs dfs -get /user/margusja/wc/output/part-r-00000

[root@vm37 hadoop-mapreduce]# head part-r-00000 

”       6

“‘Among 2

“‘And   1

“‘Appen 1

“‘Ce    1

“‘Doigts’       1

“‘E’s   1

“‘Ello, 1

“‘Er    1

“‘Er’s  1

Next step might by digging into source code

http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/

I am lazy person so I use maven to manage my java class dependencies

[margusja@vm37 ~]$ mvn archetype:generate -DgroupId=com.deciderlab.wordcount -DartifactId=wordcount -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

[INFO] Scanning for projects…

[INFO]                                                                         

[INFO] ————————————————————————

[INFO] Building Maven Stub Project (No POM) 1

[INFO] ————————————————————————

[INFO] 

[INFO] >>> maven-archetype-plugin:2.2:generate (default-cli) @ standalone-pom >>>

[INFO] 

[INFO] <<< maven-archetype-plugin:2.2:generate (default-cli) @ standalone-pom <<<

[INFO] 

[INFO] — maven-archetype-plugin:2.2:generate (default-cli) @ standalone-pom —

[INFO] Generating project in Batch mode

[INFO] —————————————————————————-

[INFO] Using following parameters for creating project from Old (1.x) Archetype: maven-archetype-quickstart:1.0

[INFO] —————————————————————————-

[INFO] Parameter: groupId, Value: com.deciderlab.wordcount

[INFO] Parameter: packageName, Value: com.deciderlab.wordcount

[INFO] Parameter: package, Value: com.deciderlab.wordcount

[INFO] Parameter: artifactId, Value: wordcount

[INFO] Parameter: basedir, Value: /var/www/html/margusja

[INFO] Parameter: version, Value: 1.0-SNAPSHOT

[INFO] project created from Old (1.x) Archetype in dir: /var/www/html/margusja/wordcount

[INFO] ————————————————————————

[INFO] BUILD SUCCESS

[INFO] ————————————————————————

[INFO] Total time: 6.382s

[INFO] Finished at: Wed Aug 13 13:08:59 EEST 2014

[INFO] Final Memory: 14M/105M

[INFO] ————————————————————————

Now you can move WordCount.java source to your src/main/…[whar ever is your dir]

I made some changes in Wordcount.java

package com.deciderlab.wordcount;

Job job = new Job(conf, “Margusja’s word count demo”);

in pom.xml I added some dependencies

    <dependency>

      <groupId>org.apache.hadoop</groupId>

      <artifactId>hadoop-common</artifactId>

      <version>2.4.1</version>

    </dependency>

    <dependency>

      <groupId>org.apache.hadoop</groupId>

      <artifactId>hadoop-core</artifactId>

      <version>1.2.1</version>

    </dependency>

now build your jar

[margusja@vm37 wordcount]$ mvn package

[INFO] Scanning for projects…

[INFO]                                                                         

[INFO] ————————————————————————

[INFO] Building wordcount 1.0-SNAPSHOT

[INFO] ————————————————————————

[INFO] 

[INFO] — maven-resources-plugin:2.5:resources (default-resources) @ WordCount —

[debug] execute contextualize

[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!

[INFO] skip non existing resourceDirectory /var/www/html/margusja/wordcount/src/main/resources

[INFO] 

[INFO] — maven-compiler-plugin:2.3.2:compile (default-compile) @ WordCount —

[INFO] Nothing to compile – all classes are up to date

[INFO] 

[INFO] — maven-resources-plugin:2.5:testResources (default-testResources) @ WordCount —

[debug] execute contextualize

[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!

[INFO] skip non existing resourceDirectory /var/www/html/margusja/wordcount/src/test/resources

[INFO] 

[INFO] — maven-compiler-plugin:2.3.2:testCompile (default-testCompile) @ WordCount —

[INFO] Nothing to compile – all classes are up to date

[INFO] 

[INFO] — maven-surefire-plugin:2.10:test (default-test) @ WordCount —

[INFO] Surefire report directory: /var/www/html/margusja/wordcount/target/surefire-reports

——————————————————-

 T E S T S

——————————————————-

Running com.deciderlab.wordcount.AppTest

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.052 sec

Results :

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

[INFO] 

[INFO] — maven-jar-plugin:2.3.2:jar (default-jar) @ WordCount —

[INFO] ————————————————————————

[INFO] BUILD SUCCESS

[INFO] ————————————————————————

[INFO] Total time: 4.552s

[INFO] Finished at: Wed Aug 13 13:58:08 EEST 2014

[INFO] Final Memory: 12M/171M

[INFO] ————————————————————————

[margusja@vm37 wordcount]$ 

Now you are ready to lunch your first hadoop MRv2 job

[margusja@vm37 wordcount]$ hadoop jar /var/www/html/margusja/wordcount/target/WordCount-1.0-SNAPSHOT.jar com.deciderlab.wordcount.WordCount /user/margusja/wc/input /user/margusja/wc/output4

14/08/13 14:01:38 INFO mapreduce.Job: Running job: job_1407837551751_0040

14/08/13 14:01:47 INFO mapreduce.Job: Job job_1407837551751_0040 running in uber mode : false

14/08/13 14:01:47 INFO mapreduce.Job:  map 0% reduce 0%

14/08/13 14:01:58 INFO mapreduce.Job:  map 100% reduce 0%

14/08/13 14:02:07 INFO mapreduce.Job:  map 100% reduce 100%

14/08/13 14:02:08 INFO mapreduce.Job: Job job_1407837551751_0040 completed successfully

14/08/13 14:02:08 INFO mapreduce.Job: Counters: 43

        File System Counters

                FILE: Number of bytes read=167524

                FILE: Number of bytes written=493091

                FILE: Number of read operations=0

                FILE: Number of large read operations=0

                FILE: Number of write operations=0

                HDFS: Number of bytes read=384341

                HDFS: Number of bytes written=120766

                HDFS: Number of read operations=6

                HDFS: Number of large read operations=0

                HDFS: Number of write operations=2

        Job Counters 

                Launched map tasks=1

                Launched reduce tasks=1

                Data-local map tasks=1

                Total time spent by all maps in occupied slots (ms)=7749

                Total time spent by all reduces in occupied slots (ms)=6591

        Map-Reduce Framework

                Map input records=9488

                Map output records=67825

                Map output bytes=643386

                Map output materialized bytes=167524

                Input split bytes=134

                Combine input records=67825

                Combine output records=11900

                Reduce input groups=11900

                Reduce shuffle bytes=167524

                Reduce input records=11900

                Reduce output records=11900

                Spilled Records=23800

                Shuffled Maps =1

                Failed Shuffles=0

                Merged Map outputs=1

                GC time elapsed (ms)=114

                CPU time spent (ms)=6020

                Physical memory (bytes) snapshot=430088192

                Virtual memory (bytes) snapshot=1945890816

                Total committed heap usage (bytes)=317194240

        Shuffle Errors

                BAD_ID=0

                CONNECTION=0

                IO_ERROR=0

                WRONG_LENGTH=0

                WRONG_MAP=0

                WRONG_REDUCE=0

        File Input Format Counters 

                Bytes Read=384207

        File Output Format Counters 

                Bytes Written=120766

You can examine your running jobs

Screen Shot 2014-08-13 at 14.01.48

And result in HDFS UI

Screen Shot 2014-08-13 at 14.04.08