Margus Roo –

If you're inventing and pioneering, you have to be willing to be misunderstood for long periods of time


Apache Pig: how to save output into different places

Posted on December 18, 2014 - December 18, 2014 by margusja

Recently I had a problem where I had many huge files containing timestamps and I had to separate those lines into separate files.

Basically, group by and save the groups to separate files.

First I tried to do it in Apache Hive and somehow I reached the result, but I didn't like it. Deep inside I felt there had to be a better and cleaner solution for that.

I can't share the original dataset, but let's generate some sample data and play with it, because the problem is the same.

A frame of my example dataset

(screenshot: the first rows of MOCK_DATA.csv)

Actually there are 1001 rows, including the header.

So as you can see there is a country column. I am going to use Apache Pig to split the rows so that I can finally save them into different directories in Hadoop HDFS.

First let's load our data and describe its schema. chararray is similar to the string type familiar to us from many other languages.

A = LOAD '/user/margusja/pig_demo_files/MOCK_DATA.csv' USING PigStorage(',') AS (id: int, first_name: chararray, last_name: chararray, email: chararray, country: chararray, ip_address: chararray);

And the final Pig statement will be:

STORE A INTO '/user/margusja/pig_demo_out/' USING org.apache.pig.piggybank.storage.MultiStorage('/user/margusja/pig_demo_out', '4', 'none', ',');

Some additional words about the line above. Let me explain MultiStorage's arguments ('/user/margusja/pig_demo_out', '4', 'none', ','):

The first argument is a path in HDFS. That is the place where we are going to find the generated directories containing the per-country files. It has to be the same path we use after STORE A INTO …

The second argument is the index of the column we are going to use as the directory name (here 4, the country column, counting from 0). The third, 'none' in our case, controls whether the data should be compressed. The last one is the separator between columns.
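Put together, demo.pig would look roughly like the sketch below. The REGISTER path for piggybank.jar is an assumption and depends on where your distribution installs it; MultiStorage lives in the piggybank contrib jar.

-- demo.pig (sketch)
-- adjust the piggybank.jar path to your installation
REGISTER /usr/lib/pig/piggybank.jar;

A = LOAD '/user/margusja/pig_demo_files/MOCK_DATA.csv' USING PigStorage(',')
    AS (id: int, first_name: chararray, last_name: chararray, email: chararray,
        country: chararray, ip_address: chararray);

-- column index 4 (country, counting from 0) becomes the output subdirectory name
STORE A INTO '/user/margusja/pig_demo_out'
    USING org.apache.pig.piggybank.storage.MultiStorage('/user/margusja/pig_demo_out', '4', 'none', ',');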

So let's run our tiny but very useful Pig script:

> pig -f demo.pig

It starts a map/reduce job in our Hadoop cluster. After it finishes we can admire the result in our Hadoop HDFS.

Some snapshots of the result via HUE:

(screenshot: the output directories in HDFS, one per country)

If we look into the Estonia directory we can see that it contains a file with only the rows where the country is Estonia.

(screenshot: contents of the Estonia output file)
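The same can be verified from the command line with standard HDFS commands (the paths are the ones used above):

hdfs dfs -ls /user/margusja/pig_demo_out
hdfs dfs -cat /user/margusja/pig_demo_out/Estonia/*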

So my opinion is that this is awesome!

Posted in BigData

Python + numpy matrix inverse session

Posted on November 28, 2014 by margusja


margusja@IRack:~/Documents/itcollege/lineaaralgebra$ python
Python 2.7.6 (default, Sep 9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from numpy import *
>>> A = matrix('1.0 2.0; 3.0 4.0')
>>> A
matrix([[ 1.,  2.],
        [ 3.,  4.]])
>>> A.I
matrix([[-2. ,  1. ],
        [ 1.5, -0.5]])
>>>
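As a quick sanity check (not part of the original session), multiplying A by its inverse should give back the identity matrix, up to floating-point rounding:

>>> A * A.I
matrix([[ 1.,  0.],
        [ 0.,  1.]])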

Posted in Linux

The law of conservation of energy and linearity. Are they related?

Posted on November 20, 2014 by margusja

(screenshot)

Posted in Math

Hive UDF

Posted on November 6, 2014 - November 6, 2014 by margusja

Sometimes (often) we need custom functions to work with records. Hive has most of the necessary functions, but if you still find yourself in a situation where you need to do some hack in your programming language after you get the records, that is the place to consider a Hive UDF.

For example, suppose we need to add the string "Hello Margusja" in front of a field. Yes, there is concat among Hive's string functions, but this is an example of how to build and deploy UDFs. So, pretending there is no alternative for putting two strings together, we are going to build our own UDF.

The Java code is very simple – you just have to extend org.apache.hadoop.hive.ql.exec.UDF:

package com.margusja.example;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class DemoUDF extends UDF {

  String hello = "Hello Margusja";

  public Text evaluate(final Text s) {
    if (s == null) { return null; }
    return new Text(hello + " " + s);
  }
}

 

Build and package it, for example as HiveDemoUDF.jar.
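One way to do that from the command line is sketched below; the hive-exec jar location is an assumption and differs between distributions, while the Hadoop jars are picked up via hadoop classpath:

mkdir -p classes
javac -cp "$(hadoop classpath):/usr/lib/hive/lib/hive-exec.jar" -d classes DemoUDF.java
jar cf HiveDemoUDF.jar -C classes .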

Now, on the Hive command line, add it to the classpath:

hive> add jar /tmp/HiveDemoUDF.jar;

Added /tmp/HiveDemoUDF.jar to class path
Added resource: /tmp/HiveDemoUDF.jar
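One step is not shown in the transcript: after adding the jar, the function itself still has to be registered. Judging by the query below it was registered under the name my_lower, presumably with something like:

hive> CREATE TEMPORARY FUNCTION my_lower AS 'com.margusja.example.DemoUDF';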

Now you can use your brand new UDF:

hive> select my_lower("input");
Query ID = margusja_20141106153636_564cd6c4-01f1-4daa-841c-4388255135a8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1414681778119_0094, Tracking URL = http://nn1.server.int:8088/proxy/application_1414681778119_0094/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1414681778119_0094
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-11-06 15:36:21,935 Stage-1 map = 0%, reduce = 0%
2014-11-06 15:36:31,206 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.18 sec
MapReduce Total cumulative CPU time: 1 seconds 180 msec
Ended Job = job_1414681778119_0094
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 1.18 sec HDFS Read: 281 HDFS Write: 21 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 180 msec
OK
Hello Margusja input
Time taken: 21.417 seconds, Fetched: 1 row(s)

Posted in Hadoop

Two nostalgic pictures

Posted on October 28, 2014 by margusja

(two photos)

Posted in Langevarjundus

Excellent picture describing map-reduce

Posted on October 10, 2014 - October 10, 2014 by margusja

Sometimes it is difficult to explain how map-reduce works.

Here I found a picture that describes the technology very well:

(picture describing map-reduce)

Posted in Hadoop

Kafka Benchmark – Could not find or load main class org.apache.kafka.clients.tools.ProducerPerformance

Posted on October 7, 2014 - October 7, 2014 by margusja

OS: Centos 6.5

Kafka from the kafka-0.8.1.2.1.4.0-632.el6.noarch package, installed using yum.

When I wanted to use the performance tools:

[server1 kafka]# bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance

Error: Could not find or load main class org.apache.kafka.clients.tools.ProducerPerformance

Then I tried different methods to work around it; the following worked for me:

cd /usr/local/

yum install git

git clone https://git-wip-us.apache.org/repos/asf/kafka.git

cd kafka

git checkout -b 0.8 remotes/origin/0.8

./sbt update

./sbt package

./sbt assembly-package-dependency

And ./bin/kafka-producer-perf-test.sh now works:

[root@server1 kafka]# ./bin/kafka-producer-perf-test.sh --topic kafkatopic --broker-list server1:9092 --messages 1000000 --show-detailed-stats --message-size 1024

start.time, end.time, compression, message.size, batch.size, total.data.sent.in.MB, MB.sec, total.data.sent.in.nMsg, nMsg.sec

2014-10-07 09:45:13:825, 2014-10-07 09:45:23:781, 0, 1024, 200, 976.56, 98.0878, 1000000, 100441.9446

Posted in BigData, Hadoop

Egolaks

Posted on October 6, 2014 by margusja

(three photos from 4 October 2014)

Posted in Langevarjundus

Hadoop namenode HA and hive metastore location urls

Posted on October 6, 2014 by margusja


Recently I switched the Hadoop NameNode to NameNode HA. Most steps went successfully, but Hive was unhappy and tried to locate files via the old URL. I found the tutorial http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.5/bk_system-admin-guide/content/sysadminguides_ha_chap3.html and used those commands, but only some tables changed after running them.

Then I did it manually and it helped. In case your Hive metadata is in MySQL, you can connect to your Hive DB and use the command:

UPDATE SDS SET LOCATION=REPLACE(LOCATION, 'hdfs://namenode:8020', 'hdfs://mycluster');

After that I can switch the active NameNodes around and Hive can still locate files via the metastore.
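Depending on the metastore schema, the DBS table may also still hold the old URI in its DB_LOCATION_URI column; if it does, a similar statement (an untested sketch along the same lines) takes care of it:

UPDATE DBS SET DB_LOCATION_URI=REPLACE(DB_LOCATION_URI, 'hdfs://namenode:8020', 'hdfs://mycluster');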

Maybe this is helpful for someone 🙂

Posted in BigData, Hadoop

Hadoop HDFS is CORRUPT

Posted on September 24, 2014 by margusja

There are probably many ways to fix the problem.

In my case these commands helped:

# su - hdfs

Find the corrupt and missing files on HDFS:

# hadoop fsck / | egrep -v '^\.+$' | grep -v eplica

Let's delete them. This is not a problem if we are only deleting a block here and there, since we have replicas on other machines.

# hadoop fsck / -delete

And now my HDFS is healthy.
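If deleting feels too drastic, fsck can instead move the corrupt files into /lost+found on HDFS, so whatever blocks remain are kept around for inspection:

# hadoop fsck / -move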

Posted in Hadoop
