BigData – Margus Roo

Apache-PIG how to save output into different places

Posted on December 18, 2014 - December 18, 2014 by margusja

Recently I had a problem: I had many huge files containing timestamps, and I had to split those lines into separate files.

Basically group by and save groups to separate files.

First I tried to do it in Apache Hive, and I did somehow reach the result, but I didn't like it. Deep inside I felt there had to be a better and cleaner solution.

I can't share the original dataset, but let's generate some sample data and play with it, because the problem is the same.

A frame of my example dataset

Actually there are 1001 rows, including the header.

As you can see, there is a column called country. I am going to use Apache Pig to split the rows so that I can finally save them into different directories in Hadoop HDFS.

First let's load our data and describe the schema. chararray is similar to the string type familiar from many other languages.

A = LOAD '/user/margusja/pig_demo_files/MOCK_DATA.csv' USING PigStorage(',') AS (id: int, first_name: chararray, last_name: chararray, email: chararray, country: chararray,
ip_address: chararray);

And the final Pig statement will be:

STORE A INTO '/user/margusja/pig_demo_out/' USING org.apache.pig.piggybank.storage.MultiStorage('/user/margusja/pig_demo_out', '4', 'none', ',');

Some additional words about the line above. Let me explain MultiStorage's arguments ('/user/margusja/pig_demo_out', '4', 'none', ','):

The first argument is a path in HDFS. That is where we are going to find the generated directories containing the per-country files. It should match the path we use after STORE A INTO …

The second argument is the index of the column we are going to use as the directory name (counting from 0, so 4 is country). The third, 'none' in our case, is the one to use if we'd like to compress the data. The last one is the separator between columns.
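To make the effect of the column index concrete, here is a small local sketch in Python of the same idea: group rows by the value at index 4 and write each group into its own directory. This is not how MultiStorage is implemented; the sample rows, the function name split_by_column, and the file name part-00000 are all made up for illustration.

```python
import os
import tempfile


def split_by_column(rows, out_dir, index, delimiter=","):
    """Write every row to <out_dir>/<row[index]>/part-00000 and return the keys seen."""
    handles = {}
    try:
        for row in rows:
            key = row[index]
            if key not in handles:
                group_dir = os.path.join(out_dir, key)
                os.makedirs(group_dir, exist_ok=True)
                # "part-00000" is a placeholder file name chosen for this sketch.
                handles[key] = open(os.path.join(group_dir, "part-00000"), "w")
            handles[key].write(delimiter.join(row) + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)


# Made-up rows with the same six columns as the schema above.
SAMPLE_ROWS = [
    ["1", "Mari", "Tamm", "mari@example.com", "Estonia", "10.0.0.1"],
    ["2", "Jussi", "Virtanen", "jussi@example.com", "Finland", "10.0.0.2"],
    ["3", "Jaan", "Kask", "jaan@example.com", "Estonia", "10.0.0.3"],
]

demo_dir = tempfile.mkdtemp()
countries = split_by_column(SAMPLE_ROWS, demo_dir, index=4)
print(countries)
```

Each distinct country ends up as one directory under the output path, which is the layout the STORE statement produces in HDFS.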

So let's run our tiny but very useful Pig script:

> pig -f demo.pig

It starts a map/reduce job in our Hadoop cluster. After it has finished we can inspect the result in HDFS.

Some snapshots from the result via HUE

If we look into the directory Estonia, we find a file containing only the rows where country is Estonia.

So my opinion is that this is awesome!

Kafka Benchmark – Could not find or load main class org.apache.kafka.clients.tools.ProducerPerformance

Posted on October 7, 2014 - October 7, 2014 by margusja

OS: Centos 6.5

Kafka from kafka-0.8.1.2.1.4.0-632.el6.noarch repo. Installed using yum.

When I wanted to use the performance tools:

[server1 kafka]# bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance

Error: Could not find or load main class org.apache.kafka.clients.tools.ProducerPerformance

Then I tried different workarounds, and the following worked for me:

cd /usr/local/

yum install git

git clone https://git-wip-us.apache.org/repos/asf/kafka.git

cd kafka

git checkout -b 0.8 remotes/origin/0.8

./sbt update

./sbt package

./sbt assembly-package-dependency

Now ./bin/kafka-producer-perf-test.sh works!

[root@server1 kafka]# ./bin/kafka-producer-perf-test.sh --topic kafkatopic --broker-list server1:9092 --messages 1000000 --show-detailed-stats --message-size 1024

start.time, end.time, compression, message.size, batch.size, total.data.sent.in.MB, MB.sec, total.data.sent.in.nMsg, nMsg.sec

2014-10-07 09:45:13:825, 2014-10-07 09:45:23:781, 0, 1024, 200, 976.56, 98.0878, 1000000, 100441.9446
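As a quick sanity check, the summary line can be recomputed from the command-line arguments and the two timestamps. The sketch below assumes that the tool counts 1 MB as 1024 * 1024 bytes and that the last timestamp field is milliseconds; both assumptions are consistent with the reported figures.

```python
from datetime import datetime

# Timestamp layout used in the output above; the last field is read as a fraction of a second.
FMT = "%Y-%m-%d %H:%M:%S:%f"

start = datetime.strptime("2014-10-07 09:45:13:825", FMT)
end = datetime.strptime("2014-10-07 09:45:23:781", FMT)
elapsed = (end - start).total_seconds()

messages = 1000000      # --messages
message_size = 1024     # --message-size, in bytes

total_mb = messages * message_size / (1024 * 1024)
mb_per_sec = total_mb / elapsed
msgs_per_sec = messages / elapsed

print(f"elapsed seconds: {elapsed:.3f}")
print(f"total MB sent:   {total_mb:.2f}")
print(f"MB per second:   {mb_per_sec:.4f}")
print(f"msgs per second: {msgs_per_sec:.4f}")
```

The recomputed values agree with total.data.sent.in.MB, MB.sec and nMsg.sec in the output line, so the run took just under ten seconds at roughly 100,000 messages per second.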

Hadoop namenode HA and hive metastore location urls

Posted on October 6, 2014 by margusja

I recently switched our Hadoop namenode to namenode HA. Most steps went fine, but Hive was unhappy and tried to locate files via the old URL. I found the tutorial http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.5/bk_system-admin-guide/content/sysadminguides_ha_chap3.html and used those commands, but only some tables changed afterwards.

Then I did it manually and it helped. If your Hive metadata is in MySQL, you can connect to your Hive database and use the command:

UPDATE SDS SET LOCATION=REPLACE(LOCATION, 'hdfs://namenode:8020', 'hdfs://mycluster');

After that I can switch the active namenode around and Hive can still locate files via the metastore.
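Before running an update like this against a live metastore, it can help to preview what the string replacement does. MySQL's REPLACE(str, from_str, to_str) swaps every occurrence of from_str for to_str, which Python's str.replace mirrors. In the sketch below the two LOCATION values are made up, hdfs://namenode:8020 stands for the old single-namenode address and hdfs://mycluster for the HA nameservice.

```python
def rewrite_location(location, old_prefix, new_prefix):
    """Mimic MySQL REPLACE(location, old_prefix, new_prefix) on one value."""
    return location.replace(old_prefix, new_prefix)


# Hypothetical LOCATION values; real ones come from the SDS table.
locations = [
    "hdfs://namenode:8020/apps/hive/warehouse/log1",
    "hdfs://mycluster/apps/hive/warehouse/log2",
]

# Rows that still point at the old namenode are rewritten, the rest stay as they are.
rewritten = [
    rewrite_location(loc, "hdfs://namenode:8020", "hdfs://mycluster")
    for loc in locations
]

# The search string must match exactly: a stray leading space matches nothing.
unchanged = [
    rewrite_location(loc, " hdfs://namenode:8020", "hdfs://mycluster")
    for loc in locations
]

for before, after in zip(locations, rewritten):
    print(before, "->", after)
```

Rows that already use the nameservice are left alone, so the statement is safe to run more than once.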

Maybe this is helpful for someone 🙂

apache hive how to join log files and use sql queries over joined data

Posted on September 22, 2014 - September 22, 2014 by margusja

Let's create two very simple log files. Into log1.txt let's put, for example, users' problem log data, and into log2.txt the solutions log data.

log1.txt:

user1 | 2014-09-23 | error message 1
user2 | 2014-09-23 | error message 2
user3 | 2014-09-23 | error message 3
user4 | 2014-09-23 | error message 1
user5 | 2014-09-23 | error message 2
user6 | 2014-09-23 | error message 12
user7 | 2014-09-23 | error message 11
user1 | 2014-09-24 | error message 1
user2 | 2014-09-24 | error message 2
user3 | 2014-09-24 | error message 3
user4 | 2014-09-24 | error message 10
user1 | 2014-09-24 | error message 17
user2 | 2014-09-24 | error message 13
user1 | 2014-09-24 | error message 1

log2.txt:

user1 | support2 | solution message 1
user2 | support1 | solution message 2
user3 | support2 | solution message 3
user1 | support1 | solution message 4
user2 | support2 | solution message 5
user4 | support1 | solution message 6
user2 | support2 | solution message 7
user5 | support1 | solution message 8

Create two tables for the datasets above:

hive> create table log1 (user STRING, date STRING, error STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;

OK

Time taken: 5.968 seconds

hive> LOAD DATA INPATH '/user/margusja/hiveinput/log1.txt' OVERWRITE INTO TABLE log1;

Loading data to table default.log1

rmr: DEPRECATED: Please use 'rm -r' instead.

Moved: 'hdfs://bigdata1.host.int:8020/apps/hive/warehouse/log1' to trash at: hdfs://bigdata1.host.int:8020/user/margusja/.Trash/Current

Table default.log1 stats: [numFiles=1, numRows=0, totalSize=523, rawDataSize=0]

OK

Time taken: 4.687 seconds

hive> create table log2 (user STRING, support STRING, solution STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;

OK

Time taken: 0.997 seconds

hive> LOAD DATA INPATH '/user/margusja/hiveinput/log2.txt' OVERWRITE INTO TABLE log2;

Loading data to table default.log2

rmr: DEPRECATED: Please use 'rm -r' instead.

Moved: 'hdfs://bigdata1.host.int:8020/apps/hive/warehouse/log2' to trash at: hdfs://bigdata1.host.int:8020/user/margusja/.Trash/Current

Table default.log2 stats: [numFiles=1, numRows=0, totalSize=304, rawDataSize=0]

OK

Time taken: 0.72 seconds

hive>

Now let's run SQL over the two data files placed in HDFS storage, using Hive:

hive> select log1.user, log1.date, log1.error, log2.support, log2.solution from log2 join log1 on (log1.user = log2.user);

And the result. We now see how the two separate log files are joined together, and we can see, for example, that user2 had error message 2 on 2014-09-23 and support2 offered solution message 7.

user1 2014-09-23 error message 1 support2 solution message 1

user1 2014-09-23 error message 1 support1 solution message 4

user2 2014-09-23 error message 2 support1 solution message 2

user2 2014-09-23 error message 2 support2 solution message 5

user2 2014-09-23 error message 2 support2 solution message 7

user3 2014-09-23 error message 3 support2 solution message 3

user4 2014-09-23 error message 1 support1 solution message 6

user5 2014-09-23 error message 2 support1 solution message 8

user1 2014-09-24 error message 1 support2 solution message 1

user1 2014-09-24 error message 1 support1 solution message 4

user2 2014-09-24 error message 2 support1 solution message 2

user2 2014-09-24 error message 2 support2 solution message 5

user2 2014-09-24 error message 2 support2 solution message 7

user3 2014-09-24 error message 3 support2 solution message 3

user4 2014-09-24 error message 10 support1 solution message 6

user1 2014-09-24 error message 17 support2 solution message 1

user1 2014-09-24 error message 17 support1 solution message 4

user2 2014-09-24 error message 13 support1 solution message 2

user2 2014-09-24 error message 13 support2 solution message 5

user2 2014-09-24 error message 13 support2 solution message 7

user1 2014-09-24 error message 1 support2 solution message 1

user1 2014-09-24 error message 1 support1 solution message 4

Time taken: 34.561 seconds, Fetched: 22 row(s)
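To see where the 22 rows come from, the same inner join can be reproduced locally. The Python sketch below is only an illustration of the join semantics (a simple in-memory hash join), not of how Hive executes the query; stripping the spaces around the | separator is a choice made here for readability.

```python
LOG1 = """\
user1 | 2014-09-23 | error message 1
user2 | 2014-09-23 | error message 2
user3 | 2014-09-23 | error message 3
user4 | 2014-09-23 | error message 1
user5 | 2014-09-23 | error message 2
user6 | 2014-09-23 | error message 12
user7 | 2014-09-23 | error message 11
user1 | 2014-09-24 | error message 1
user2 | 2014-09-24 | error message 2
user3 | 2014-09-24 | error message 3
user4 | 2014-09-24 | error message 10
user1 | 2014-09-24 | error message 17
user2 | 2014-09-24 | error message 13
user1 | 2014-09-24 | error message 1
"""

LOG2 = """\
user1 | support2 | solution message 1
user2 | support1 | solution message 2
user3 | support2 | solution message 3
user1 | support1 | solution message 4
user2 | support2 | solution message 5
user4 | support1 | solution message 6
user2 | support2 | solution message 7
user5 | support1 | solution message 8
"""


def parse(text):
    """Split each non-empty line on '|' and strip surrounding whitespace."""
    return [
        tuple(field.strip() for field in line.split("|"))
        for line in text.splitlines()
        if line.strip()
    ]


def inner_join(log1_rows, log2_rows):
    """Pair every log1 row with every log2 row that has the same user."""
    solutions_by_user = {}
    for user, support, solution in log2_rows:
        solutions_by_user.setdefault(user, []).append((support, solution))
    joined = []
    for user, date, error in log1_rows:
        for support, solution in solutions_by_user.get(user, []):
            joined.append((user, date, error, support, solution))
    return joined


joined = inner_join(parse(LOG1), parse(LOG2))
print(len(joined))
```

Every log1 row is repeated once per matching log2 row, so user1 (4 problems, 2 solutions) contributes 8 rows and user2 (3 problems, 3 solutions) contributes 9. user6 and user7 have no solutions and drop out of the inner join.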

More cool things:

We can select only a specific user:

hive> select log1.user, log1.date, log1.error, log2.support, log2.solution from log2 join log1 on (log1.user = log2.user) where log1.user like '%user1%';
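One thing to keep in mind with like '%user1%' is that % matches any run of characters, so the pattern is a substring match. A simplified Python sketch of LIKE semantics (only % and _ are handled, escape characters are ignored, and the helper name sql_like is made up) shows what that means for similar user names:

```python
import re


def sql_like(value, pattern):
    """Simplified SQL LIKE: '%' matches any run of characters, '_' exactly one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


for user in ["user1", "user2", "user10", "user17"]:
    print(user, sql_like(user, "%user1%"))
```

With the sample data this makes no difference, but in a dataset that also contains user10 or user17 the pattern would select those users too. A plain equality test (log1.user = 'user1') is stricter if only one user is wanted.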