Nov 17 2011
 

I got 700K lines of Apache log files from a friend’s web server and imported them into the test Hadoop instance running on my MacBook. Following the instructions listed here, I successfully ran some analysis.

Note that the last section of the article, the one about Apache Weblog Data, doesn’t seem to be correct: the regex is missing some spaces (" ") after ^, which gave me quite a headache since I’m not familiar with Java regular expressions. Luckily, Hive issue 662, mentioned in the article, gave me the correct regex to get things done.
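To show why those spaces matter, here is a minimal sketch in Python (not the Java that Hive uses). The pattern is the Apache log regex as it is commonly quoted for Hive’s RegexSerDe, written here with the space inside each `[^ ]` class. I’m assuming it matches the one from the Hive issue. The sample line and the `parse_line` helper are made up for illustration.

```python
import re

# Assumed pattern: the Apache log regex commonly quoted for Hive's RegexSerDe.
# Each [^ ] class contains a space, meaning "any character except a space".
# Without that space the fields can no longer be told apart.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)


def parse_line(line):
    """Return the captured fields as a tuple, or None if the line does not match."""
    match = LOG_PATTERN.fullmatch(line)
    return match.groups() if match else None


# Hypothetical log line in the common log format (no referer or user agent).
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')

fields = parse_line(sample)
print(fields[0])  # host
print(fields[5])  # status
```

The first captured group is the host, which is the column the top-IP query groups by.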

It seems I can only learn and play with Hive/Hadoop this way, because Hadoop running on a MacBook is still a single-node installation, which is … SLOW. So far I’m fine with that, as I don’t have a high volume of data to process. As a reference, getting the most frequently accessed IPs (which I used to figure out potential abusers) took 83 seconds. The HiveQL is simple, something like “select host, count(*) cc from apachelog group by host order by cc desc limit 10;”.
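For comparison, the same aggregation can be sketched in plain Python for a file small enough to read on one machine. This is only an illustration of what the query computes, not how Hive runs it. The `top_hosts` helper and the sample lines are hypothetical, and it assumes the host is the first space-separated field of each line.

```python
from collections import Counter


def top_hosts(lines, limit=10):
    """Count requests per host and return (host, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        # Assumption: the host is the first space-separated field of the log line.
        host = line.split(" ", 1)[0]
        if host:
            counts[host] += 1
    return counts.most_common(limit)


# Hypothetical sample data; in practice this would be the lines of a log file.
sample_lines = [
    '10.0.0.1 - - [17/Nov/2011:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [17/Nov/2011:10:00:01 +0000] "GET /a HTTP/1.1" 200 128',
    '10.0.0.1 - - [17/Nov/2011:10:00:02 +0000] "GET /b HTTP/1.1" 404 0',
]

print(top_hosts(sample_lines, limit=2))
```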

Hadoop is a rich man’s game: it seems to improve performance only when you have lots of nodes, since that is when it can distribute tasks well.

BTW, Hadoop: The Definitive Guide is a good book 🙂.

  6 Responses to “Playing with Hive”

  1. Every time I have to run this under Hive:

    set mapred.job.tracker=localhost:9001;
    add jar /opt/hive/lib/hive-contrib-0.7.1.jar;
    select host, count(*) cc from apachelog group by host order by cc desc limit 10;

    Need to find an easy way to save myself the typing.

  2. I’ve successfully set up a Derby server and put the Hive metastore on it. Now I’m trying to move the metastore into Hadoop itself so that I don’t have to start another Java instance.

  3. It seems FileStore is no longer supported, so I’m leaving everything in Derby, at least for now.

  4. I moved to MySQL for the metastore since Derby is not that pretty, especially when running as a network service.

  5. Put the following lines into hive-default.xml:

    <property>
    <name>hive.aux.jars.path</name>
    <value>file:///opt/hive/lib/hive-contrib-0.7.1.jar</value>
    </property>

    so that I can get rid of running “add jar …” every time.

  6. And one more setting in hive-default.xml, for the default job tracker:

    <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
    </property>
