{"id":1147,"date":"2011-11-17T08:46:09","date_gmt":"2011-11-17T15:46:09","guid":{"rendered":"http:\/\/xiehang.com\/blog\/?p=1147"},"modified":"2011-11-17T08:46:09","modified_gmt":"2011-11-17T15:46:09","slug":"playing-with-hive","status":"publish","type":"post","link":"https:\/\/xiehang.com\/blog\/2011\/11\/17\/playing-with-hive\/","title":{"rendered":"Playing with Hive"},"content":{"rendered":"

I got 700K lines of Apache log files from a friend’s web server and imported them into the test Hadoop instance running on my MacBook. Following the instructions listed here<\/a>, I successfully ran some analysis.<\/p>\n

Note that the last section of the article, on <\/i>Apache Weblog Data<\/i>, doesn’t seem to be correct: the regex is missing some spaces (” “) after ^, which gave me quite a headache since I’m not familiar with Java regular expressions. Luckily, Hive issue 662<\/a>, mentioned in the article, gave me the correct regex to get things done.<\/p>\n
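<p>To illustrate why a missing space after ^ matters, here is a minimal sketch using Python's re module rather than Hive's Java regex engine, so treat it as an analogy only. The sample line and the shortened three-field pattern are made up for illustration and are not the regex from the article or from the Hive issue.<\/p>\n

```python
import re

# Hypothetical start of a log line: host, identity, user.
line = '192.0.2.10 - frank'

# With the space inside the class, [^ ]* means: a run of non-space characters,
# so each group captures one space-delimited field.
good = re.compile('([^ ]*) ([^ ]*) ([^ ]*)')
fields = good.match(line).groups()
print(fields)

# Without the space, the first ] after ^ is read as a literal character,
# the classes no longer close where intended, and Python rejects the pattern.
try:
    re.compile('([^]*) ([^]*) ([^]*)')
    broken_compiles = True
except re.error:
    broken_compiles = False
print(broken_compiles)
```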

It seems I can only use this setup to learn Hive\/Hadoop, because Hadoop on a MacBook is still a single-node installation, which is … SLOW<\/b>. So far I’m fine with that, as I don’t have a high volume of data to process. For reference, getting the top accessed IPs (which I used to spot potential abusers) took 83 seconds. The HiveQL is simple, something like “select host, count(*) cc from apachelog group by host order by cc desc limit 10;”.<\/p>\n
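<p>For a log this small, the same top-10 count can be sketched in plain Python as a sanity check on what the query computes. This is only an illustration, not what was run in Hive: top_hosts and the sample lines are hypothetical, and the sketch assumes the host is the first space-delimited field of each line.<\/p>\n

```python
from collections import Counter

def top_hosts(lines, limit=10):
    # GROUP BY host with count(*): tally the first field of every non-empty line.
    counts = Counter(line.split(' ', 1)[0] for line in lines if line.strip())
    # ORDER BY cc DESC LIMIT n: most_common sorts by count, descending.
    return counts.most_common(limit)

# Made-up, simplified log lines (documentation-only IP addresses).
sample = [
    '192.0.2.10 - - GET /index.html',
    '198.51.100.7 - - GET /about.html',
    '192.0.2.10 - - GET /feed',
    '203.0.113.5 - - GET /index.html',
    '192.0.2.10 - - GET /robots.txt',
    '198.51.100.7 - - GET /index.html',
]

top = top_hosts(sample, limit=2)
print(top)
```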

Hadoop is a rich man’s game: it seems to improve performance only when you have lots of nodes, so that it can distribute tasks well.<\/p>\n

BTW, Hadoop: The Definitive Guide<\/i> is a good book \ud83d\ude42 .<\/p>\n","protected":false},"excerpt":{"rendered":"

I got 700K lines of Apache log files from a friend’s web server and imported them into the test Hadoop instance running on my MacBook. Following the instructions listed here, I successfully ran some analysis. Note, the last section in the article talking about the Apache Weblog Data doesn’t seem to be correct – it […]<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[82,230,240,231,239],"_links":{"self":[{"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/posts\/1147"}],"collection":[{"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/comments?post=1147"}],"version-history":[{"count":1,"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/posts\/1147\/revisions"}],"predecessor-version":[{"id":1148,"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/posts\/1147\/revisions\/1148"}],"wp:attachment":[{"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/media?parent=1147"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/categories?post=1147"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/tags?post=1147"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}