Jan 252013
 

I was asked to find a solution to write customized Apache log entry (in Apache access log, not error log), other requirements include easy to use, which implicitly means “PHP friendly”, and flexible, which means “free format” and “may or may not have the value”.

Seriously, the first thing jumped into my mind was an Apache module (let’s hack mod_log!) or a PHP extension, but later on I found that … I don’t want to maintain another orphan project especially it will be orphan since nobody want to touch it after it’s up and running (I have one already, which does some cookie stuffs). So I move on to Google …

Then I found apache_note(), plus mod_log_config (search for %{Foobar}n), all that I need to do is tell PHP guys to call this function, and tell ops guys to setup Apache log properly, we are all set.

It’s good to see apache_note can change the note as many times as you want, whenever the request is finished on server side, the final data will be written to log.

Nov 172011
 

I got 700K lines of apache log files from a friend’s web server and imported them to the testing Hadoop instance running on my MacBoox, following the instructions listed here I successfully run some analysis.

Note, the last section in the article talking about the Apache Weblog Data doesn’t seem to be correct – it lacks of some space (” “) after ^ and it gave me quite some headache since I’m not familiar with Java regular expressions. Luckily Hive issue 662 mentioned in the article gave me the correct regex to get things done.

It seems I can only learn to play with Hive/Hadoop cos Hadoop running on MacBook is still a single node installation which is … SLOW, but so far I’m fine with it as I don’t have high volume of data to be processed. As a reference, getting top accessed IPs (which I used to figure out potential abusers) took 83 seconds. The HSQL is simple, something like “select host, count(*) cc from apachelog group by host order by cc desc limit 10;”.

Hadoop is a richmen’s game, seems it only improve the performance whenever you have lots of nodes as it can well distributed tasks.

BTW, Hadoop: The Definitive Guide is a good book ๐Ÿ™‚ .