Dec 21 2012
 

Guess I will spend some time on Hadoop and Hive/HBase in the coming year, so I’m preparing …

It seems Hadoop is doing pretty well: documentation is getting better, and people are thinking more about site operations, which means it is no longer a toy. Things like access control, separation of daemon users, and backward compatibility are all making it a serious (enterprise???) tool.

Why do I care about site operations so much? I'm seriously doubting whether I'm on the dev side or the ops side …

Jun 25 2012
 

A newbie note.

I got this while running a streaming job:

12/06/26 05:43:05 ERROR streaming.StreamJob: Error Launching job : java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.mapred.QueueManager.getQueueACL(QueueManager.java:382)
at org.apache.hadoop.mapred.JobTracker.getQueueAdmins(JobTracker.java:4444)
at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)

Turned out it's an ACL problem. Running "hadoop queue -showacls" gives me the list of queues I have access to, so I relaunched the job with:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
-Dmapred.job.queue.name=queue_i_have_access \
….

and everything runs smoothly.
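
For reference, the full command line looked roughly like the sketch below – the input/output paths, mapper, and reducer here are made-up placeholders, not the real job:

# hypothetical paths: -input/-output are HDFS paths, and the output dir must not exist yet
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
  -Dmapred.job.queue.name=queue_i_have_access \
  -input /user/me/input \
  -output /user/me/output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc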

I will post more newbie comments here.

Feb 08 2012
 

Just finished a prototype of terminology extraction based on Nutch and Solr; check the test page.

I also have another (quick and dirty) script to inject new URLs into Nutch and then Solr. The whole demo is not finished yet, since I still need to put up something to remove outdated pages (and what counts as outdated?).

The workflow should be something like this:
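
Roughly the following, sketched from memory and assuming a stock Nutch 1.x layout – the crawl directory, seed list, and Solr URL are placeholders, and exact sub-command arguments vary a bit between Nutch versions:

# seed the crawl db with the newly injected URLs (urls/ holds the seed list)
bin/nutch inject crawl/crawldb urls/

# generate a fetch list, fetch and parse it, then fold the results back into the crawl db
bin/nutch generate crawl/crawldb crawl/segments
segment=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $segment
bin/nutch parse $segment
bin/nutch updatedb crawl/crawldb $segment

# rebuild the link db and push everything into Solr
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $segment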

Feb 07 2012
 

I hit this problem in my Hadoop-based project:

Cannot run program "chmod": java.io.IOException: error=12, Cannot allocate memory

Did some research but found nothing useful at first – everybody mentioned it's the JDK's problem: it spawns child processes with fork()+exec(), and fork() momentarily needs to commit as much memory as the whole parent JVM, which can fail on a memory-tight box. However, it's weird that I hit this problem on my AWS micro instance only, not on my MacBook, so I moved on to check some more –

It turned out swap was the problem: my micro instance in AWS does not have swap enabled (i.e. zero swap space). After adding 1G of swap, everything's fine now.
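
For the record, adding the swap file was the usual routine – the path and size are just what I picked:

# create a 1G swap file, lock down its permissions, then format and enable it
sudo dd if=/dev/zero of=/swapfile bs=1M count=1024
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# confirm the new swap space is live
swapon -s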

I'm a Java newbie, so my question is: even though it got solved, did I actually fix it properly?

Nov 30 2011
 

Just installed HBase on my Hadoop node; now I have a runnable HBase instance.

There were some issues, so I want to list them clearly:

  • Hadoop version: remember to copy hadoop-core-*.jar (only that one jar) from your Hadoop deployment into hbase/lib, so HBase talks the same RPC version as the cluster
  • get a copy of commons-configuration-*.jar; the one in the hadoop/lib directory should work (see the sketch below)
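
In shell terms the fixes were something like this – the version strings and directories depend on your actual deployment, so treat the paths (and the HADOOP_HOME/HBASE_HOME variables) as placeholders:

# replace HBase's bundled hadoop-core jar with the one from the running cluster,
# so both sides speak the same RPC version
rm $HBASE_HOME/lib/hadoop-core-*.jar
cp $HADOOP_HOME/hadoop-core-*.jar $HBASE_HOME/lib/

# HBase also needs commons-configuration; Hadoop ships a usable copy
cp $HADOOP_HOME/lib/commons-configuration-*.jar $HBASE_HOME/lib/
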
Nov 17 2011
 

I got 700K lines of Apache log files from a friend's web server and imported them into the testing Hadoop instance running on my MacBook; following the instructions listed here, I successfully ran some analyses.

Note: the last section in the article, about the Apache Weblog Data, doesn't seem to be correct – it lacks a space (" ") after the ^, which gave me quite a headache since I'm not familiar with Java regular expressions. Luckily, Hive issue 662, mentioned in the article, gave me the correct regex to get things done.

It seems I can only learn to play with Hive/Hadoop for now, since Hadoop running on a MacBook is still a single-node installation, which is … SLOW. So far I'm fine with it, though, as I don't have a high volume of data to process. As a reference, getting the top accessed IPs (which I used to figure out potential abusers) took 83 seconds. The HiveQL is simple, something like "select host, count(*) cc from apachelog group by host order by cc desc limit 10;".
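
For completeness, the working setup boils down to something like this – the table definition uses the contrib RegexSerDe with the corrected regex from Hive issue 662, so double-check it against your Hive version before trusting my transcription:

hive <<'EOF'
-- table over raw Apache combined-format logs, one column per log field
CREATE TABLE apachelog (
  host STRING, identity STRING, user STRING, time STRING, request STRING,
  status STRING, size STRING, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;

-- the top-accessed-IPs query from above
SELECT host, count(*) cc FROM apachelog GROUP BY host ORDER BY cc DESC LIMIT 10;
EOF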

Hadoop is a rich man's game; it seems it only improves performance when you have lots of nodes across which it can distribute tasks well.

BTW, Hadoop: The Definitive Guide is a good book 🙂.