Jun 25, 2012
 

A newbie note.

I got this while running a streaming job:

12/06/26 05:43:05 ERROR streaming.StreamJob: Error Launching job : java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.mapred.QueueManager.getQueueACL(QueueManager.java:382)
at org.apache.hadoop.mapred.JobTracker.getQueueAdmins(JobTracker.java:4444)
at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)

Turned out it's an ACL problem. Running "hadoop queue -showacls" gives the list of queues I have access to, so I relaunched the job with:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
-Dmapred.job.queue.name=queue_i_have_access \
….

and everything runs smoothly.
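
For the record, the full command line looks something like the following; the input/output paths and the mapper script are placeholders, not my actual job:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
-Dmapred.job.queue.name=queue_i_have_access \
-input /user/me/input \
-output /user/me/output \
-mapper "python my_mapper.py" \
-file my_mapper.py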

I will post more newbie comments here.

  5 Responses to “Hadoop streaming errors”

  1. I was thinking of using Python for Hadoop streaming, which would make it easy to change code in the production environment, but I failed because the ctypes module is not there in Python 2.4.x, which is the standard deployment on CentOS 5.

    Now I have turned to writing a jumbo C++ program to get everything done; it is jumbo because almost everything is statically linked. The executable was built on CentOS 5, so it runs on CentOS 6 as well.
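
    For what it's worth, the build itself is nothing fancy, roughly along these lines (file names made up, and assuming the static libraries are installed on the build box):

    g++ -O2 -static -o analyzer analyzer.cpp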

    It seems I need to deliver a Java version of my stuff ASAP if I still have to run things on Hadoop.

  2. And to read input from gzipped files, put -Dstream.recordreader.compression=gzip on the command line.

    Also you can exchange data between Hive and Hadoop directly, like:

    INSERT OVERWRITE DIRECTORY 'hdfs://path/to/hdfs/directory/' SELECT ….
    LOAD DATA INPATH 'hdfs://path/to/hdfs/file' OVERWRITE INTO TABLE the_table …
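
    Filled out with made-up table and path names, the pair looks something like:

    INSERT OVERWRITE DIRECTORY 'hdfs://namenode/tmp/context_to_analyze' SELECT item_id, context FROM raw_context;
    LOAD DATA INPATH 'hdfs://namenode/tmp/analysis_output' OVERWRITE INTO TABLE analysis_result;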

    I’m trying to move as much as I can to Hadoop/Hive now to speed up the processing. So far I’ve brought the context analysis time down from ~5 hours (4 hours of processing, 1 hour for copying results to the Hive gateway) to ~30 minutes.

  3. As a side note, the data volumes in the different steps are:

    item list (5 MB, gzipped) => context to be analyzed (12 GB, gzipped) => analysis result (11 GB, plain text) => Hive table (253M records)

  4. Since I don’t have a reducer, setting -Dmapred.reduce.tasks=0 cut the running time from ~30 min to ~2 min.

    It makes sense, right? Why bother to sort and shuffle the data when you don’t want to do any further processing on it?
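
    Concretely, the map-only run is just the streaming command from the post with the extra -D options in front (paths and the analyzer binary are placeholders):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
      -Dmapred.reduce.tasks=0 \
      -Dstream.recordreader.compression=gzip \
      -Dmapred.job.queue.name=queue_i_have_access \
      -input /data/context.gz \
      -output /data/analysis_output \
      -mapper ./analyzer \
      -file analyzer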

  5. Needless to say, setting mapred.reduce.tasks to 0 leaves the results in multiple files, since there is no reducer to merge them, but this is not a problem for me, as Hive’s LOAD DATA supports filename wildcards.
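
    For example, all the part files from the map-only output can be loaded with one statement (directory name made up):

    LOAD DATA INPATH 'hdfs://namenode/tmp/analysis_output/part-*' OVERWRITE INTO TABLE the_table;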
