Feb 08, 2012
 

Just finished a prototype of terminology extraction based on Nutch and Solr; check the test page.

I also have another (quick and dirty) script to inject new URLs into Nutch and then Solr. The whole demo is not finished yet, since I still need to put up something to remove outdated pages (and what counts as outdated?).

The workflow should be something like this (a rough sketch follows the list):

  1. Get the URL from the client's request
  2. If the URL has already been crawled and is not outdated (not sure how to define this yet), return the terms and we are all set (exit 0!!! 😀 )
  3. Kick the URL to the crawl queue so that it can be crawled later on
  4. Return an empty term list back to the client (sorry, we don't know too much about the page, yet)
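
A minimal sketch of that flow, in TypeScript just for illustration; the Solr select URL, the "terms" field, and the enqueueForCrawl() inject endpoint are all assumptions, not the actual demo's names:

    // Hypothetical endpoints; the real demo's URLs are not given in this post.
    const SOLR_SELECT = "http://localhost:8983/solr/select";

    async function getTerms(pageUrl: string): Promise<string[]> {
      // Step 2: has Solr already indexed this URL? (the "outdated" check is still undefined)
      const params = new URLSearchParams({ q: `url:"${pageUrl}"`, wt: "json", fl: "terms" });
      const res = await fetch(`${SOLR_SELECT}?${params}`);
      const data = await res.json();
      if (data.response.numFound > 0) {
        return data.response.docs[0].terms ?? []; // already crawled: return the terms, exit 0
      }
      await enqueueForCrawl(pageUrl); // step 3: push to the crawl queue for later
      return [];                      // step 4: empty term list back to the client for now
    }

    async function enqueueForCrawl(pageUrl: string): Promise<void> {
      // Placeholder for the "quick and dirty" inject script, e.g. something that
      // appends the URL to a Nutch seed list before the next crawl cycle.
      await fetch("http://localhost:8080/inject", { method: "POST", body: pageUrl });
    }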

Some improvements I have in mind:

  1. Whenever we haven't crawled the page yet, the front end (JS in the web page) should be able to grab some keywords from the page itself and use them as terms (see the first sketch after this list)
    • This may be resource-consuming in the browser, but I need to run some more tests
    • The JS has to run at the very end of the web page so that it can see most of the page's text; this may slow down the business logic, which may not be acceptable
    • Sure, we cannot put too much fancy logic in the JS, so the result could be very bad
    • Since the JS is clear text to everyone, it (I mean the engine) can be fooled by the page owner whenever they have a motive (such as getting more money? I'm not sure)
  2. The default TermVectorComponent does not give me sorted results, so sooner or later we will need a customized class to handle this; I assume that will deliver better results (see the second sketch after this list)
  3. I'm using mmseg4j as the tokenizer. This may not be the best solution, but I have no other choice so far (again, quick and dirty); in the future, if we cannot find something better, we should at least revise the dictionary to make things work better
  4. I would prefer a C/C++ solution for this, but I don't think I can get one; however, I will wrap everything into a C/C++ library so that it can fit into other systems easily
  5. My demo deployment is running on an AWS micro node, with Nutch in local mode. Based on the requirements I got, this should be OK since I don't care about inter-document (page) relationships; however, if that becomes a requirement, then I will have to run in deploy mode, which means I need a Hadoop cluster
  6. On the Solr side, I will do some calculations to see whether I need Hadoop or not. I was told we won't get more than 100 million web pages, so it may be good enough to leave things as-is, but then I have to think about redundancy, failover, disaster recovery, and so on.
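
On point 1, a very naive client-side fallback could look like the first sketch below (TypeScript). The stop-word list is a made-up placeholder, and the regex split only handles space-delimited languages, so it would not segment Chinese text the way mmseg4j does:

    // Count word frequencies in the visible page text and take the top few as stand-in terms.
    const STOP_WORDS = new Set(["the", "a", "an", "and", "or", "of", "to", "in", "is", "for"]);

    function quickKeywords(maxTerms: number = 10): string[] {
      const text = document.body.innerText.toLowerCase();
      const counts = new Map<string, number>();
      for (const word of text.split(/[^a-z0-9]+/)) {
        if (word.length < 3 || STOP_WORDS.has(word)) continue;
        counts.set(word, (counts.get(word) ?? 0) + 1);
      }
      return [...counts.entries()]
        .sort((a, b) => b[1] - a[1]) // most frequent first
        .slice(0, maxTerms)
        .map(([word]) => word);
    }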
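
On point 2, until there is a customized component, the sorting can happen on the caller's side. The second sketch assumes the term/tf pairs have already been pulled out of the TermVectorComponent response (its NamedList-style JSON) into a plain object; the /tvrh handler and the "content" field come from Solr's example config and may differ in a real setup:

    // Build a TermVectorComponent request (handler and field names are assumptions).
    const tvParams = new URLSearchParams({
      q: 'url:"http://example.com/"', // illustrative URL only
      tv: "true",
      "tv.tf": "true",
      "tv.fl": "content",
      wt: "json",
    });
    const tvUrl = `http://localhost:8983/solr/tvrh?${tvParams}`;

    // Sort extracted term -> tf pairs by frequency, highest first.
    function topTermsByTf(termFreqs: Record<string, number>, limit: number = 20): string[] {
      return Object.entries(termFreqs)
        .sort((a, b) => b[1] - a[1])
        .slice(0, limit)
        .map(([term]) => term);
    }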

I guess I will still need a Hadoop cluster, even a small one; at least I can hand off part of my SLA problem to it >:) .

  One Response to “nutch, solr, term vector, and so on”

  1. Just found that smileys no longer work with patterns containing ">" or "<"; it seems WordPress changed something recently, or maybe long ago and I never noticed.
