{"id":1256,"date":"2012-02-08T19:01:20","date_gmt":"2012-02-09T02:01:20","guid":{"rendered":"http:\/\/xiehang.com\/blog\/?p=1256"},"modified":"2014-01-28T11:10:50","modified_gmt":"2014-01-28T18:10:50","slug":"nutch-solr-term-vector-and-so-on","status":"publish","type":"post","link":"https:\/\/xiehang.com\/blog\/2012\/02\/08\/nutch-solr-term-vector-and-so-on\/","title":{"rendered":"nutch, solr, term vector, and so on"},"content":{"rendered":"

Just finished a prototype of terminology extraction based on nutch and solr; check the test page.

I also have another (quick and dirty) script to inject new URLs into nutch and then solr. The whole demo is not finished yet, since I still need to put up something to remove outdated pages (and what counts as outdated, anyway?).
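The inject script is nothing fancy; roughly, it appends the URLs to a seed file, runs Nutch's inject, and later pushes fetched segments to Solr with solrindex. The paths, the Solr URL, and the seed-file layout below are placeholders for a local Nutch 1.x setup, not anything the tools mandate:

```python
#!/usr/bin/env python
# Quick and dirty: append new URLs to the seed list, inject them into the
# Nutch crawldb, and (once a crawl cycle has fetched them) push the
# segments to Solr. All paths and the Solr URL are local placeholders.
import subprocess
import sys

NUTCH = "bin/nutch"
CRAWLDB = "crawl/crawldb"
SEED_DIR = "urls"                        # directory of seed files
SOLR_URL = "http://localhost:8983/solr"  # placeholder Solr endpoint

def inject(urls):
    # Append the new URLs to a seed file, then let Nutch pick them up.
    with open(SEED_DIR + "/seed.txt", "a") as f:
        for url in urls:
            f.write(url + "\n")
    subprocess.check_call([NUTCH, "inject", CRAWLDB, SEED_DIR])

def index_to_solr(segment):
    # Ship a fetched segment to Solr so its terms become searchable.
    subprocess.check_call([NUTCH, "solrindex", SOLR_URL, CRAWLDB, segment])

if __name__ == "__main__":
    inject(sys.argv[1:])
```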

The workflow should be something like this (sketched in code right after the list):

1. Get the URL from the client's request.
2. If the URL has already been crawled and is not outdated (not sure how to define "outdated" yet), return the terms and we are all set (exit 0!!! 😀).
3. Otherwise, kick the URL into the crawl queue so that it can be crawled later on.
4. Return an empty term list to the client (sorry, we don't know much about the page yet).
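To make the flow concrete, here is a rough sketch of the lookup-or-enqueue logic. It assumes Solr's select handler at localhost:8983, a schema with url, tstamp, and content fields (roughly what Nutch's stock schema provides), a 7-day freshness cutoff as a stand-in for the "outdated" rule I haven't settled on, and a trivial extract_terms() placeholder for the real terminology extraction:

```python
import json
import time
import urllib.request
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/select"  # placeholder endpoint
MAX_AGE_SECS = 7 * 24 * 3600                       # placeholder freshness rule

def terms_for(url):
    """Return a list of terms for `url`, or [] if we don't know the page yet."""
    # Step 2: look the URL up in Solr.
    q = urlencode({"q": 'url:"%s"' % url,
                   "fl": "url,tstamp,content", "wt": "json"})
    with urllib.request.urlopen(SOLR_SELECT + "?" + q) as resp:
        docs = json.load(resp)["response"]["docs"]
    if docs and not outdated(docs[0]):
        return extract_terms(docs[0])   # crawled and fresh: exit 0!
    # Steps 3-4: queue the URL for crawling, answer with nothing for now.
    enqueue(url)
    return []

def outdated(doc):
    # "Outdated" is still undefined; as a placeholder, treat anything
    # fetched more than MAX_AGE_SECS ago as stale. I assume tstamp is
    # stored as epoch milliseconds; adjust the parsing to your schema.
    fetched = float(doc.get("tstamp", 0)) / 1000.0
    return time.time() - fetched > MAX_AGE_SECS

def enqueue(url):
    # Hand the URL to the crawl pipeline; here, just append to the seed
    # file that the inject script above feeds to Nutch.
    with open("urls/seed.txt", "a") as f:
        f.write(url + "\n")

def extract_terms(doc):
    # Stand-in for the real terminology extraction over the term vector.
    return doc.get("content", "").split()[:20]
```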

Some improvements I have in mind:

1. Whenever we haven't crawled the page yet, the front end (JS in the web page) should be able to grab some keywords from the web page itself and then use them as terms
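The real thing would be a few lines of JS running in the page, but the idea is simple enough to sketch in Python like the snippets above: strip the markup, drop stopwords, and keep the most frequent words as stand-in terms. The stopword list and the top-10 cutoff here are arbitrary:

```python
import re
from collections import Counter

# A tiny stopword list; a real front end would use a proper one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "for", "on", "with", "that", "this", "as", "are", "be"}

def fallback_keywords(html, n=10):
    """Crude keyword grab for pages we haven't crawled: strip tags,
    count words, return the n most frequent non-stopwords as terms."""
    text = re.sub(r"<[^>]+>", " ", html)            # drop markup
    words = re.findall(r"[a-z]{3,}", text.lower())  # keep plain words
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(n)]
```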