Mar 13 2012
 

This is to see how things are working, and to try to get some ideas to help my term extraction work.

There will be some blank areas for now (until the Google team approves my AdSense account), so don’t be surprised if you see them. In the future, once you see an ad, please click it 😛.

Feb 10 2012
 

I believe I’ve set up most parts of my terminology extraction demo site. Several parts are still glued together rather than tightly integrated, but it works.

I think it’s time to get into the science part: I need to work out which terms should be returned when a web page yields dozens or hundreds of candidates. The current algorithm is pretty simple (and may not make sense at all) and is just for testing purposes: sort by tf-idf, with the title’s tf-idf given 3 times the weight of the content’s.
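To make the ranking idea concrete, here is a minimal sketch in Python. It is not the demo's actual code: the function names, the smoothed idf formula, and the choice to combine title and content scores as a weighted sum are all my assumptions. The only parts taken from the description above are "sort by tf-idf" and the 3x title weight.

```python
import math
from collections import Counter

TITLE_WEIGHT = 3.0  # from the post: title tf-idf counts 3 times as much as content


def tf_idf(tokens, doc_freq, num_docs):
    """Return {term: tf-idf} for one field (title or content).

    Assumption: tf is the relative frequency within the field, and idf is
    the smoothed form log((1 + N) / (1 + df)) + 1, so scores stay positive.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    scores = {}
    for term, count in counts.items():
        idf = math.log((1 + num_docs) / (1 + doc_freq.get(term, 0))) + 1
        scores[term] = (count / total) * idf
    return scores


def rank_terms(title_tokens, content_tokens, doc_freq, num_docs, top_n=10):
    """Rank candidate terms by weighted tf-idf, highest first.

    Assumption: the two field scores are combined as
    TITLE_WEIGHT * title_score + content_score.
    """
    title_scores = tf_idf(title_tokens, doc_freq, num_docs)
    content_scores = tf_idf(content_tokens, doc_freq, num_docs)
    combined = {
        term: TITLE_WEIGHT * title_scores.get(term, 0.0)
        + content_scores.get(term, 0.0)
        for term in set(title_scores) | set(content_scores)
    }
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical example: document frequencies over a 10-page index.
doc_freq = {"solr": 1, "nutch": 1, "page": 5, "demo": 5}
ranked = rank_terms(
    ["solr", "demo"], ["solr", "nutch", "nutch", "page"], doc_freq, num_docs=10
)
```

With these made-up numbers, a rare term that appears in the title ends up ahead of a term that is equally rare but appears only in the content, which is the effect the title weight is meant to have.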

Anyway, I don’t want to go into too many details here, at least for now; I still need to rework those glued-together parts in a better way.

The demo is here: http://solr.xiehang.com/. Note that it may be taken down at any time without notice, since I don’t want to leave such an easily abused entry point on my servers.

Feb 08 2012
 

Just finished a prototype of terminology extraction based on Nutch and Solr; check the test page.

I also have another (quick and dirty) script to inject new URLs into Nutch and then Solr. The whole demo is not finished yet, since I still need to put up something to remove outdated pages (what counts as outdated?).

The workflow should be something like this: Continue reading »