Feb 10, 2012
 

I believe I’ve set up most of my terminology extraction demo site. Several parts are still glued together rather than tightly integrated, but it works.

I think it’s time to get into the science part: I need to figure out which terms should be returned when a web page yields dozens or hundreds of candidates. The current algorithm is pretty simple (and may not make sense at all) and is just for testing purposes: sort by tf-idf, with the title’s tf-idf given 3 times the weight of the content’s.
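To make the ranking idea concrete, here is a rough sketch in Python. It is not the demo's actual code. It assumes the per-term tf-idf scores for the title and the content have already been computed elsewhere, and that the two are combined by a weighted sum, which is only one possible reading of "3 times the weight". The names `rank_terms`, `TITLE_WEIGHT` and `top_n` are made up for illustration.

```python
# Hypothetical sketch: rank candidate terms by a weighted sum of tf-idf scores.
# Assumption: title and content tf-idf values are precomputed per term.
# Assumption: the two scores are added, with the title score multiplied by 3.

TITLE_WEIGHT = 3.0  # title tf-idf counts 3 times as much as content tf-idf


def rank_terms(title_scores, content_scores, top_n=10):
    """Return up to top_n (term, score) pairs, highest combined score first."""
    combined = {}
    for term, score in content_scores.items():
        combined[term] = combined.get(term, 0.0) + score
    for term, score in title_scores.items():
        combined[term] = combined.get(term, 0.0) + TITLE_WEIGHT * score
    # Sort by descending score; break ties alphabetically for stable output.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    # Made-up scores, only to show the call shape.
    title = {"solr": 0.5}
    content = {"solr": 0.2, "lucene": 0.9, "index": 0.4}
    for term, score in rank_terms(title, content):
        print(term, round(score, 3))
```

With these made-up numbers, "solr" ends up first even though its content score is the lowest, because the title match dominates. That is exactly the kind of behavior that still needs to be checked against real pages.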

Anyway, I don’t want to go into too many details here, at least for now. I still need to rework those glued-together parts in a better way.

The demo is here: http://solr.xiehang.com/. Note that it may be taken down at any time without notice, since I don’t want to leave such an easy-to-abuse entry point on my servers.