May 24, 2012
 

OK, if you don’t know what I’m saying, don’t bother to ask.

I’m using MMSEG as the segmentation algorithm to extract Chinese terms from web pages for some follow-up analysis. MMSEG is pretty good: easy to understand, fast, data driven, small, … alright, lots of advantages. However, it does not handle mixed Chinese-English words properly, because it just pulls out each English (Latin) word and returns it as a separate token. As a result, you can never extract words like U盘 or T恤 (sorry for the Chinese here; the first word means USB disk, and the second means T-shirt).

A colleague suggested treating a Latin word as if it were a single Chinese character. I found this solution really simple, neat, and easy to implement. Within ~30 minutes I finished the code change and the unit tests, and it works just fine :). In the coming days I need to handle words like “C++”, which the current implementation does not cover.

People say Yahoo does not handle C++/C# properly, while Bing does not deal with U盘 correctly. This actually sounds weird to me, as supposedly they both use the same web search system …