{"id":1346,"date":"2012-05-24T23:38:06","date_gmt":"2012-05-25T06:38:06","guid":{"rendered":"http:\/\/xiehang.com\/blog\/?p=1346"},"modified":"2012-05-24T23:38:06","modified_gmt":"2012-05-25T06:38:06","slug":"chinese-english-mixed-word-in-segmentation","status":"publish","type":"post","link":"https:\/\/xiehang.com\/blog\/2012\/05\/24\/chinese-english-mixed-word-in-segmentation\/","title":{"rendered":"Chinese-English-mixed word in segmentation"},"content":{"rendered":"

OK, if you don’t know what I’m saying, don’t bother to ask.<\/p>\n

I’m using MMSEG as the segmentation algorithm to extract Chinese terms from web pages, and then doing some follow-up analysis. MMSEG is pretty good: easy to understand, fast, data driven, small, … alright, lots of advantages. However, it does not handle Chinese-English-mixed words properly, as it just takes out English (Latin) words and returns each as its own token, so you can never extract words like U\u76d8 or T\u6064 (sorry for the Chinese here; the first word is USB disk, and the second is T-shirt).<\/p>\n

I got a suggestion from a colleague to treat each Latin word as a Chinese character. I found this solution really simple, neat, and easy to implement. Within ~30 minutes I had finished the code change and unit tests, and it works just fine :). In the coming days I need to handle words like “C++”, which are not covered by the current implementation.<\/p>\n
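A minimal sketch of the idea in Python, assuming plain forward maximum matching in place of the full MMSEG rules. The names to_units, segment and DICTIONARY, and the toy dictionary itself, are hypothetical and not taken from the real implementation.

```python
# Toy dictionary (an assumption for illustration): U-disk and T-shirt.
DICTIONARY = {'U\u76d8', 'T\u6064'}


def to_units(text):
    # Each non-ASCII character is one unit; each run of ASCII letters or
    # digits is grouped into ONE unit, so a Latin word behaves like a
    # single Chinese character during matching. Whitespace is dropped.
    units, run = [], ''
    for ch in text:
        if ch.isascii() and ch.isalnum():
            run += ch
            continue
        if run:
            units.append(run)
            run = ''
        if not ch.isspace():
            units.append(ch)
    if run:
        units.append(run)
    return units


def segment(text, dictionary, max_units=4):
    # Simple forward maximum matching over units (NOT the full MMSEG
    # chunk rules): take the longest dictionary word starting at i,
    # falling back to a single unit when nothing matches.
    units = to_units(text)
    tokens, i = [], 0
    while i < len(units):
        for n in range(min(max_units, len(units) - i), 0, -1):
            candidate = ''.join(units[i:i + n])
            if n == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += n
                break
    return tokens
```

With this toy dictionary, segment('\u4e70U\u76d8\u548cT\u6064', DICTIONARY) keeps U\u76d8 and T\u6064 as whole tokens instead of splitting off the Latin letter.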

People are saying Yahoo is not handling C++\/C# properly, while Bing is not dealing with U\u76d8 right. Actually, this sounds weird to me, as supposedly they both use the same web search system …<\/p>\n","protected":false},"excerpt":{"rendered":"

OK, if you don’t know what I’m saying, don’t bother to ask. I’m using MMSEG as the segmentation algorithm to extract Chinese terms from web pages, and then doing some follow-up analysis. MMSEG is pretty good: easy to understand, fast, data driven, small, … alright, lots of advantages. However, it does not handle Chinese-English-mixed word […]<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[341,343,340,342],"_links":{"self":[{"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/posts\/1346"}],"collection":[{"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/comments?post=1346"}],"version-history":[{"count":1,"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/posts\/1346\/revisions"}],"predecessor-version":[{"id":1347,"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/posts\/1346\/revisions\/1347"}],"wp:attachment":[{"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/media?parent=1346"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/categories?post=1346"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/tags?post=1346"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}