Oct 31, 2012

I need to rewrite the function that counts Chinese words in UTF-8 text once I have time, and I will need to refer to this page, which lists the UTF-8 characters.

In short, here are the Chinese-related Unicode blocks:

  • U+2E80 … U+2EFF: CJK Radicals Supplement
  • U+3000 … U+303F: CJK Symbols and Punctuation
  • U+31C0 … U+31EF: CJK Strokes
  • U+3200 … U+32FF: Enclosed CJK Letters and Months
  • U+3300 … U+33FF: CJK Compatibility
  • U+3400 … U+4DBF: CJK Unified Ideographs Extension A
  • U+4E00 … U+9FFF: CJK Unified Ideographs
  • U+F900 … U+FAFF: CJK Compatibility Ideographs
  • U+FE30 … U+FE4F: CJK Compatibility Forms
  • U+20000 … U+2A6DF: CJK Unified Ideographs Extension B
  • U+2A700 … U+2B73F: CJK Unified Ideographs Extension C
  • U+2B740 … U+2B81F: CJK Unified Ideographs Extension D
  • U+2F800 … U+2FA1F: CJK Compatibility Ideographs Supplement

For the remaining blocks … let’s simply treat them as Western characters, even though they are not.
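
As a rough sketch of how the rewritten counter might look, the Python below copies the ranges from the list above and checks each decoded code point against them. The names `CJK_RANGES`, `is_cjk` and `count_words` are made up for illustration, and the counting rule is an assumption rather than the final design: every character in a listed block counts as one word, and any whitespace-separated run of other characters counts as one word.

```python
# Inclusive code point ranges, copied from the block list above.
CJK_RANGES = [
    (0x2E80, 0x2EFF),    # CJK Radicals Supplement
    (0x3000, 0x303F),    # CJK Symbols and Punctuation
    (0x31C0, 0x31EF),    # CJK Strokes
    (0x3200, 0x32FF),    # Enclosed CJK Letters and Months
    (0x3300, 0x33FF),    # CJK Compatibility
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0xFE30, 0xFE4F),    # CJK Compatibility Forms
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
    (0x2A700, 0x2B73F),  # CJK Unified Ideographs Extension C
    (0x2B740, 0x2B81F),  # CJK Unified Ideographs Extension D
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
]


def is_cjk(ch: str) -> bool:
    """Return True if the single character falls in one of the listed blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def count_words(data: bytes) -> int:
    """Count words in UTF-8 encoded bytes (assumed rule, see the text above)."""
    text = data.decode("utf-8")
    count = 0
    in_western_word = False  # True while inside a run of non-CJK, non-space chars
    for ch in text:
        if is_cjk(ch):
            # Assumption: each CJK character is one word by itself.
            count += 1
            in_western_word = False
        elif ch.isspace():
            in_western_word = False
        elif not in_western_word:
            # First character of a new "Western" run starts a new word.
            count += 1
            in_western_word = True
    return count
```

Note that under this rule punctuation in the CJK Symbols and Punctuation block (including the ideographic space U+3000) is counted as a word too, since it is in the list; a real implementation may want to exclude that block from the count.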

