I need to rewrite the function that counts Chinese words in UTF-8 text once I have time, and I should refer to this page, which lists UTF-8 characters.
In short, here are the Chinese-related Unicode blocks:
- U+2E80 … U+2EFF: CJK Radicals Supplement
- U+3000 … U+303F: CJK Symbols and Punctuation
- U+31C0 … U+31EF: CJK Strokes
- U+3200 … U+32FF: Enclosed CJK Letters and Months
- U+3300 … U+33FF: CJK Compatibility
- U+3400 … U+4DBF: CJK Unified Ideographs Extension A
- U+4E00 … U+9FFF: CJK Unified Ideographs
- U+F900 … U+FAFF: CJK Compatibility Ideographs
- U+FE30 … U+FE4F: CJK Compatibility Forms
- U+20000 … U+2A6DF: CJK Unified Ideographs Extension B
- U+2A700 … U+2B73F: CJK Unified Ideographs Extension C
- U+2B740 … U+2B81F: CJK Unified Ideographs Extension D
- U+2F800 … U+2FA1F: CJK Compatibility Ideographs Supplement
For the remaining blocks … let's simply treat them as Western characters, even though they are not.
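As a rough sketch of how the rewritten function might use these ranges, the Python below checks each decoded code point against the blocks listed above. The names `CJK_RANGES`, `is_cjk`, `count_cjk` and `count_cjk_utf8` are made up for illustration, and it assumes that every character in a listed block counts as one Chinese "word" (so punctuation from the CJK Symbols and Punctuation block is counted too). Everything else is treated as Western and ignored here.

```python
# Inclusive code point ranges, copied from the block list above.
CJK_RANGES = [
    (0x2E80, 0x2EFF),    # CJK Radicals Supplement
    (0x3000, 0x303F),    # CJK Symbols and Punctuation
    (0x31C0, 0x31EF),    # CJK Strokes
    (0x3200, 0x32FF),    # Enclosed CJK Letters and Months
    (0x3300, 0x33FF),    # CJK Compatibility
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0xFE30, 0xFE4F),    # CJK Compatibility Forms
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
    (0x2A700, 0x2B73F),  # CJK Unified Ideographs Extension C
    (0x2B740, 0x2B81F),  # CJK Unified Ideographs Extension D
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
]


def is_cjk(ch: str) -> bool:
    """Return True if the single character falls in one of the listed blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def count_cjk(text: str) -> int:
    """Count characters of an already-decoded string that fall in the listed blocks."""
    return sum(1 for ch in text if is_cjk(ch))


def count_cjk_utf8(data: bytes) -> int:
    """Decode UTF-8 bytes first, then count; raises UnicodeDecodeError on invalid input."""
    return count_cjk(data.decode("utf-8"))
```

Decoding before counting means the multi-byte UTF-8 sequences never have to be handled by hand: `count_cjk_utf8("中文 text".encode("utf-8"))` sees two characters from the CJK Unified Ideographs block and skips the rest.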