Unicode block names in regular expressions

I often find myself wanting to do some simple language detection. For Chinese, Japanese, and Korean, this can easily be done by looking at the types of characters in the text. The simplest and most robust way to do this is to use Unicode block names: it takes only a short regular expression to test whether a character belongs to a given block.
For a list of all the available blocks, see:
Unicode block names for use in XSD regular expressions

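Here are some very simple examples for detecting katakana, hiragana, and kanji. (Strictly speaking, Perl treats \p{Katakana}, \p{Hiragana}, and \p{Han} as Unicode script properties rather than blocks, which works just as well for this purpose.)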

robert_felty$ echo "ア" | perl -CIO -nle 'if (/\p{Katakana}/) { print "this contains katakana";}'  
this contains katakana  
robert_felty$ echo "あ" | perl -CIO -nle 'if (/\p{Hiragana}/) { print "this contains hiragana";}'  
this contains hiragana  
robert_felty$ echo "安" | perl -CIO -nle 'if (/\p{Han}/) { print "this contains kanji";}'  
this contains kanji  
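
To go from single-script checks to a rough language guess, the same property tests can be combined. Here is a minimal sketch in Python, assuming the third-party "regex" module is installed. The function name guess_cjk_language and the decision rules (kana suggests Japanese, Hangul suggests Korean, Han alone suggests Chinese) are my own simplifications for illustration, not a robust detector.

```python
import regex  # third-party module (pip install regex); the stdlib "re" lacks \p{...}

# Script properties, using the same names as the Perl one-liners above.
KANA = regex.compile(r"[\p{Hiragana}\p{Katakana}]")
HANGUL = regex.compile(r"\p{Hangul}")
HAN = regex.compile(r"\p{Han}")


def guess_cjk_language(text):
    """Very rough guess based only on which scripts appear in the text.

    Hypothetical helper: the rules below are assumptions, not a real detector.
    """
    if KANA.search(text):
        return "Japanese"  # kana strongly suggests Japanese
    if HANGUL.search(text):
        return "Korean"
    if HAN.search(text):
        return "Chinese"  # Han with no kana or Hangul; could still be Japanese
    return "unknown"


if __name__ == "__main__":
    for sample in ["カタカナ", "ひらがな", "한국어", "中文", "hello"]:
        print(sample, "->", guess_cjk_language(sample))
```

Kana is checked first because Japanese text normally mixes kanji with kana. A short Japanese string written only in kanji would be guessed as Chinese, so treat the result as a hint rather than an answer.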

This style of character class is supported in many languages, including Java and Perl. Note that it is not supported by Python’s default “re” module. There is an alternative module called “regex”, which does support it:
regex 2014.02.19 : Python Package Index
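
As a quick illustration of the difference, the sketch below first checks whether the standard "re" module will compile \p{Han} (in the Python 3 versions I know of, it rejects it as a bad escape) and then runs the same pattern with "regex". It assumes the "regex" package is installed (pip install regex). The helper names stdlib_supports_property and find_han are hypothetical.

```python
import re
import regex  # third-party: pip install regex


def stdlib_supports_property(pattern=r"\p{Han}"):
    """Return True if the stdlib re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Python 3 rejects \p as a bad escape.
        return False


def find_han(text):
    """Return each run of consecutive Han characters, using the regex module."""
    return regex.findall(r"\p{Han}+", text)


if __name__ == "__main__":
    print(stdlib_supports_property())
    print(find_han("東京タワー and 大阪"))
```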

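One final thought: don’t try to hard-code Unicode ranges yourself, like [\x{4E00}-\x{9FBF}]. This is prone to error. It is easy to get an endpoint wrong (the main CJK Unified Ideographs block actually runs to U+9FFF), and Han characters also live in other blocks, such as CJK Unified Ideographs Extension A, that a single range misses.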
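
To see why, here is a small sketch in Python using only the standard library (its "re" module does accept explicit ranges). It checks a few characters that Unicode names as CJK unified ideographs against the hand-written range from above. The sample code points and the helper names are my own picks for illustration.

```python
import re
import unicodedata

# The hand-written range from the text, in Python's \u escape syntax.
HAND_RANGE = re.compile("[\u4E00-\u9FBF]")


def is_cjk_ideograph(ch):
    """True if the character's Unicode name marks it as a CJK unified ideograph."""
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


def missed_by_range(chars):
    """Return the ideographs that the hand-written range fails to match."""
    return [c for c in chars if is_cjk_ideograph(c) and not HAND_RANGE.search(c)]


if __name__ == "__main__":
    samples = ["\u5B89", "\u3400", "\u9FC0", "\U00020000"]
    for ch in missed_by_range(samples):
        print("U+%04X is an ideograph but falls outside the range" % ord(ch))
```

Of the four samples, only U+5B89 (安) falls inside the range. The other three are ideographs that the range silently skips, and a property such as \p{Han} avoids exactly this kind of mistake.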
