Unicode block names in regular expressions

Frequently, I find myself wanting to do some simple language detection. For Chinese, Japanese, and Korean, this can easily be done by looking at the types of characters in some text. The simplest and most robust way to do this is to use Unicode block names. It is very simple to write a regular expression which will test if a character is contained in a certain block.
For the full list of available blocks, see:
Unicode block names for use in XSD regular expressions

Here are some very simple one-liners for detecting katakana, hiragana, and kanji:

robert_felty$ echo "ア" | perl -CIO -nle 'if (/\p{Katakana}/) { print "this contains katakana\n";}'  
this contains katakana  
 
robert_felty$ echo "あ" | perl -CIO -nle 'if (/\p{Hiragana}/) { print "this contains hiragana\n";}'  
this contains hiragana  
 
robert_felty$ echo "安" | perl -CIO -nle 'if (/\p{Han}/) { print "this contains kanji\n";}'  
this contains kanji  

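This style of Unicode property escape is supported in many languages, including Java and Perl. (Strictly speaking, Perl treats \p{Katakana}, \p{Hiragana}, and \p{Han} as script names rather than block names, and Java expects a prefix: Is for scripts, as in \p{IsHan}, and In for blocks.) Note that it is not supported in Python using the default “re” module. There is an alternative module called “regex”, which does support it: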
regex 2014.02.19 : Python Package Index
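
If installing a third-party module is not an option, one rough fallback is to look at each character's Unicode name with the standard library. This is a different technique from the property escapes above, not an equivalent, and the helper name `scripts_in` and the name prefixes it checks are my own assumptions for illustration:

```python
import unicodedata

# Hypothetical mapping from Unicode character-name prefixes to labels.
# Note: this misses variants such as halfwidth katakana, whose names
# start with "HALFWIDTH KATAKANA".
NAME_PREFIXES = {
    "KATAKANA": "katakana",
    "HIRAGANA": "hiragana",
    "CJK UNIFIED IDEOGRAPH": "kanji",
}


def scripts_in(text):
    """Return the set of labels whose name prefix matches a character in text."""
    found = set()
    for ch in text:
        # unicodedata.name() returns the default ("") for unnamed characters.
        name = unicodedata.name(ch, "")
        for prefix, label in NAME_PREFIXES.items():
            if name.startswith(prefix):
                found.add(label)
    return found


print(scripts_in("ア"))
print(scripts_in("あ安"))
```

This is only a sketch for quick checks; for anything serious the property escapes are the more robust choice.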

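One final thought: don’t try to use hand-written code point ranges, like [\x{4E00}-\x{9FBF}]. This is prone to error. That range, for example, stops short of the end of the CJK Unified Ideographs block (U+9FFF) and misses the extension blocks entirely.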
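
To make the pitfall concrete, here is a small sketch in Python (chosen only because the standard “re” module handles plain code point ranges fine). The two test characters are my own picks: U+9FC3 lies inside the CJK Unified Ideographs block (U+4E00 to U+9FFF) but past the end of the hand-written range, and U+3400 is the first code point of CJK Extension A.

```python
import re

# The hand-written range from the text, expressed as a Python character class.
hand_written = re.compile("[\u4E00-\u9FBF]")

# A common kanji inside the range is found.
in_range = bool(hand_written.search("安"))          # U+5B89

# Inside the CJK Unified Ideographs block, but after U+9FBF.
late_in_block = bool(hand_written.search("\u9FC3"))

# First code point of CJK Unified Ideographs Extension A.
extension_a = bool(hand_written.search("\u3400"))

print(in_range, late_in_block, extension_a)  # True False False
```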

Posted in bash, java, perl, python, regex

Monkey patching in python

I was just reading an article about Martijn Pieters, who is a Python expert, and he mentioned monkey patching.

I did not know what monkey patching was, so I googled it and found a great answer on Stack Overflow.

Basically, it takes advantage of Python’s class access philosophy. Unlike Java, which enforces a strict access policy, Python lets you reassign any attribute or method of a class at runtime. So it is possible to write code like this:

from SomeOtherProduct.SomeModule import SomeClass  
 
def speak(self):  
   return "ook ook eee eee eee!"  
 
SomeClass.speak = speak

This could be particularly useful for unit testing, which is also mentioned in the Stack Overflow answer:

For instance, consider a class that has a method get_data. This method does an external lookup (on a database or web API, for example), and various other methods in the class call it. However, in a unit test, you don’t want to depend on the external data source, so you dynamically replace the get_data method with a stub that returns some fixed data.
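
A minimal sketch of that idea might look like the following. The class `WeatherReport`, its `summary` method, and the stub data are all hypothetical names made up for illustration; only the `get_data` name comes from the quoted answer.

```python
class WeatherReport:
    """Hypothetical class whose get_data would normally hit an external service."""

    def get_data(self):
        raise RuntimeError("external lookup is not available in tests")

    def summary(self):
        # Other methods call get_data, so they inherit its external dependency.
        data = self.get_data()
        return f"{data['city']}: {data['temp_c']} C"


def fake_get_data(self):
    """Stub that returns fixed data instead of doing the external lookup."""
    return {"city": "Testville", "temp_c": 21}


# Monkey patch: swap the method on the class, run the code under test,
# then restore the original so other tests are not affected.
original_get_data = WeatherReport.get_data
WeatherReport.get_data = fake_get_data
try:
    result = WeatherReport().summary()
finally:
    WeatherReport.get_data = original_get_data

print(result)  # Testville: 21 C
```

In real test suites, `unittest.mock.patch.object` from the standard library does the same swap and restore for you.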

Posted in python

Java anchored regex

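I just discovered this today when doing some regex in Java. When I first started doing regex in Java, I was surprised to learn that String.matches() treats every regular expression as anchored: the pattern has to match the entire string. That is, if you have a string foobar and test it against “foo”, it will not match. This is different from grep, Perl, and other tools, which look for a match anywhere in the input. (Java can do that too, with Matcher.find(); the implicit anchoring comes from matches().) In other words, for matches(), the following regexes are equivalent: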

"foo"  
"^foo$"

If you want to find foo within foobar with matches(), you need to use

".*foo.*"

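I discovered one more interesting tidbit. If you put explicit anchors in, optional leading and trailing parts of the regex outside those anchors have no effect, because they can only match the empty string: nothing comes before the start or after the end of the string.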

Here are some examples:

// some tests which illustrate implicit anchoring  
"foobar".matches("foo"); //false - rewrite = "^foo$"  
"foobar".matches("bar"); //false - rewrite = "^bar$"  
"foobar".matches("foo.*"); //true - rewrite = "^foo.*$"  
"foobar".matches("bar.*"); //false - rewrite = "^bar.*$"  
"foobar".matches(".*foo.*"); //true - rewrite = "^.*foo.*$"  
"foobar".matches(".*bar.*"); //true - rewrite = "^.*bar.*$"  
"foobar".matches(".*oo.*"); //true - rewrite = "^.*oo.*$"  
// now some tests with optional characters before or after explicit anchors  
// optional characters before or after initial/final anchors have no effect  
"foobar".matches(".*^foo"); //false - rewrite = "^foo$"  
"foobar".matches(".*^foo.*"); //true - rewrite = "^foo.*$"  
"foobar".matches(".*^foo$.*"); //false - rewrite = "^foo$"  
"foobar".matches(".*^foobar$.*"); //true - rewrite = "^foobar$"  
"foobar".matches("[a-z]*^foobar$.*"); //true - rewrite = "^foobar$"  
"foobar".matches(".+^foobar$.*"); //false can't match a character before the beginning of the string
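
For comparison, Python’s “re” module makes the same distinction explicit with two functions. As a rough analogy (not Java code), `re.fullmatch` behaves like the `matches()` calls above, and `re.search` looks for a match anywhere in the string:

```python
import re

text = "foobar"

# fullmatch requires the whole string to match, like the Java examples above.
full_foo = re.fullmatch("foo", text) is not None            # False
full_wild = re.fullmatch(".*foo.*", text) is not None       # True

# search finds the pattern anywhere in the string.
search_foo = re.search("foo", text) is not None             # True

# A required character before an explicit start anchor can never match.
before_anchor = re.fullmatch(".+^foobar$.*", text) is not None  # False

print(full_foo, full_wild, search_foo, before_anchor)
```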

Posted in java, regex

Solr DataImportHandler preImportDeleteQuery gotcha

One handy feature of the DataImportHandler in Solr is that you can group documents by different entities. In the MKB we have a couple of different kinds of entities we import: songs, albums, tvshows, etc. Sometimes we make a change or improvement to the underlying data of one type of entity, and want to test it out. Instead of reimporting all the data, we can just reimport that one specific entity.

To do this correctly, we need to define a preImportDeleteQuery. When Solr does a dataimport, if you select to “clean” the data, it will remove all the documents before it imports new ones, so you don’t end up with duplicates. By default, Solr will simply delete documents using the query “*:*”, which deletes all documents. That is not what we want to do. We only want to delete documents of the type matching our entity. We have a type field in Solr for every entity. All I have to do is specify a preImportDeleteQuery for an entity in the DataImportHandler configuration file, like so:

 <entity name="sportTeams" dataSource="sports"  
preImportDeleteQuery="type:sportTeam"  
 transformer="RegexTransformer" query="

Today I discovered a gotcha though. I left this off for one of my entities, and when doing a full import of all entities, I was only getting the last entity. This was happening because the default “*:*” delete for that entity was removing all the documents which had just been imported. I was surprised by this, since I thought that the preImportDeleteQuery only applied when importing one entity at a time, not when importing all entities. Apparently I was wrong, so this is the gotcha:
If you use preImportDeleteQuery on any entity in your DataImportHandler configuration file, you should use it for all entities.

Posted in lucene, solr

Pretty printing json

Here is a really simple way to pretty print some unformatted JSON:

$ echo '{"foo": "lorem", "bar": "ipsum"}' | python -mjson.tool  
{  
    "bar": "ipsum",  
    "foo": "lorem"  
}
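
The same thing can be done from inside a Python script with the standard `json` module. This is a small sketch using the same input; `sort_keys=True` is my own choice here, made so that the keys come out in the order shown above:

```python
import json

raw = '{"foo": "lorem", "bar": "ipsum"}'

# Parse the compact string, then serialize it again with indentation.
pretty = json.dumps(json.loads(raw), indent=4, sort_keys=True)
print(pretty)
```

On recent Python 3 versions the command-line tool keeps the input key order by default, so pass `--sort-keys` to `python -m json.tool` if you want the keys sorted.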

Posted in bash, python