Solr DataImportHandler preImportDeleteQuery gotcha

One handy feature of the DataImportHandler in Solr is that you can group documents by entity. In the MKB we import several different kinds of entities: songs, albums, tvshows, etc. Sometimes we change or improve the underlying data for one type of entity and want to test it out. Instead of reimporting all the data, we can reimport just that one entity.

To do this correctly, we need to define a preImportDeleteQuery. When Solr runs a data import with the “clean” option selected, it removes the existing documents before it imports new ones, so you don’t end up with duplicates. By default, Solr deletes using the query “*:*”, which matches every document in the index. That is not what we want: we only want to delete the documents of the type we are reimporting. We have a type field in Solr for every entity, so all I have to do is specify a preImportDeleteQuery for the entity in the DataImportHandler configuration file, like so:

<entity name="sportTeams" dataSource="sports"
        preImportDeleteQuery="type:sportTeam"
        transformer="RegexTransformer" query="

Today I discovered a gotcha, though. I left the preImportDeleteQuery off one of my entities, and when doing a full import of all entities, I ended up with only the last entity's documents. This happened because the import was deleting all the documents that had just been imported: presumably the entity without a preImportDeleteQuery fell back to the default “*:*” delete when its turn came. I was surprised by this, since I thought the preImportDeleteQuery only applied when importing one entity at a time, not when importing all entities. Apparently I was wrong, so this is the gotcha:
If you use preImportDeleteQuery on any entity in your DataImportHandler configuration file, you should use it for all entities.
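
As a rough illustration of that rule, here is a minimal sketch of a configuration in which every top-level entity carries its own preImportDeleteQuery. Only the sportTeams entity name, the sports data source name, and the type:sportTeam query come from the snippet above. The sportPlayers entity, the JDBC connection details, and the table and column names are made up for the example and are not our real configuration.

```xml
<dataConfig>
  <!-- Hypothetical JDBC data source: driver, url and credentials are placeholders. -->
  <dataSource name="sports" type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/sports"
              user="solr" password="secret"/>
  <document>
    <!-- On a clean import, this entity deletes only documents of its own type. -->
    <entity name="sportTeams" dataSource="sports"
            preImportDeleteQuery="type:sportTeam"
            query="SELECT id, name, 'sportTeam' AS type FROM teams"/>

    <!-- Hypothetical second entity: it also needs its own delete query,
         otherwise its clean step would fall back to deleting everything. -->
    <entity name="sportPlayers" dataSource="sports"
            preImportDeleteQuery="type:sportPlayer"
            query="SELECT id, name, 'sportPlayer' AS type FROM players"/>
  </document>
</dataConfig>
```

With a layout like this, each delete query is scoped to one value of the type field, so a clean import of one entity should leave the other entities' documents alone.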
