Lately at Tumblr I have been working on feeds of posts based on the country in which they were published, or the country in which users liked or reblogged a post. I didn’t want to do a simple filter. Instead I wanted a boosting query that boosts localized content but still shows you something when no content has been published from your country. I was pretty satisfied with the feed I was getting while testing on our development Elasticsearch server with a smallish sample of data. Then I started testing on our production cluster and got totally different results. I used Xdebug with PhpStorm to set some breakpoints and see exactly what Elasticsearch query was being run, then pasted that query into the Kibana console to do some more debugging. The results from the Kibana console seemed very different from what I was seeing in the PHP app. What was going on? I double-checked a bunch of things. One thing I noticed was that the ES response said that it had timed out, but I still got results. I checked against the development cluster again, and it had also timed out, so I thought that was not the issue. Finally I noticed that the PHP app was setting an explicit ?timeout=600ms
parameter, but I was not specifying that in the Kibana console. Once I added it there, boom: same results. So then the question is why? Well, it turns out that it had to do with the type of query I was running. My query basically looked at all posts with no filtering, scored them, and then ranked them. Looking at all posts is always a bad idea. I should have known better, but it seemed to be working, so I was okay with it. The issue was that on the bigger dataset, each Elasticsearch shard would return partial results, ranked according to the query I had defined. But I was looking for a signal that was only present in a small subset of the data (posts from Germany), so by the time ES gave up after 600 milliseconds, the shards had not yet reached any posts from Germany. I solved this by filtering to only recent posts, which meant that my boosting query ran over a much smaller set of posts, was much faster, and thus didn’t suffer from this timeout issue.
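As a rough sketch of that kind of fix (not the actual Tumblr query), the search body below restricts the candidate set with a date range in filter context and boosts posts from one country with a `should` clause. The field names `published_at` and `country`, the seven-day window, and the boost value are all assumptions made up for illustration. The timeout is set in the request body here, which as far as I know behaves the same as the `?timeout=600ms` URL parameter.

```python
import json


def build_feed_query(country_code, since="now-7d", boost=5.0, timeout="600ms"):
    """Build a search body that boosts posts from one country.

    `published_at` and `country` are hypothetical field names.
    """
    return {
        # Per-shard time budget, same value the PHP app was sending.
        "timeout": timeout,
        "query": {
            "bool": {
                # Filter context: shrinks the set of posts to score.
                "filter": [
                    {"range": {"published_at": {"gte": since}}}
                ],
                # Optional clause: matching posts score higher, but posts
                # from other countries still pass the filter and show up.
                "should": [
                    {"term": {"country": {"value": country_code, "boost": boost}}}
                ],
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_feed_query("DE"), indent=2))
```

Because the `bool` query has a `filter` clause, the `should` clause is optional by default, so a reader in a country with no recent posts still gets a feed, just without the boost.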
I have been using Elasticsearch on and off for six years now, but occasionally this sort of behavior is still kind of confusing. The notion of “timeout” in Elasticsearch is very different from what it means in most other systems. With most other systems, if you send a request to an API and say to wait only 600ms, and the process isn’t finished when that time has elapsed, you get no results at all. But because of the distributed nature of Elasticsearch, this timeout is actually applied at the shard level. Each shard that is searching will search for up to 600ms, during which time it may or may not have found some documents that match the query you defined. If it found some documents, it will return them as partial results.
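Since a timed-out search still comes back with hits, it can help to check the response explicitly instead of assuming the results are complete. The sketch below is a minimal helper that looks at the `timed_out` flag and the `_shards` section of a search response that has already been parsed into a dict. The helper name and the shape of its return value are made up for illustration.

```python
def describe_search_response(response):
    """Summarize whether a parsed search response may be partial."""
    timed_out = bool(response.get("timed_out", False))
    shards = response.get("_shards", {})
    failed = shards.get("failed", 0)
    hits = response.get("hits", {}).get("hits", [])
    return {
        # Hits can be present even when the search did not finish.
        "partial": timed_out or failed > 0,
        "timed_out": timed_out,
        "failed_shards": failed,
        "hit_count": len(hits),
    }


if __name__ == "__main__":
    # Hand-written sample response, not real cluster output.
    sample = {
        "took": 601,
        "timed_out": True,
        "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
        "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]},
    }
    print(describe_search_response(sample))
```

Logging a warning whenever `partial` is true would have made the difference between the PHP app and the Kibana console visible much earlier.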