Dealing with flaky downstream services: simple can be effective

One of the work items that really plagued me was a chargeback service. The story is – our team needs to maintain a record of how many queries we service for each of our customers, so that at the end of each month the data can be used to generate imaginary bills (not real bills that cost anyone money, just a way to track how resources within the company are used).

The inner workings of the chargeback service are simple: we set up a daily cron job that, after each day ends, issues a few queries to the monitoring service asking about the previous day's query usage and gets back a large piece of text as the response. The response is then parsed, grouped, and sorted, and the final results are written to databases owned by the billing team – that's the whole workflow. It sounds so straightforward that it's almost a set-it-and-forget-it thing.
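For concreteness, here is a minimal sketch of that parse-group-sort step in Python. The response format, field names, and function name are all assumptions for illustration; the real monitoring service and billing schema look different.

```python
from collections import defaultdict

def summarize_usage(raw_text: str) -> list[tuple[str, int]]:
    """Parse the monitoring response and aggregate query counts per customer.

    Assumes a hypothetical line-based format: "<customer> <query_count>".
    """
    per_customer: dict[str, int] = defaultdict(int)
    for line in raw_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue                      # skip empty or malformed lines
        customer, count = parts
        per_customer[customer] += int(count)

    # Sort by customer name so the billing rows come out in a stable order.
    return sorted(per_customer.items())

if __name__ == "__main__":
    sample = "acme 120\nglobex 45\nacme 30\n"
    print(summarize_usage(sample))        # [('acme', 150), ('globex', 45)]
```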

But after about half a year without major problems, the chargeback service started to fall apart. The billing team began pinging me about bad data, 3 or 4 times a week. Most of the issues traced back to the flaky monitoring service downstream of ours – starting at some point I couldn't pin down, it quite often did not return meaningful data to the cron job's request, so the job thought there was nothing to do and wrote nothing to the billing team's database. Re-running the chargeback job manually thus became my daily routine. Although it took only a few minutes each time, it was repetitive and frustrating.

I was sure no human could bear with this, so I tried to fix it. The first thing that came to mind was scheduling the cron job to run at a different time of day, on the assumption that the monitoring service was under heavy load during certain hours and rejecting queries then. So I moved the cron job from 1:00 AM UTC to 4:00 AM UTC. It mitigated the problem a little but did not cure it. Later on, I learned that the monitoring service could be flaky at any time of the day.

Eventually, I realized there had to be some mechanism for the cron job to tell whether the monitoring service was returning a meaningful response. Interestingly enough, when the monitoring service misbehaved it did not throw exceptions or send warning signals; it merely returned a text response that contained almost no data. Making use of this characteristic, I modified the cron job to check whether the response contains the data fields we care about, and if not, to sleep for an hour and retry.
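A rough sketch of that check-and-retry loop, assuming a hypothetical `fetch` callable and the same made-up line format as in the earlier sketch:

```python
import time

def looks_meaningful(raw_text: str) -> bool:
    """Hypothetical check: do the data fields we care about actually appear?

    Here we just require at least one well-formed "<customer> <query_count>"
    line; a flaky response comes back as a nearly empty text blob.
    """
    for line in raw_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            return True
    return False

def fetch_with_retry(fetch, max_attempts: int = 6, wait_seconds: int = 3600) -> str:
    """Call `fetch` (a hypothetical callable returning the raw text response)
    until the response looks meaningful, sleeping an hour between attempts."""
    for attempt in range(max_attempts):
        raw_text = fetch()
        if looks_meaningful(raw_text):
            return raw_text
        if attempt < max_attempts - 1:
            time.sleep(wait_seconds)      # wait an hour, then try again
    raise RuntimeError("monitoring service never returned usable data")
```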

This fix seemed effective, until one day I got paged by the billing team again. I checked the data and found that although there was something there, it was only half of the normal daily volume. It turned out that another symptom of the monitoring service being flaky was returning only a partial dataset, which meant my cron job not only needed to distinguish between "there is data" and "there is no data", it also needed to judge whether the data looked reasonable. Fortunately, in our company's scenario there is a viable way to do this – pull the previous day's data and compare it with today's; if the discrepancy is too large (say, a 30% increase or decrease), treat it as a failure, wait an hour, and retry.
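Here is a small sketch of that sanity check. The 30% threshold is the one mentioned above; the totals and example numbers are made up for illustration.

```python
def volume_looks_sane(today_total: int, yesterday_total: int,
                      max_change: float = 0.30) -> bool:
    """Compare today's total query count against yesterday's.

    Returns False when the day-over-day change exceeds `max_change`
    (30% by default), which we treat as a sign of a partial dataset.
    """
    if yesterday_total == 0:
        return today_total == 0           # nothing to compare against
    change = abs(today_total - yesterday_total) / yesterday_total
    return change <= max_change

# Example: yesterday 10,000 queries, today only 5,200 – a 48% drop, so the
# job would sleep for an hour and re-query instead of writing billing rows.
print(volume_looks_sane(5200, 10000))     # False
print(volume_looks_sane(9700, 10000))     # True
```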

I made the fix and deployed the code, and it has run quite well ever since. It's been 3 months now and I have not been paged about it once. Sometimes a simple solution can be very effective, as long as it addresses your scenario well.
