In the public cloud, throttling, or rate limiting, is ubiquitous: if you run a multi-tenant service, you have to make sure no single tenant can abruptly claim so many resources that the experience of other tenants suffers. However, throttling in a distributed context is inherently hard. To obtain accurate, reliable real-time concurrency or requests-per-second (RPS) numbers, you need to offload the tracking to a centralized, standalone service, which is both hard to implement and tricky to get right. Many systems therefore go live without a centralized traffic-tracking system, yet still manage to provide a throttling mechanism. In this article I'll walk through some intuitive approaches that achieve this goal.
The first idea that comes to mind is to let individual servers handle throttling based on their share of the concurrency/RPS allocated to a particular tenant. For example, if you have a fleet of 10 servers and the allocated RPS for a tenant is 100, then, assuming the load is shared perfectly evenly among the servers, each server can assume it should handle no more than 100 / 10 = 10 RPS for that tenant.
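To make this concrete, here is a minimal sketch of such a local throttler, assuming a sliding one-second window; the class, function, and parameter names are illustrative, not taken from any particular system.

```python
import time
from collections import defaultdict, deque

class LocalThrottler:
    """Per-server throttling: each server assumes an equal share of the tenant limit."""

    def __init__(self, tenant_rps_limit: float, fleet_size: int):
        # Each server owns tenant_limit / fleet_size of the tenant's RPS budget.
        self.per_server_limit = tenant_rps_limit / fleet_size
        self.requests = defaultdict(deque)  # tenant_id -> request timestamps

    def allow(self, tenant_id: str) -> bool:
        now = time.monotonic()
        window = self.requests[tenant_id]
        # Drop timestamps older than one second (sliding one-second window).
        while window and now - window[0] > 1.0:
            window.popleft()
        if len(window) < self.per_server_limit:
            window.append(now)
            return True
        return False

# With a fleet of 10 and a tenant limit of 100 RPS, each server admits about 10 RPS.
throttler = LocalThrottler(tenant_rps_limit=100, fleet_size=10)
```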
However, the tricky part of the above approach is "assuming the load is shared perfectly evenly among the servers". It requires a load balancer that round-robins really well, which is often not the case. In addition, the enforced RPS limit is a function of the fleet size, which means servers must learn about fleet-size changes in a timely fashion, and that adds implementation complexity. The problem gets even worse with very large fleets, which are common for giant cloud providers. Say you have a fleet of 1000 servers but the RPS limit for a particular tenant is just 100: each server is expected to handle only 0.1 RPS, and a load balancer that does not round-robin well can seriously hurt throttling accuracy.
A natural improvement over local throttling is to find a way to direct all traffic for a tenant to just one server, which is what "consistent hashing" is about. In this scheme, upon receiving a request, a server first figures out which server is the designated server according to the consistent hashing algorithm, and forwards the request to it. We no longer need to worry about imperfect round-robining, because all requests from the same tenant are ultimately processed by the same server. The designated server's local view of the tenant's traffic is therefore the global view, and throttling becomes easy.
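Here is a rough sketch of how any server could compute the designated server with a consistent-hash ring; the hash function, virtual-node count, and server names are illustrative assumptions rather than a prescribed implementation.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, vnodes=100):
        # Place several virtual nodes per server to smooth the distribution.
        self._ring = sorted((_hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes))
        self._keys = [h for h, _ in self._ring]

    def designated_server(self, tenant_id: str) -> str:
        # Walk clockwise to the first virtual node at or after the tenant's hash.
        idx = bisect.bisect(self._keys, _hash(tenant_id)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing([f"server-{n}" for n in range(10)])
# Every server computes the same answer, so whoever receives the request
# can forward it to the designated server without coordination.
print(ring.designated_server("tenant-42"))
```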
Needless to say, any reader with production experience on large distributed systems knows that "consistent hashing" can suffer from "hot shard" problems, whether due to a poorly designed hashing algorithm or simply organic traffic growth. Some servers receive more traffic than others and become less performant, hurting customer experience. To tackle this problem, we can add capacity to "shards". In plain consistent hashing, each "shard" essentially contains only one server; if we let each shard have two servers, its processing power doubles. But this comes at a cost: if we provision shard capacity based on the traffic volume of the hottest shard, we waste resources on the not-so-hot shards.
An improvement over "consistent hashing" is "rendezvous hashing". The basic idea is to select K servers (rather than just one) to handle a tenant's traffic, where K can be adjusted dynamically in response to the tenant's real-time traffic. Once the K servers are selected, we can narrow the choice down to one server, either by choosing randomly or by round-robining (there it is again). Then how do we enforce throttling: should the ultimate request processor do the job, or the selector? The request processor is apparently the better fit, as the number of processors (K) is usually far smaller than the number of selectors (the entire fleet). Note that the problem of "how to make a local view of traffic reflect the true global view" now comes back.
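A minimal sketch of rendezvous (highest-random-weight) selection could look like the following; the scoring function and fleet names are assumptions made for illustration. Each server ranks the whole fleet by a per-tenant score and takes the top K, so every selector arrives at the same K processors without any coordination.

```python
import hashlib
import random

def _score(server: str, tenant_id: str) -> int:
    # Highest-random-weight score for this (server, tenant) pair.
    return int(hashlib.md5(f"{server}:{tenant_id}".encode()).hexdigest(), 16)

def select_processors(servers, tenant_id: str, k: int):
    # Every caller computes the same ranking, so the K processors are stable
    # for a tenant until K or the fleet changes.
    return sorted(servers, key=lambda s: _score(s, tenant_id), reverse=True)[:k]

fleet = [f"server-{n}" for n in range(1000)]
processors = select_processors(fleet, "tenant-42", k=4)
target = random.choice(processors)  # or round-robin through the K processors
```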
Let's discuss in a bit more detail how the K processor servers can do the throttling. One method is to preset an RPS limit per server per tenant, essentially declaring in the initial setup: "as a server, you are only allowed to process this many RPS per tenant." Say the tenant is allowed to send 400 RPS and the per-server-per-tenant limit is 100; then the selector knows it should use K = 400 / 100 = 4 and round-robin through the 4 processors. This approach is simple and effective, but has the limitation that the tenant RPS limit can only be set in increments of the per-server-per-tenant limit: a tenant RPS limit of 80, for example, is not achievable in the setup above. Another method is to let the selector server propagate the per-server-per-tenant RPS limit to the request processor servers. Continuing the example, if the tenant RPS limit is 320 and K = 4, each request processor server is told to handle at most 80 RPS for the tenant. This, of course, adds considerable complexity to the system.
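The two sizing methods boil down to a few lines of arithmetic; the constant and function names below are hypothetical, with the numbers taken from the examples above.

```python
import math

PER_SERVER_LIMIT = 100  # preset "RPS limit per server per tenant" (method 1)

def k_for_fixed_per_server_limit(tenant_rps_limit: float) -> int:
    # Method 1: the per-server limit is fixed, so K follows from the tenant limit.
    # Tenant limits are effectively enforceable only in multiples of PER_SERVER_LIMIT.
    return max(1, math.ceil(tenant_rps_limit / PER_SERVER_LIMIT))

def per_server_limit_for_fixed_k(tenant_rps_limit: float, k: int) -> float:
    # Method 2: K is chosen first, and each processor is told its share,
    # which requires propagating this value to the processor servers.
    return tenant_rps_limit / k

print(k_for_fixed_per_server_limit(400))       # 4 processors at 100 RPS each
print(per_server_limit_for_fixed_k(320, k=4))  # 80.0 RPS per processor
```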
Up to now we have a seemingly decent throttling mechanism, but it is far from perfect. There are many issues that may be specific to your system and that we have omitted from the discussion. For example, with "rendezvous hashing" each tenant essentially gets its own randomly chosen K processor servers, and these sets can of course overlap. If some servers happen to serve more heavy tenants than others, we still have a "hot server" issue, and it can be hard to mitigate because rendezvous hashing is by nature a hands-off selection, so human-driven heat management is not viable. Sharp readers may have noticed that even though this article is about throttling in distributed systems, we have spent a fair amount of space on other related issues, such as heat management. In the distributed world, there is hardly any problem you can solve in isolation, nor are there any ultimately perfect solutions: the best approach is always to understand your requirements and weigh the trade-offs carefully.