Having worked at two storage teams, I found it very interesting to go through the two different storage architectures in an article. Warning: this article only gives you a little bit of 30,000 feet super high-level overview and barely scratches the surface of the two very complicated systems.
The first team is the graph storage team of a company that is famous for its endless news-worthy content. One of the most notable graphs I worked on is the social graph – if you think of every user as a vertex in this graph, edges are the following (and followed by) relationships. There are some interesting characteristics that make this social graph topology and read/write pattern fundamentally different compared to its peer comparatives.
The vertex degree can be extremely high. It’s already rare for some user to have 1000 connections on a friend-orientated platform, or 1000 connections on some professional-orientated platform. But it’s pretty common for celebrities to have tens of millions of followers on this platform.
This social graph is read-heavy and write-light-weight, because every time you publish something all of your followers need to be read for content delivery (a jaw-dropping 10 million read QPS was what I observed as of 2016), while writes only happen when you follow/unfollow some user. Reads not only have high QPS, but also need to be lightning fast because there would be a ton of content deliveries if some celebrity publishes (or celebrities publish at the same time).
Even though it’s technically a “graph”, it is not a conventional graph that you would imagine. Take graph traversal for example: on friend-orientated platform, it would be a quite common ask to get all friends of your friends so the user-engagement team can recommend new friends to you; on professional-orientated platforms, it would be interesting to compute how many connections hops are between you and someone so you can ask for potential references when you want to connect with some important user. But here, we don’t really care who are the followers of all followers of the presidents of United States (We are still interested in who are following popular Singer A and popular Singer B simultaneously though).
MySQL is used to store the social graph, at least for many years until I left (We can temporarily hold off the discussion on why use a row-based relational database to store a social graph). As you would expect it is a distributed cluster that spans more than one data-center, with sharding to increase throughput and data replication to increase durability/availability. If we simplify case to one data-center, any data is stored in four replicas.
The above figure shows a simplified piece of the storage system, there is a so-called app server tier sitting between client and database. Any app server has the same functionality but not “entirely” stateless. Based on some load balance metrics, the client chooses a random app server to use. For reads, the app server reads from one random database host among the four that contains the data (app server has the cache for topology) and returns the data to the client. For writes, the app server persists the request to its local disk-backed log and acknowledge immediate to the client. At this time point, client thinks the write is successful but actually it is not yet. On the other end, the requests in the log are serviced one by one and written to all four databases. If some or all databases are down during the moment, the app server will put the write request into the queue again.
This scheme is simple and effective, notably bringing very high write availability – no matter how many data copies are there at the backend it does not affect availability, since the client just needs to deliver writes to one app server. It is the app server’s responsibility to deliver the data to all database replicas later on. The shortcomings are easy to spot – there is no read-after-write consistency, which is acceptable – suppose you click the “follow” button of some user and it takes a little while for his/her latest content to appear on your timeline, no big deal and we call it eventual consistency. In very unfortunate case though, that “write success” acknowledgement that client gets could be a sheer or partial lie – if the app server’s disk crashes before all replicas get the data, the data is either totally lost or they are discrepancies between these replicas. But this is still within acceptable range – suppose that you clicked the button to follow your favorite celebrity. A day later you find that you did not see his/her content so you go to the person’s profile page and click “follow” again. User experiences are impacted but it is not catastrophic in and out.
After some time, I moved on to another storage team at some leading cloud computing company, which provides block storage as a service. Simply put, the work we do is to physically decouple block devices from virtual computing instances, realized with using a storage node cluster and convert disk I/O to network I/O. In a customer’s eye, he is using a traditional block device without feeling any difference, but under the hood the block device is away from the physical machine that his computing instance is running on, and a single logical block device may even be serviced by multiple physical devices. The most common use case is to setup file systems on top of it and deploy mission critical relational databases to the file system.
The famous CAP theorem states, in a partitioned system, consistency and availability cannot be simultaneously be achieved. Yet, we desire availability and consistency both to a very high degree. Availability should not need much explanation, since block devices (local disks) are already relatively reliable on a computer and our team advertises that our type of “virtual” block device even has a much lower failure rate that a commodity block device.
Why do all replicas need to be strictly consistent? In my previous graph storage scenario, it is acceptable occasionally (or even permanently!) that when you query if you are following some person, database A replies yes and database B replies no. It is also easily fixable by sending the write again to overwrite the inconsistency. However, for block storage, if there is one byte inconsistency between replica A and replica B, it could mean that you have two different views of a file system! A single byte’s contamination can lead to a Terabyte device to be junk. So, if you cannot guarantee consistency, availability is meaningless.
The figure below shows a simplified snippet of the block storage service architecture, which looks rather trivial at first glance. We store a volume as two replicas (on two storage nodes). Low latency is highly desired so there is no app server like proxy involved. Client talks to the storage node directly. Network connections are a non-negligible overhead and we don’t want the client to jump between two replicas, so we assign (through some means) one replica to be master and one to be slave. Client only talks to master (more accurately, whichever replica the client is talking to has the master status). For reads, master grabs the data on its local disk and return immediately. For writes, master replicate them to slave first, only if that’s a success, master writes to itself and acknowledge the client.
Now consider some failure cases. When slave fails, should master stop serving requests? Apparently nope. So master would “quickly” accept the status quo and act as a “solo” master, which means that the master will no longer seek replication before acknowledge the client. In the meanwhile, the master would try to provision a new slave, copy all the existing and incoming data to it and recover the master-slave topology. When master fails, the client would flip to the slave and the slave would be promoted as master. Likewise, now the master is a “solo” master and it should provision a new slave to recover the master-slave topology as soon as possible.
If you are familiar with distributed storage topic, you might already have found that I used a lot of vague terms without much clarification above. Also, lots of details are omitted. For example, how to update the storage node software? If a storage node is temporarily brought down for software update, from the client’s perspective it is a failing storage node. Does that mean the peer takes over, provisioning should happen and a whole device’s data need to be copied? In the early days that was indeed what life looked like. Then, people thought it should be better to wait for the “failing” node for a while, and if it recovers, no full device copy is required. Master can just send those incremental changes happening when the node was in “failing” stage. We call it “catch-up” recovery. Now needless to say the software needs to be more complicated because we need to keep track of what and where these incremental changes are. Wait, if you think further, what if there is a network partition happening? As illustrated in the figure below, suppose that B was put in some unknown-reason network isolation for a while, and A becomes a solo serving master and at the same time waiting for the “failing” slave to recover. If during this period the A unfortunately enters some unknown-reason network isolation and B goes out of network isolation, the client would flip to B thinking it contains the most up-to-date data, but actually it does not! This is essentially a split-brain scenario and introduced by that wait-for-failing-peer-to-recover mechanism. Now we need extra mechanism to prevent this kind of risk introduced by a performance improvement change!
At its infancy, the storage node software is trivial and the system architecture is much simpler. Over the years many features are added to improve efficiency, but they also introduce lots of complications or even “bugs”. A lot of such “bugs” are hard to find during development stage because they only happen when a series of coincidental failures chained sequentially (but they do happen given the scale we operate). As of January 2018, the storage node software contains a staggering 300,000 lines of C++ code, and it is safe to say that a great portion of it is dealing with very rare scenarios. Today, when an engineer submits code change to improve the system architecture, even senior and experienced engineers can hardly ascertain if it is a safe change. Extra validation tools are needed to simulate tens of millions of combinational scenarios to make sure all pre-set consistency constraints are still met. I would say that the root cause for this situation is that the team wants to push both availability and consistency to the extreme.
To summarize, the graph storage aims at achieving unprecedented availability at the cost of consistency; while the block storage views availability and consistency equally important. This lead to very different system architectures. You might conclude that the latter is harder, but that is not entirely (or at all) true, as I will discuss in later articles.