System design is one of those topics that I’ve heard a lot about but don’t yet know a lot about. How do we go from personal projects to businesses and tools that serve thousands or millions of people?
There is an apparent gap in my understanding of scaling applications. I’m seeking to rectify that, at least at a high level. I’ve spent some time this week familiarizing myself with some of the buzzwords — namely horizontal and vertical scaling, load balancing, and caching. Here are my notes :)
Vertical vs. Horizontal scaling
When your server starts receiving more requests than it can handle, you’ll need one of two things: a bigger server, or more servers.
Vertical scaling involves increasing the power of your machine (CPU or RAM) to increase the load that the server is able to handle. There are several benefits to keeping all data on a single server. Firstly, inter-process communication is faster than network communication between servers. Secondly, with this model, there’s no need to worry about inconsistent data across servers.
There are, of course, limits to vertical scaling: your server can only get so big. A single server is also a single point of failure; if that server goes down, your entire application goes down. Horizontal scaling addresses some of these issues. You can always add new machines to handle an ever-increasing load, and in the event that one server crashes, you can redirect requests to a different machine to keep your application running.
An important consideration with horizontal scaling is how to direct requests when there are multiple servers. This is where load balancers come into play. As the name suggests, these “balance” the “load”, distributing incoming requests so that each server receives roughly an equal share.
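To make this concrete, here’s a tiny sketch of the simplest balancing strategy, round robin, in Python (the server names are made up for illustration). Each incoming request is handed to the next server in the rotation, so every server ends up with roughly the same share of traffic.

```python
from collections import Counter
from itertools import cycle

# Rotate through a fixed pool of (hypothetical) servers.
servers = cycle(["server-1", "server-2", "server-3"])

def route(request_id: str) -> str:
    """Hand the request to the next server in the rotation."""
    return next(servers)

# Nine requests split evenly: three per server.
tally = Counter(route(f"req-{i}") for i in range(9))
print(tally)  # each server handled 3 of the 9 requests
```

Real load balancers offer fancier strategies (least connections, weighted round robin, and so on), but the rotation above captures the core idea of spreading load evenly.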
One strategy load balancers can use to decide where a request goes is ‘consistent hashing’. We know the load will be split roughly evenly across however many servers we have, and that number can change (we are talking about scaling, after all). Ideally, though, we’d like a given request, say, from a given user, to be sent to the same server each time. Consistent hashing is a method of distributing requests such that adding an additional server causes as small a disruption as possible in that mapping.
The above pie charts are intended to help explain this. Say you have a load balancer with 4 servers (chart A) and you want to add a fifth. Charts B and C show different methods of doing so. You can shift all of the pieces over to make room for the fifth (chart B), or you can take a small slice from each of the other pieces, which together form the fifth piece (chart C).
The blue shading represents the amount of “disruption” caused by adding that fifth piece of pie. By “disruption” I mean data that was previously hashed to one server but is now being hashed to a different one due to the addition of a new server. Chart C clearly shows less “disruption”, as just a small proportion of each server’s requests are being relocated (as compared to B). This represents consistent hashing.
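The pie-chart intuition can be sketched in code. Below is a minimal consistent-hashing ring (all names are hypothetical): each server is hashed to several positions on a ring, and a key is routed to the first server clockwise from its own hash. Adding a fifth server then remaps only the keys that fall between the new server’s positions and their predecessors, roughly a fifth of them, rather than the majority.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    # md5 is fine here; we need an even spread, not security.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, replicas=100):
        # Each server gets many "virtual nodes" to even out the split.
        self.replicas = replicas
        self.ring = []  # sorted list of (position, server)
        for server in servers:
            self.add_server(server)

    def add_server(self, server):
        for i in range(self.replicas):
            bisect.insort(self.ring, (_hash(f"{server}#{i}"), server))

    def get_server(self, key):
        # First ring position at or after the key's hash, wrapping around.
        pos = bisect.bisect(self.ring, (_hash(key),)) % len(self.ring)
        return self.ring[pos][1]

ring = HashRing(["A", "B", "C", "D"])
keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.get_server(k) for k in keys}

ring.add_server("E")  # scale out to a fifth server
after = {k: ring.get_server(k) for k in keys}

moved = sum(before[k] != after[k] for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # roughly a fifth, not most
```

This is chart C in code: the new server takes a small slice from everyone instead of forcing a wholesale reshuffle.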
We’ve established that consistent hashing, and therefore (mostly) sending a given user’s requests to the same server every time, is optimal. But why is that better? One answer is that it enables caching.
Caching can save time by A) reducing network calls and B) storing the results of expensive calculations so they don’t need to be run multiple times. A load balancer with consistent hashing is useful because if we know that user 1 will always hit server A, we can choose to store user 1’s data on a cache connected to server A.
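Point B is easy to demo, since Python ships a cache in its standard library as `functools.lru_cache` (the function below is a made-up stand-in for something slow). The first call does the expensive work; repeat calls with the same argument return the stored result.

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def expensive_lookup(user_id: int) -> str:
    # Stand-in for a slow database query or network call.
    return f"profile-for-{user_id}"

expensive_lookup(1)  # computed
expensive_lookup(1)  # served from the cache
info = expensive_lookup.cache_info()
print(info.hits, info.misses)  # 1 hit, 1 miss
```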
This sounds pretty great! Maybe we should just cache everything!
Alas, there’s a limit to the effectiveness of a cache, as with anything. Cache memory is fast but scarce, so the more you try to keep, the more expensive the cache becomes, and eventually something has to be thrown out. There’s a balance to be struck with what and how much to cache. A popular design is the Least Recently Used (LRU) cache, which uses a doubly linked list and a hash map under the hood.
Fittingly, an LRU cache makes room by evicting the least recently used data as new data is added.
Diagram 1 shows a cache holding data A-F, ordered by how recently each item was used. When a request for “B” comes in, the cache notes that it already has “B” saved in memory, and brings “B” to the top (as it is now the most recently used). Nothing is inserted or deleted; the entries are simply reordered by recency.
When a new request comes in for “G” in diagram 2, the cache notices that it does not have “G” already saved in memory, and it also does not currently have extra space. Per the LRU principle, the cache will eliminate the least recently used data, which is “A” in this case. “G” is placed on top per diagram 3. Understanding this principle can help us to make informed decisions about when and where to use caching.
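The steps in the diagrams can be sketched with `collections.OrderedDict`, which is itself backed by a doubly linked list plus a hash map, the same pair of structures mentioned above (the class and keys here are illustrative, not a production cache):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used

# Mirror the diagrams: cache holds A-F, "B" is touched, then "G" arrives.
cache = LRUCache(6)
for k in "ABCDEF":  # "A" goes in first, so it is the least recently used
    cache.put(k, k.lower())
cache.get("B")       # "B" becomes the most recently used
cache.put("G", "g")  # cache is full, so "A" is evicted to make room
print(cache.get("A"), cache.get("G"))  # None g
```

Both `get` and `put` are O(1): the hash map finds the entry instantly, and the linked list makes moving it to the end (or dropping the head) cheap.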
So there you have it — 3 of many fundamental concepts in the world of system design. The more I dig in, the more I realize just how vast this topic is (as is the case with pretty much every topic in this field). I’m going to give myself time to digest what I’ve gathered thus far. There’s more where this came from!