Notes on Designing Data-Intensive Applications by Martin Kleppmann #1

Fundamentals of data-intensive apps

Main Focus

The book states clearly from the beginning that its primary goal is to go through the fundamentals of data-intensive applications (applications that rely heavily on data): how each layer of the system works and how the layers interact with each other. It helps developers come up with ideas on how to keep their systems scalable, highly available, robust and easy to maintain (bold statements, I know). For context, according to the book, data-intensive applications are all apps whose primary challenge is to handle:

the quantity of data, the complexity of data, or the speed at which it is changing

Reliability + Scalability + Maintainability

The first chapter talks about these 3 primary concepts that form the baseline for data intensive applications.

Reliability is how fault-tolerant an application is: how it manages to keep working despite some parts of the system collapsing. Faults can be caused by hardware failures, software errors or ultimately human errors. To prevent them, some nice ideas are presented: adding hardware redundancy, thorough testing and monitoring to catch software errors, and designing for humans (good abstractions, sandbox environments where people can experiment safely, and quick, easy recovery from mistakes).

Then we have Scalability, which is the system’s ability to handle an increasing workload and keep performing well (whatever that means for each service). There’s a really interesting story about Twitter’s scalability issues back in 2012.

They had two main operations: Post tweet, with 4.6k requests/sec throughput on average, and Home timeline view, with 300k requests/sec on average. When scaling, Twitter faced the fan-out dilemma: each user follows many people and is also followed by many people. Therefore, they had two options for handling posting tweets and “reading” tweets (the Home timeline view). The first: insert tweets into a single global collection and, whenever a user loads their home timeline, look up everyone they follow and merge their tweets with a query like:


    -- Fetch the tweets of everyone the current user follows
    SELECT tweets.*, users.* FROM tweets
      JOIN users ON tweets.sender_id = users.id
      JOIN follows ON follows.followee_id = users.id
      WHERE follows.follower_id = current_user

That’s a nice option, but the system started to struggle with the read operations (that query runs for every user whenever they load their home page). The second option is to maintain a cache of each user’s home timeline: when someone posts a tweet, look up all of their followers and insert the new tweet into each follower’s cached timeline, making reads cheap at the cost of more work at write time.

The solution was to implement a mix of both approaches: for celebs and big Twitter accounts, the first approach was used. For the majority of users, though, new tweets keep getting inserted into their followers’ timeline caches (the new tweet is delivered to each follower). As we can deduce, the number of followers a given user has is crucial in determining which approach to use: for high-profile accounts, delivering a new tweet to, say, 100M followers would be a HUGE number of write operations, so that was ruled out. For all other users there’s no problem doing so, since reads (300k req/sec) vastly outnumber writes (4.6k req/sec), so it pays to do more work at write time to keep reads fast.
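To make the idea concrete, here’s a minimal sketch of that hybrid fan-out in Python. Everything here (the threshold, the names, the in-memory dicts) is an assumption for illustration, not Twitter’s actual implementation:

    # Hybrid fan-out sketch: regular users' tweets are pushed to each
    # follower's cached timeline at write time; celebrity tweets are
    # merged in at read time. Threshold and data structures are hypothetical.
    CELEBRITY_THRESHOLD = 1_000_000

    timeline_cache = {}    # follower_id -> tweets delivered at write time
    celebrity_tweets = {}  # celebrity_id -> that celebrity's recent tweets

    def post_tweet(author_id, tweet, follower_ids):
        if len(follower_ids) >= CELEBRITY_THRESHOLD:
            # One write instead of millions of cache inserts.
            celebrity_tweets.setdefault(author_id, []).append(tweet)
        else:
            for fid in follower_ids:  # fan-out on write
                timeline_cache.setdefault(fid, []).append(tweet)

    def home_timeline(user_id, followed_celebrity_ids):
        merged = list(timeline_cache.get(user_id, []))
        for celeb_id in followed_celebrity_ids:  # merge celebrities on read
            merged.extend(celebrity_tweets.get(celeb_id, []))
        return sorted(merged, key=lambda t: t["created_at"], reverse=True)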

Since the number of followers is such an important variable in deciding the correct architecture, we can call it a load parameter. By increasing the load parameter and monitoring the application’s performance, we can reach a conclusion on whether our system is scalable or not.

Performance-wise, the book enters the percentiles discussion and shows why they’re the best method for measuring the performance of a service. For example, given all the response times for a request over the past week, we can use the median (not the mean) as a halfway point: suppose the median is 200ms; we know for a fact that 50% of users get responses faster than 200ms and the other half don’t. That’s the 50th percentile right there! It’s also common to measure p95, p99 and p99.9 in a large-scale company. If the p95 is 1.5s, it means 95% of requests take less than 1.5s. If the p99 is 2s, 99% of requests take less than 2s to resolve, but the remaining 1% wait longer than 2s (yikes!).
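As a toy example, here’s how those numbers could be computed from raw response times using the nearest-rank method (the sample data is made up, and real monitoring systems usually keep streaming approximations such as HDRHistogram rather than sorting every sample):

    import math

    def percentile(samples, p):
        # Nearest-rank method: the value below which p% of samples fall.
        ordered = sorted(samples)
        rank = math.ceil(p / 100 * len(ordered))
        return ordered[rank - 1]

    times_ms = [120, 180, 200, 210, 250, 300, 450, 900, 1500, 2100]
    for p in (50, 95, 99):
        print(f"p{p} = {percentile(times_ms, p)}ms")
    # p50 = 250ms; with only 10 samples, p95 and p99 both land on 2100ms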

Measuring high percentiles is important because, as shown in the book, they often correspond to the most valuable users (users with a lot of data on their accounts get the slowest responses precisely because they use the service heavily). On the other hand, improving the performance of very high percentiles like p99.9 is hard and may not be worth it engineering-wise. Ah, trade-offs!

We should also watch out for head-of-line blocking, where a slow request blocks faster requests behind it due to some constraint (they sit in a queue, or only one request is processed at a time). That’s why measuring response times on the client side is also important.
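A tiny simulation makes the effect obvious (the numbers are invented): with a single worker processing requests in FIFO order, nine 10ms requests look like multi-second requests to their clients because of one slow request ahead of them:

    # Head-of-line blocking: one slow request at the front of a FIFO queue
    # inflates the client-observed latency of every request behind it.
    service_times = [3.0] + [0.01] * 9  # one 3s request, then nine 10ms ones

    clock = 0.0
    for i, service in enumerate(service_times):
        clock += service  # single worker: each request waits for all before it
        print(f"request {i}: service time {service*1000:.0f}ms, "
              f"client-observed {clock*1000:.0f}ms")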

Also, if your backend makes multiple parallel requests to fulfill a single call, remember that the overall response can only be as fast as the slowest of those requests, so the chance of hitting a slow outlier grows with the number of backend calls (tail latency amplification).
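A quick back-of-the-envelope sketch shows how fast this escalates, assuming each backend call independently has a 1% chance of being slower than its p99:

    # Tail latency amplification: a user request that fans out to n backend
    # calls is slow whenever ANY of them is slow.
    p_slow = 0.01  # each call exceeds the p99 threshold 1% of the time
    for n in (1, 10, 100):
        p_any_slow = 1 - (1 - p_slow) ** n
        print(f"{n:>3} backend calls -> {p_any_slow:.1%} of user requests are slow")
    # prints 1.0%, 9.6%, 63.4%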

All that said about scalability, when it comes to actually scaling your system you have some alternatives, such as scaling up (vertical scaling: moving to a more powerful machine) and scaling out (horizontal scaling: distributing the load across multiple smaller machines, also known as a shared-nothing architecture).

Some systems are also elastic, meaning they automatically add computing resources as the workload increases. Others are scaled manually, where a human has to add more machines.

Nowadays, abstractions for scaling distributed systems are evolving quickly (think AWS auto-scaling features), so a distributed data system may be the go-to today. But in reality, each application is particular and has its own challenges and requirements.

About maintainability now. It consists of 3 main principles: Operability, Simplicity and Evolvability. Operability means having tools and data in place to make life easier for the developers maintaining and monitoring the application (monitoring dashboards, internal tools, good documentation, good default behavior, etc.). Simplicity relies on avoiding unnecessary complexity and using abstraction to the application’s benefit. Evolvability means having a system in which it’s easy to fix bugs, which follows nice patterns, is explicit, and allows new features to be added without much overhead.

That was it for chapter 1. Reliability, Scalability and Maintainability are key concepts throughout the book.