Notes on the Designing Data-Intensive Applications book by Martin Kleppmann #2

Databases, storage and query languages

In this chapter, Martin talks about how data models are important (fundamental, actually) to software systems. A data model is a way to abstract complexity and enable one layer of the system to communicate effectively with other layers. As an example, the application developer takes raw real-world data and transforms it into objects, data structures and APIs. When storing those data structures, we model them into a chosen generic data model for storage, such as JSON, XML, tables, documents or graphs. The database software models that data as bytes in memory, on disk or on the network, so it can be queried or processed. Low-level engineers model that data as representations of electrical currents, pulses of light and other low-level stuff.

So modeling your data is essential to make all those layers interact with each other. In complex systems, we can have more and more layers, APIs on top of APIs.

Database model example

The majority of applications today use object-oriented structures, which are hard to map to a relational database model of tables, rows and columns. That’s why ORMs exist: to help translate between those two layers (application and database).

Imagine we want to express a resumé in a relational model. It would have a users table with a user_id key used by other tables as a foreign key pointing back to the users table. Multiple tables like jobs, education and contact_info would have their own key plus a user_id key pointing back to the users table, expressing a one-to-many relationship (one user with many items).

As the resumé is self-contained data (it doesn’t depend on other data), a JSON document model could be well suited:


    {
      "user_id": 157,
      "first_name": "Bill",
      "last_name": "Gates",
      "region_id": "us:91",
      "industry_id": 131,
      "positions": [
        { "job_title": "Co-founder", "organization": "Microsoft" }
      ],
      "education": [
        { ... }
      ]
    }

Some argue that this JSON representation has less impedance mismatch (heavy-lifting conversion) between the application code and the storage layer. The JSON representation also has better locality than a multi-table approach: for relational data you’d have to perform multiple queries (one per table) or perform joins to retrieve the full resumé, whereas with JSON all the relevant data is retrieved in one single query.

A one-to-many data representation implies a tree-like structure for your data, beginning with a user at the top and going down the tree through its relationships until reaching the leaves.

Normalization

It’s important to try to remove duplication in the database. For example, instead of accepting free text from the client, prefer a drop-down list with predefined values. That way, we can store the data as a simple ID in the database, instead of having duplicate values like “Seattle” and “Greater Seattle Area”. Those values only have meaning to humans, but if we store them as IDs in the database, we remove that duplication concern and can benefit from it:

if you are duplicating values that could be stored in just one place, the schema is not normalized

Different types of relationships

A many-to-one relationship occurs when, using the resumé example, many people live in a particular region or work in a particular industry. This doesn’t fit too nicely in the document model, whereas in the relational model we can refer to rows in other tables by ID (and perform joins).

When a database doesn’t support joins, we have to emulate the join in application code by making multiple queries to the database and combining the results ourselves.
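
As a rough sketch of that pattern (the collection and field names here are hypothetical, and a MongoDB-style driver API is assumed), fetching a resumé plus its region name might look like this:


    // Emulating a join in application code: one query for the user document,
    // a second query to resolve the referenced region, combined in memory.
    async function getResume(db, userId) {
      const user = await db.collection("users").findOne({ user_id: userId })
      const region = await db.collection("regions").findOne({ region_id: user.region_id })
      return { ...user, region_name: region.name } // the "join" happens here
    }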

Document databases are good for one-to-many relationships, since they can nest records in the document, but they make many-to-one and many-to-many relationships more difficult, as those require document references and additional queries. Relational databases, on the other hand, provide support for joins and for many-to-one and many-to-many relationships.

The choice of which database model to use depends on your application’s needs: if it needs highly interconnected data, the document model isn’t the best choice. You can use either a relational model or a graph model.

Schema flexibility for Document model

Not having a schema is good for flexibility, but it also means that application code has no guarantee that the data being requested will be there (or ever existed), so it should carefully handle those cases. Saying that document-based models are schemaless is misleading, since there is an implicit schema assumed by the application (it’s just not enforced by the database).

Schema-on-read (the structure of the data is implicit and only interpreted when the data is read) contrasts with schema-on-write (the traditional approach of relational databases, where the database ensures all written data conforms to the schema).

These differences are most notable when you need to change the format of data in your database. For document model DBs, you’d basically just start writing the new data format and handle the difference in application code:


    if (user && user.name && !user.first_name) {
      // Documents written before Dec 8 don't have first_name
      user.first_name = user.name.split(" ")[0]
    }

Meanwhile, on relational DBs that enforce a schema, you’d have to perform a migration:


    ALTER TABLE users ADD COLUMN first_name text;
    UPDATE users SET first_name = split_part(name, ' ', 1);      -- PostgreSQL
    UPDATE users SET first_name = substring_index(name, ' ', 1); -- MySQL

The UPDATE command will be slow since it has to touch every row of the database. If that’s not acceptable, we can handle the change in application code just like the document model does: initially leave first_name set to NULL, and then fill in the value on every read (using application code to do that).

If you are working on a heterogeneous system with different and inconsistent data, or depend on third-party systems to provide data over which you have no control, a schema may hurt more than help, so document-based models are a good choice.

Data locality

A document is usually stored as a single encoded string, such as JSON or XML, or a binary variant like BSON for MongoDB. If your application needs to access and retrieve the entire document at once, the document model has an advantage over a relational model, where data split across multiple tables requires multiple index lookups and disk seeks, taking more time.

Query Languages

There are two ways of querying data: declarative or imperative.

The imperative way is commonly used in programming languages and looks something like the following:


    function getData() {
      var data = []

      // Scan every record and keep only the ones whose name is "Good"
      for (var i = 0; i < allData.length; i++) {
        if (allData[i].name == "Good") {
          data.push(allData[i])
        }
      }

      return data
    }
  

We can see from above that the code tells the program exactly how to fetch the data and the instructions are read line by line. This is imperative programming.

On the other hand, SQL and other query languages introduce a more declarative way of querying for data:


    SELECT * FROM allData WHERE name = 'Good';
  

We specify what we want to fetch and the constraints (conditions) and the database engine will be in charge of deciding how to fetch this data. If the database engine decides to change algorithms internally, it can safely do it without impacting the query syntax used, because it is abstracted behind the “declarative way” of fetching data.

Another example where declarative queries come in handy is CSS:


    li.selected > p {
      background-color: blue;
    }
  

Imagine selecting all child p elements under the .selected class using imperative code. It would be such a mess, having to get elements, assert on their tags and check whether they have a class attached. The declarative language of CSS abstracts all that complexity and lets the browser engine decide that for us; we just need to define what we want to “query”.
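
Just to illustrate the point, here’s a rough sketch of what that imperative version could look like with plain DOM APIs (not meant to be production code):


    // Imperative equivalent of "li.selected > p { background-color: blue }":
    // walk every <li>, check its class by hand, then walk its children.
    var listItems = document.getElementsByTagName("li")
    for (var i = 0; i < listItems.length; i++) {
      if (listItems[i].className.split(" ").indexOf("selected") !== -1) {
        var children = listItems[i].children
        for (var j = 0; j < children.length; j++) {
          if (children[j].tagName === "P") {
            children[j].style.backgroundColor = "blue"
          }
        }
      }
    }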

There’s also a programming model called MapReduce, which is a middle ground between declarative and imperative querying. It is used especially in NoSQL databases like MongoDB and CouchDB, and it’s a fairly low-level way of defining advanced queries:


  db.observations.mapReduce(
    function map() {
      var year = this.observationTimestamp.getFullYear()
      var month = this.observationTimestamp.getMonth() + 1
      emit(year + "-" + month, this.numAnimals)
    },
    function reduce(key, values) {
      return Array.sum(values)
    },
    {
      query: { family: "Sharks" },
      out: "monthlySharkReport"
    }
  )
  

From the code above, the map function is called once for each document that matches the query. It emits a key-value pair, the key being a combination of year and month, and the value being the number of animals seen in that period. The reduce function is called later and sums the values for every equal key. The result is output to the monthlySharkReport collection. MapReduce functions must be pure functions: they cannot perform additional database queries and can’t have side effects. Since they are tricky to implement and force the developer to think carefully about the JavaScript functions, MongoDB created a fully declarative query language called the aggregation pipeline that does the same thing:


  db.observations.aggregate([
    { $match: { family: "Sharks" } },
    { $group: {
      _id: {
        year: { $year: "$observationTimestamp" },
        month: { $month: "$observationTimestamp" }
      },
      totalAnimals: { $sum: "$numAnimals" }
    } }
  ])
  

That said, it feels like the NoSQL query language is reinventing SQL with another syntax, since we can achieve the same result in SQL:


  SELECT date_trunc('month', observation_timestamp) AS observation_month,
    sum(num_animals) AS total_animals
  FROM observations
  WHERE family = 'Sharks'
  GROUP BY observation_month;
  

Graph-Like Data Models

Graph models are recommended for many-to-many relationships in your data. A graph consists of two types of objects: vertices (nodes or entities) and edges (relationships or arcs). Some examples are social graphs (people and who knows whom), the web graph (pages and the links between them), and road or rail networks (junctions and the roads between them).

Graphs don’t have to use the same data type for all vertices. We can have powerful graphs with heterogeneous data, like on Facebook, where vertices represent people, locations, events, comments, etc.

There are primarily two ways of structuring graph models: property graph (Neo4j, Titan, InfiniteGraph) and triple-store (Datomic, AllegroGraph).

Property Graph

In this model, each vertex has a unique identifier, a set of outgoing edges, a set of incoming edges, and a collection of properties (key-value pairs).

And each edge has a unique identifier, the vertex at which it starts (the tail vertex), the vertex at which it ends (the head vertex), a label describing the relationship, and a collection of properties (key-value pairs).

Just as an example, we can think of them as two sets of relational tables:


    CREATE TABLE vertex (
      vertex_id integer PRIMARY KEY,
      properties json
    );

    CREATE TABLE edges (
      edge_id integer PRIMARY KEY,
      tail_vertex integer REFERENCES vertex (vertex_id),
      head_vertex integer REFERENCES vertex (vertex_id),
      label text,
      properties json
    );

    CREATE INDEX edges_tails ON edges (tail_vertex);
    CREATE INDEX edges_heads ON edges (head_vertex);
  

We can easily traverse the graph from a given vertex, using the tail_vertex and head_vertex columns. We can also use different labels for different kinds of relationships, allowing us to evolve the database as needed by storing different kinds of information in a single graph. It’s very flexible and extensible.

Neo4j created a query language called Cypher to query property graphs. The syntax to create vertices and edges is something like:


  CREATE
    (NAmerica:Location { name: 'North America', type: 'continent' }),
    (USA:Location { name: 'United States', type: 'country' }),
    (Idaho:Location { name: 'Idaho', type: 'state' }),
    (Lucy:Person { name: 'Lucy' }),
    (Idaho) -[:WITHIN]-> (USA) -[:WITHIN]->(NAmerica),
    (Lucy) -[:BORN_IN]-> (Idaho)
  

And one way to query it is:


    MATCH
      (person) -[:BORN_IN]->  () -[:WITHIN*0..]-> (us:Location {name:'United States'}),
      (person) -[:LIVES_IN]-> () -[:WITHIN*0..]-> (eu:Location {name:'Europe'})
    RETURN person.name
  

The query above finds a BORN_IN edge to some vertex and from there follows a chain of outgoing WITHIN edges until it reaches a vertex matching (us:Location {name:'United States'}). The same query also finds a LIVES_IN edge to some vertex and follows a chain of outgoing WITHIN edges until it reaches a vertex matching (eu:Location {name:'Europe'}). It is searching for people who were born in the US and now live in Europe.

Doing the same query in a relational database would be much harder, because we don’t know in advance how many joins we’d need: the query may have to traverse a variable number of edges before finding the vertex we want. As you can see in the Cypher query above, we leverage :WITHIN*0.., which means “follow a WITHIN edge, zero or more times”. The -[:LIVES_IN]-> () -[:WITHIN*0..]-> pattern, for example, finds a LIVES_IN edge that points to any kind of location, be it a street, city, state or country. That’s why we use an “empty” () there.

Triple Store

It’s mostly the same idea as property graphs, just with different names. It consists of three-part statements: subject (a vertex), predicate (an edge) and object (either another vertex or a primitive datatype like a string or number).


  @prefix : <urn:example:>.

  _:lucy a :Person; :name "Lucy"; :bornIn _:idaho.
  _:idaho a :Location; :name "Idaho"; :type "state"; :within _:usa.
  

To query the triple-store data model, there is a query language called SPARQL. Here’s a fragment of the earlier Cypher query and its SPARQL equivalent:


    (person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (location) #Cypher

    ?person :bornIn / :within* ?location. # SPARQL
  

Chapter 3 Storage and Retrieval

In the previous section we learned how data can be stored in databases (data models) and how to ask for it back (query languages). In this chapter we’ll look at the same things, but from the database’s perspective.

We will take a look at two families of storage engines: log-structured and page-oriented.

“Log” might be confused with application logs, which describe what’s happening in the code. Here, log has a more general meaning: an append-only data file, or an append-only sequence of records.

In order to efficiently find values in the database we need an index data structure, which is basically additional metadata that acts as a signpost and helps us locate the data. This is more efficient than scanning the whole database looking for occurrences of the key.

An index is an additional structure derived from the primary data. Many databases allow you to add or remove indexes without affecting the primary data. Having indexes usually affects the performance of write operations, since the indexes have to be updated every time data is written. This is the trade-off of indexes: they speed up reads but slow down writes. It is up to you, the application developer, to choose the indexes that best suit your needs.

Hash Indexes

The most basic example of an index is an in-memory hash map that maps each key to a byte offset in the data file on disk. Something like:


    {
      # key: byte offset
      12345: 64
    }
  

Whenever you write a new entry to the database, you update this hash map. When reading, you look up the key in the hash map, grab the byte offset (the data’s location on disk), seek to that location and read the value.
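
Here’s a toy sketch of that idea, assuming Node.js, a single append-only data file, and records small enough to fit in one read buffer; a real engine would handle these details much more carefully:


    const fs = require("fs")

    const DATA_FILE = "db.log"
    const index = new Map() // key -> byte offset of the latest value on disk

    function set(key, value) {
      const record = JSON.stringify({ key, value }) + "\n"
      const offset = fs.existsSync(DATA_FILE) ? fs.statSync(DATA_FILE).size : 0
      fs.appendFileSync(DATA_FILE, record) // append-only write to the log
      index.set(key, offset)               // remember where the record starts
    }

    function get(key) {
      if (!index.has(key)) return undefined
      const fd = fs.openSync(DATA_FILE, "r")
      const buffer = Buffer.alloc(4096)    // assumes records fit in 4 KB
      const bytes = fs.readSync(fd, buffer, 0, buffer.length, index.get(key))
      fs.closeSync(fd)
      const line = buffer.toString("utf8", 0, bytes).split("\n")[0]
      return JSON.parse(line).value
    }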

When talking about log-structured engines (which only append data on writes), you may wonder how they don’t rapidly run out of disk space, since they’re append-only, right? There are solutions, like breaking the log into segments of a fixed size and performing compaction, which means removing duplicate keys and keeping only the most recent value for each key. We can also merge different segments into one while we perform compaction, so we get deduplication of records and deletion of old segments. Append-only logs are fast to write and sequential, so they are more predictable than random writes or overwrites.
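
A minimal sketch of compaction and merging, assuming each segment is just an ordered list of records (oldest segment first):


    // Keep only the most recent value for each key across all segments.
    function compactAndMerge(segments) {
      const latest = new Map()
      for (const segment of segments) {         // from oldest to newest
        for (const { key, value } of segment) {
          latest.set(key, value)                // newer values overwrite older ones
        }
      }
      return Array.from(latest, ([key, value]) => ({ key, value }))
    }

    // The two writes for "mew" collapse into the latest one:
    compactAndMerge([
      [{ key: "mew", value: 1078 }, { key: "purr", value: 2103 }],
      [{ key: "mew", value: 1079 }],
    ])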

Each segment has its own in-memory hash table index, and lookups check the most recent segment’s hash table first. If the key isn’t found there, they check the second most recent segment’s hash table, and so on.

Important points: the hash tables must fit in memory, so watch out if you have a very large number of keys. In theory they could be stored on disk, but that doesn’t perform well due to the amount of random I/O access. Another thing is that you have to look up each key individually and cannot do efficient range queries (e.g., get all keys from 1 to 100).

SSTables

We can change the format of our segments by sorting them by key. SSTable stands for Sorted String Table. SSTables have advantages over logs with hash indexes:

  1. Merging segments is simple and efficient, similar to the merge step of mergesort.
  2. You no longer need to keep an index of all keys in memory: a sparse in-memory index (one key every few kilobytes) is enough, since keys are sorted and you can scan from the nearest indexed key.
  3. Records can be grouped into blocks and compressed before being written to disk.

They are still a log-structured index.

So how do we construct and maintain an SSTable? We can follow these steps:

  1. When a write comes in, add it to an in-memory sorted tree data structure (often called a memtable).
  2. When the memtable gets bigger than some threshold, write it out to disk as an SSTable file. This is efficient because the in-memory tree is already sorted by key. This new SSTable becomes the newest database segment, and new writes continue to go to a fresh memtable instance.
  3. When a read request comes in, search the memtable first, then the most recent on-disk segment, then older segments, and so on.
  4. From time to time, run a merging and compaction process in the background to combine segment files and discard old ones.
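
Putting those steps together, here’s a toy sketch of the flow; a real LSM engine would use a balanced tree for the memtable and binary segment files with sparse indexes, but plain JavaScript structures are enough to show the idea:


    const MEMTABLE_LIMIT = 4 // flush threshold (tiny, just for illustration)
    let memtable = new Map() // in-memory writes, flushed sorted by key
    const sstables = []      // flushed segments, newest last

    function put(key, value) {
      memtable.set(key, value)
      if (memtable.size >= MEMTABLE_LIMIT) {
        // Step 2: write the memtable out as a new segment, sorted by key
        const segment = [...memtable.entries()].sort(([a], [b]) => (a < b ? -1 : 1))
        sstables.push(segment)
        memtable = new Map() // new writes go to a fresh memtable
      }
    }

    function get(key) {
      if (memtable.has(key)) return memtable.get(key)  // step 3: memtable first
      for (let i = sstables.length - 1; i >= 0; i--) { // then newest segment first
        const hit = sstables[i].find(([k]) => k === key)
        if (hit) return hit[1]
      }
      return undefined
    }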

Storage engines built on this merging and compacting principle are often called LSM-tree (Log-Structured Merge-Tree) engines.

B-Trees

B-trees are the most widely used indexing structure. They are the standard index implementation in many relational and non-relational databases.

B-trees also keep key-value pairs sorted by key, but they break the database into fixed-size blocks or pages (usually 4 KB each) and read or write one page at a time. Each page can be identified by an address or location, and pages can reference other pages, forming a tree of pages. There’s always a root page covering the full range of keys, with references to child pages. Since keys are ordered, each child page covers a contiguous range of keys. If we are looking for a specific key, we just follow the child page references until we reach the leaf page containing the value for that key, or a reference to the page where the value can be found. The number of child references in one page is called the branching factor and is typically several hundred. Most databases fit into a B-tree that is 3 or 4 levels deep (a four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB).
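
A minimal sketch of that lookup, with pages represented as plain objects and a tiny branching factor so the example stays readable:


    // Follow child-page references down the tree until we reach a leaf page.
    function btreeLookup(page, key) {
      if (page.isLeaf) {
        const entry = page.entries.find((e) => e.key === key)
        return entry ? entry.value : undefined
      }
      // keys[i] separates children[i] (keys below it) from children[i + 1]
      let i = 0
      while (i < page.keys.length && key >= page.keys[i]) i++
      return btreeLookup(page.children[i], key)
    }

    // Branching factor of 3 here; real B-trees use several hundred.
    const root = {
      isLeaf: false,
      keys: [200, 300],
      children: [
        { isLeaf: true, entries: [{ key: 100, value: "alice" }, { key: 150, value: "bob" }] },
        { isLeaf: true, entries: [{ key: 250, value: "carol" }] },
        { isLeaf: true, entries: [{ key: 333, value: "dave" }] },
      ],
    }
    btreeLookup(root, 250) // "carol"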

Updates are performed by finding the leaf page containing the value, changing the value in that page, and writing the page back to disk. This doesn’t change the location of the page, so references to it remain valid.

In order to make B-tree databases reliable, we usually add another data structure on disk called a write-ahead log (WAL, also known as a redo log). Before applying any change to a page, we append that change to the WAL file, so if the database crashes we can use it to restore the B-tree to a consistent state.
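
As a rough sketch of the idea (assuming Node.js and hypothetical file names): append a description of the change to the WAL and flush it before touching the page itself, so the change can be replayed after a crash:


    const fs = require("fs")

    function writePage(pageId, newContents) {
      // 1. Record the intended change in the append-only WAL and force it to disk
      const walFd = fs.openSync("wal.log", "a")
      fs.writeSync(walFd, JSON.stringify({ pageId, newContents }) + "\n")
      fs.fsyncSync(walFd)
      fs.closeSync(walFd)

      // 2. Only then overwrite the page in place
      fs.writeFileSync("page-" + pageId + ".bin", newContents)
    }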

Other indexing structures

The indexing structures we discussed (B-trees, LSM-trees) are key-value indexes, which act like primary keys in relational databases. Secondary indexes can also be constructed from a key-value index; the difference is that keys are not unique, so each value is either a list of matching row identifiers, or each entry is made unique by appending a row identifier to it. We can create as many secondary indexes as we want in a relational database.

In a key-value index, the key is what the query searches for. The value can be either the actual row/document/vertex or a reference to a row/document/vertex stored elsewhere. In the latter case, the place where rows are stored is called a heap file, and the index values point into it. This is useful for secondary indexes as it avoids duplicating data: multiple secondary indexes can point to the same value, which is kept in one place.

Sometimes, though, read performance suffers from the extra hop from the index to the heap file, so it can be preferable to store the row directly in the index. That’s what we call a clustered index. For example, in some databases the primary key of a table is always a clustered index. A covering index stores just some of the table’s columns within the index.

What about an efficient index for querying multiple columns at the same time? Take geospatial data with latitude and longitude. A regular key-value index isn’t good enough, because we’d have to query a range on latitude and then filter by longitude (or vice versa). Multi-dimensional indexes help here. A common approach for querying a two-dimensional location is to translate it into a single number using a space-filling curve and then use a regular B-tree index. Another alternative is a specialized spatial index such as an R-tree.
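
To make the space-filling-curve idea concrete, here’s a small sketch of a Z-order (Morton) key: each coordinate is quantized to 16 bits and their bits are interleaved, so points that are close in 2D usually end up close in the resulting one-dimensional key space, which a regular B-tree can index:


    // Map a coordinate into an integer of `bits` bits within [min, max].
    function quantize(value, min, max, bits) {
      const levels = (1 << bits) - 1
      const scaled = Math.round(((value - min) / (max - min)) * levels)
      return Math.min(Math.max(scaled, 0), levels)
    }

    // Interleave the bits of latitude and longitude into one Morton key.
    function mortonKey(lat, lon) {
      const x = quantize(lat, -90, 90, 16)
      const y = quantize(lon, -180, 180, 16)
      let key = 0
      for (let i = 0; i < 16; i++) {
        key |= ((x >> i) & 1) << (2 * i)     // even bits come from latitude
        key |= ((y >> i) & 1) << (2 * i + 1) // odd bits come from longitude
      }
      return key >>> 0 // use this as the key in a regular B-tree index
    }

    mortonKey(51.5007, -0.1246) // a single sortable number for a 2D location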

The most common type of multi-column index is the concatenated index, where several fields are combined into one key, for example (lastname, firstname). As it’s sorted by lastname first, we can find all people with a particular lastname, or with a particular lastname + firstname combination, but we can’t query people by their firstname alone.
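
A small sketch of why that is: if the index keys are the two fields concatenated (with a separator) and kept sorted, a lastname or lastname + firstname lookup is a prefix scan, while a firstname-only lookup has nothing to anchor on:


    // Keys sorted by "lastname\u0000firstname"
    const index = ["Gates\u0000Bill", "Gates\u0000Melinda", "Kleppmann\u0000Martin"].sort()

    // All people with a given last name: a prefix scan over the sorted keys
    function byLastName(lastname) {
      const prefix = lastname + "\u0000"
      return index.filter((key) => key.startsWith(prefix))
    }

    byLastName("Gates") // both Gates entries
    // Finding every "Bill" would require scanning all keys: no useful prefix.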

For full-text or fuzzy search, when you don’t have the exact key or you want to find similar keys, there are other structures that allow you to search within an edit distance (a distance of 1 means one letter has been added, removed or altered), or that store the collection of keys in a form that supports this kind of search.

The approaches we’ve talked about so far all deal with disk at some point, because disks are more durable (resistant to power loss, etc.) and cheaper than RAM. However, there are plenty of in-memory databases that implement solutions making them more useful than a caching-only database (where it’s acceptable to lose the data after a machine restart). For example, some write a log of changes to disk, take periodic snapshots to disk, or even replicate the in-memory state to other machines, so that when a machine restarts it can restore its state. In-memory databases are more performant not only because they avoid frequent disk access, but also because they avoid the overhead of encoding in-memory data structures into a form that can be written to disk.

Analytics and Column-oriented warehouses

The most common use case for a database in applications nowadays is transaction processing, meaning low-latency reads and writes (usually from a web interface). Things like recording a sale, writing a comment on a social media app, or reading the list of products on a website can all be considered transaction processing (OLTP). On the other hand, we have analytic systems (OLAP), where databases are used by business people or data scientists to analyse an enormous amount of data, usually with higher latency, scanning huge numbers of records and writing in batches.

Databases were created for this specific analytics use case, and they are called data warehouses.

Big companies usually run both kinds of databases internally, because it’s not a good idea to have people running huge analytic queries against your OLTP database: they can be quite expensive and harm the concurrent execution of transactions (interfering with user actions that require reads/writes). OLAP systems are therefore better suited for these internal queries that process a lot of data. They are also built and fed differently, as they usually take data from multiple sources and load it through an ETL process (extract, transform, load). Imagine a company having multiple databases for user-facing products like e-commerce, stock keeping and logistics. Data from all these databases goes through ETL, and the final data (optimized for analytic queries) is dumped into the data warehouse.

The most common data model for data warehouses is the so-called star or snowflake schema, where a main table called the fact table contains a row for every individual event (views, purchases, etc.), surrounded by satellite tables called dimension tables. The fact table has some columns holding attributes relevant for queries; its other columns are foreign keys to the dimension tables, so you can picture the shape of a star or snowflake (a center table pointing out to its dimensions).

Fact tables are usually very wide: they can have more than 100 columns, mixing attributes and foreign keys to dimension tables. And the number of rows can be overwhelming, millions or even trillions of rows (as they store every single event). For efficient analytic querying, a different approach is used: column-oriented storage. An analytic query often touches just a few columns (date, product_sk, quantity…), so instead of laying the storage out in rows (all values from one row stored next to each other), values are stored by column. Each column has its own storage file, and the same position in each column file corresponds to a single row:


    date_key file: 140102, 140102, 140102, 140103, 140104
    product_sk file: 69, 69, 69, 74, 31
    store file: 4, 5, 5, 2, 8
    quantity file: 1, 3, 4, 1, 5

Each column has its own file with the values. Index 0 of each column (140102, 69, 4, 1) corresponds to one event, index 1 (140102, 69, 5, 3) corresponds to another event, and so on.

These values can be sorted and then compressed, for example with bitmap encoding, which decreases the size of the files.
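
As a small sketch of bitmap encoding: for each distinct value in a column, build a bitmap with one bit per row, set to 1 where that row holds the value (sorted, repetitive columns then compress further with run-length encoding):


    function bitmapEncode(column) {
      const bitmaps = new Map() // distinct value -> one bit per row
      column.forEach((value, row) => {
        if (!bitmaps.has(value)) bitmaps.set(value, new Array(column.length).fill(0))
        bitmaps.get(value)[row] = 1
      })
      return bitmaps
    }

    // The product_sk column from the example above:
    bitmapEncode([69, 69, 69, 74, 31])
    // 69 -> [1, 1, 1, 0, 0], 74 -> [0, 0, 0, 1, 0], 31 -> [0, 0, 0, 0, 1]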