Interesting database concepts and algorithms to know

An overview of some database concepts and related algos

I was reading an article about how Github’s new code search works (highly recommend the read) and stumbled upon a lot of new concepts that I didn’t know. To be more specific, I wanted to know more about:

How database indexes work
Inverted indexes
Ngram
Database iterators
Content addressable storage or hashing
Tree data structure
Delta encoding
Minimum spanning tree
K merge strategy

Some of those topics I kinda knew but never really took the time to properly (and confidently) understand, so today was a good opportunity to dive (not too) deep into them. Starting with:

Database indexes

In simple terms, database indexes are data structures placed inside the database engine that help you perform better searches over your data. One good explanation I found in stack overflow was a comparison with book indexes: those pages at the beginning of a book help the user to go to a certain section and find the right page, without having to go through every single page linearly.

We can index any data we want in a database, in fact, we can have multiple indexes for a table. Take for example a Users table, with a name, userId, age, and gender columns. When choosing an index, it’s always good to take into account the data’s cardinality - how often a given data can have unique values. Data with high cardinality means there’s a greater chance for them to be unique, while data with low cardinality are more prone to have repeated values. Choosing indexes with high cardinality is better since it’s way more effective to perform a binary search when your indexes are more specific.

For example, think of having an index for the gender column (assuming we would have a male and female value). It would be kind of ineffective because the data would be split in two and the search engine would have to go through half of the records linearly to find its result. On the other way around, having an index with high cardinality means splitting the data into multiple records, generating a better environment for a more effective binary search. In the example above, name and userId would be good choices for indexes.

How an index is implemented depends on the database engine and how it was built, but generally, every time we create a new index, a new adjacent data structure (such as B trees, hash maps) is created within the database. Think of it as a new table with two columns: the data you are indexing (name or userId in our example) and a reference/pointer to the value in the original table. This new index table is sorted and ordered, optimizing for a binary search.

As indexes create a new table in your database, we have to be careful when creating them. Apart from occupying extra space, they can also slow down the INSERT queries as the database engine has to create a new record to the original table and the new index table. So don’t just mindlessly create new indexes for all your table columns lol.

Inverted indexes

Still in the subject of indexes, we have index data structures optimized for text based searches: forward index and inverted index. They are both like hashmap structures, but with an important difference: forward indexes uses the document as the key and the terms (or words) as the values.

Meanwhile, the inverted index do the opposite (surprised?) and uses the term/word as key and the document as value.

For instance, this is an example of forward index:


  Document                    Terms

  São Paulo is a nice city    city, cosmopolitan, concrete, buildings
  Rio is beautiful            city, beach, nature, coconut

And this is the same example as inverted index:


  Term            Documents

  city            São Paulo is a nice city, Rio is beautiful
  cosmopolitan    São Paulo is a nice city 
  beach           Rio is beautiful
  concrete        São Paulo is a nice city

As we can see from above, the forward index should be faster to write data to the DB as we just have to append a new document with its terms, whereas with an inverted index we have to strip down the words and rebuild the index every time there’s a new insert (eg. a new word may be present in an existing document).

For reads, the inverted index is actually faster as we only have to search for onde word to retrieve a given document, whereas for forward indexes we need to go through all the entries to search for the terms and check which documents contains a given term.

Content addressable storage

This is a way to store and retrieve information based on its content. The content goes through a cryptographic hashing function that generates unique keys for a given content. If the content is updated, a new key will be generated and a new object for storage will be created, avoiding content duplication in the database. Its a good way to ensure the file’s uniqueness and validity.