Episode 14 - Practical Data Structures: Ordered Indexes vs Hash Tables

13 mins

2020-06-14 engineering

Based On

Hash tables are data structures that map keys into values. Used in Python’s dicts, Go’s maps, Java’s HashMaps, and other places.
However in databases, the default structure is almost always an ordered index, typically a B-Tree.
Hash tables use a fast and repeatable hash function to assign each key a unique place in memory to store its values (sometimes called buckets).
The load factor is the number of entries occupied in the table, divided by the number of unique storage buckets. As it grows, so does the lookup time for its entries.
You want to keep that load factor below 1, but if it’s too low, then you’re wasting memory. If it’s too high, then you have lower access times because you’re storing more than one value inside each bucket.
B-Trees are a different type of data structure built to maintain sorted data for common operations like searching, sequential access, insertion and deletions in logarithmic time.
Each leaf node in the tree contains a number of keys that divide the structure into sub-trees.
A node with 3 children has 2 keys, where the values in the leftmost sub-tree are smaller than the first key, the middle is between the two keys, and the rightmost is higher than the 2nd key.
B-Trees tend to be “general purpose”, which results in a lower total cost for very large persistent data.
They are slower for single value access, but better when considering rare operations at the cost of multiple indexes.

Hash tables provide constant time access for single values, where B-Trees provide logarithmic time access.
Since trees keep values in order, they are more efficient at accessing ranges of values.
When a hash table is growing beyond it’s pre-allocated size, you can find yourself with inserts that involve slow O(n) scans as it gradually copies items to a new table.
Trees, however, will always have their constant O(log(n)) worst case performance.

Databases are built for persistent data that should exist forever, so they tend to store more information both in size and number.

Hash tables are faster for the common case, but only slightly slower than rare full table scans.
But as the data set grows and the load factors increase, that full table scan becomes less of an issue.

When the number of elements is small, it’s cheap to store extra copies of the data indexed differently.
Adding indexes to an established database with lots of records takes up a lot of space and can be time consuming (hours or even days).
As a result, it can be advantageous to reuse an index for different queries. Something you can do with ordered indexes, but with hash tables, you’ll need to create separate indexes.

Storing data on disk so it’s never lost or corrupted is hard. Writes need to be ordered, you must use write-ahead logging, and may need fine grained concurrency control. Much more code than in-memory structures.
Hence, a lot of databases only have a single type of index. One that works for a wide range of workloads, even if it’s slightly less efficient.

In computing architecture, accessing memory vs accessing disk is a difference in several orders of magnitude, depending on how well your cache is performing.
While you may want to avoid a cache miss that fetches something from memory, you also want to avoid going to disk more often.
Using a hash table index on disk implies random access (more look-ups), whereas using an ordered structure allows you to keep “related” data close, therefore optimizing the pages you load in memory for faster access.

Background music by Vendredi from Audio Library+

practicality data structures