
How PalDB from LinkedIn works

In October 2015, LinkedIn released PalDB, a read-only, in-memory key-value store written in Java. PalDB is specifically optimized for fast read performance and low memory usage.
For instance, a set of 35M member IDs would only use ~290MB of memory with PalDB versus ~1.8GB with a traditional Java HashSet. [...] Current benchmark on a 3.1Ghz Macbook Pro with 10M integer keys index shows an average performance of ~2M reads/s for a memory usage six times less than using a traditional HashSet.
It is available for download on GitHub. The codebase is pretty small and the documentation is good, so I knew how to use it, but I could not find any article describing how it works. So I decided to write this one; take it with caution, since I am not the author of PalDB and everything that follows comes only from my reading of the code.

Introduction

The PalDB API is straightforward.
You can put any type of key and any type of value.

StoreWriter writer = PalDB.createWriter(new File("store.paldb"));
writer.put("foo", "bar");
writer.put(1213, new int[] {1, 2, 3});
writer.close();


After calling writer.close(), the file store.paldb is finally created; you can then open it and read values.

StoreReader reader = PalDB.createReader(new File("store.paldb"));
String val1 = reader.get("foo");
int[] val2 = reader.get(1213);
reader.close();


In my opinion, the most important and interesting parts of PalDB are the serialization and the indexation of the data.

  • serialization: PalDB accepts any type of key and value but stores everything as byte[], which optimizes memory usage

  • indexation: during the writing process, data is stored in temporary files; on close, PalDB optimizes and merges everything into a single archive, which optimizes read speed


Serialization


This is where PalDB compresses and decompresses everything. The central class is StorageSerialization.

Here is how put and get work with the serialization.



Keys and values are serialized using the same methods (although they end up stored in different files).

The general idea to serialize an object is the following :
  • create an empty byte array
  • add as the first byte a prefix describing the class and other data about our object
  • if the object is an array, compress it with snappy ( https://github.com/xerial/snappy-java )
  • try to pack the object in as few bytes as possible and append it to the array, as sketched right after this list
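
To make those steps concrete, here is a minimal sketch of what serializing a boolean and an int[] could look like. This is my own code, not PalDB's: the prefix constants are made up, and the array branch simply uses snappy-java's Snappy.compress.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.xerial.snappy.Snappy;

class SerializationSketch {
  // hypothetical prefix values; the real constants live in StorageSerialization
  static final int BOOLEAN_TRUE = 2;
  static final int BOOLEAN_FALSE = 3;
  static final int INT_ARRAY_COMPRESSED = 40;

  static byte[] serialize(Object obj) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    if (obj instanceof Boolean) {
      // the prefix alone carries both the class and the value
      out.writeByte(((Boolean) obj) ? BOOLEAN_TRUE : BOOLEAN_FALSE);
    } else if (obj instanceof int[]) {
      // arrays are compressed with snappy before being appended
      byte[] compressed = Snappy.compress((int[]) obj);
      out.writeByte(INT_ARRAY_COMPRESSED);
      out.writeInt(compressed.length);   // the real code packs this length (see the packers section)
      out.write(compressed);
    } else {
      throw new UnsupportedOperationException("other types omitted in this sketch");
    }
    return bytes.toByteArray();
  }
}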


A look at: The prefix byte

The prefix byte gives some insight about the object: is it a primitive, an array of primitives, a String, a custom object?
We only have 8 primitive types (plus String), which gives 2*9+1 = 19 different prefixes (actually a little more, since there are also prefixes for int[][], long[][], etc.).

But a byte can hold up to 256 values, far more than 19, so PalDB defines specific prefixes for some class/value pairs. Some prefixes carry enough information to deduce both the class AND the value of the object.
For example, instead of a single BOOLEAN prefix followed by another byte saying whether the value is true or false, there are two prefixes: BOOLEAN_TRUE and BOOLEAN_FALSE.
There are also dedicated prefixes for the integers between 0 and 8 (INTEGER_0, INTEGER_1, INTEGER_2, ...), for the longs between 0 and 8, etc.
In all those cases, the serialized object is just the prefix: a single byte.

Sometimes PalDB has to store an int with a value between 8 and 255. There is no dedicated prefix for every such value, but any number between 8 and 255 fits in a single byte. In that case, PalDB uses the prefix INTEGER_255, telling us that the integer is stored in only two bytes (the prefix and one byte holding the value), instead of one byte for the prefix plus 4 more for the integer.
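
Put together, the int case looks roughly like the sketch below. This is my own code, using the prefix constant values quoted in the table that follows; larger and negative values fall through to the packers covered in the next section.

import java.io.DataOutputStream;
import java.io.IOException;

class IntPrefixSketch {
  static final int INTEGER_0 = 5;      // values 0..8 each get their own prefix
  static final int INTEGER_255 = 14;   // "the value fits in one extra byte"

  static void serializeInt(DataOutputStream out, int val) throws IOException {
    if (val >= 0 && val <= 8) {
      out.writeByte(INTEGER_0 + val);  // a single byte: the prefix encodes the value
    } else if (val > 8 && val < 255) {
      out.writeByte(INTEGER_255);      // two bytes in total
      out.writeByte(val);              // the value itself fits in one byte
    } else {
      // larger (and negative) ints get a prefix followed by a VLQ-packed payload,
      // see "A look at: The packers" below
      throw new UnsupportedOperationException("omitted in this sketch");
    }
  }
}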

Some basic examples:

Class   | Value | Prefix           | Stored as (bytes)
Integer | 0     | INTEGER_0 = 5    | 5
Integer | 1     | INTEGER_1 = 6    | 6
Integer | 14    | INTEGER_255 = 14 | 14, 14
Integer | 233   | INTEGER_255 = 14 | 14, -23 (233 printed as a signed Java byte)

A look at: The packers

After writing the prefix, and only for some classes and values, PalDB uses Integer and Long packers. Those packers implement a universal code (VLQ) to convert ints and longs into byte arrays using as little memory as possible. Without VLQ and the prefix tricks, an uncompressed int would always take 4 bytes and a long 8 bytes, but for small values most of those bytes would contain only zeros. VLQ gets rid of those 'empty' bytes by cutting the binary representation of a positive long or int into groups of seven bits, each stored with an additional 8th flag bit.

https://commons.wikimedia.org/wiki/File:Uintvar_coding.svg


In the best case a long is compressed into a single byte (if its value fits in 7 bits); in the worst case it takes 10 bytes (that is 2 bytes more than an uncompressed long).
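
Here is a minimal varint codec in that spirit, to make the 7-bits-plus-flag idea concrete. It is my own sketch: PalDB's actual LongPacker may order the bytes differently.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class VlqSketch {
  // 7 payload bits per byte; the high bit means "more bytes follow"
  static int packLong(DataOutputStream out, long value) throws IOException {
    int written = 1;
    while ((value & ~0x7FL) != 0) {            // more than 7 significant bits left
      out.writeByte(((int) value & 0x7F) | 0x80);
      value >>>= 7;
      written++;
    }
    out.writeByte((int) value);                // last byte, continuation bit cleared
    return written;                            // 1 byte for small values, up to 10 for a full long
  }

  static long unpackLong(DataInputStream in) throws IOException {
    long result = 0;
    for (int shift = 0; ; shift += 7) {
      int b = in.readUnsignedByte();
      result |= (long) (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {                   // continuation bit cleared: we are done
        return result;
      }
    }
  }
}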


A look at: Strings serialization

Non-empty strings are also compressed with the IntPacker: the first byte is the STRING prefix, followed by the length of the string packed with VLQ, and finally each char (converted to an int) packed with VLQ.

serialize(String) => STRING_PREFIX | PACK_INT(length) | PACK_INT({char})

note: an empty String is stored as just the prefix EMPTY_STRING.
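
A self-contained sketch of that layout follows; the prefix values are placeholders and packInt is a minimal stand-in for the IntPacker described above.

import java.io.DataOutputStream;
import java.io.IOException;

class StringSerializationSketch {
  static final int STRING = 32;         // placeholder prefix values
  static final int EMPTY_STRING = 33;

  static void serializeString(DataOutputStream out, String s) throws IOException {
    if (s.isEmpty()) {
      out.writeByte(EMPTY_STRING);      // prefix only, nothing else
      return;
    }
    out.writeByte(STRING);              // prefix
    packInt(out, s.length());           // VLQ-packed length
    for (int i = 0; i < s.length(); i++) {
      packInt(out, s.charAt(i));        // each char packed as an int
    }
  }

  // minimal varint, same idea as the packer sketched in the previous section
  static void packInt(DataOutputStream out, int value) throws IOException {
    while ((value & ~0x7F) != 0) {
      out.writeByte((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.writeByte(value);
  }
}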

Indexation

If you remember the introduction, you cannot read before closing the writer: you need to open a writer, put all your data, close the writer, and only then open a reader and read.

During the writes (put), many temporary files are opened and some variables, the metadata, are constantly updated in RAM.
Keys and values are serialized and stored in different groups of files. Inside those groups, data is split according to the key length in bytes: one file contains all the keys that serialize to 1 byte, another file the keys of 2 bytes, etc. Those files are called index files. The same separation is applied to the values: one file contains all the values linked to 1-byte keys, another the values linked to 2-byte keys, etc. Those files are called data files.
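
My mental model of this write path is something like the sketch below. The file names and helpers are invented (the real logic lives in StorageWriter), and VlqSketch.packLong is the varint packer sketched earlier.

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class WriterSketch {
  private final Map<Integer, DataOutputStream> indexFiles = new HashMap<>();
  private final Map<Integer, DataOutputStream> dataFiles = new HashMap<>();
  private final Map<Integer, Long> dataOffsets = new HashMap<>();

  void put(byte[] keyBytes, byte[] valueBytes) throws IOException {
    int keyLength = keyBytes.length;                        // records are grouped by key length
    DataOutputStream index = stream(indexFiles, "temp_index_", keyLength);
    DataOutputStream data = stream(dataFiles, "temp_data_", keyLength);

    long offset = dataOffsets.getOrDefault(keyLength, 0L);
    index.write(keyBytes);                                  // the key itself...
    VlqSketch.packLong(index, offset);                      // ...and where its value sits in the data file
    data.write(valueBytes);
    dataOffsets.put(keyLength, offset + valueBytes.length);
  }

  private DataOutputStream stream(Map<Integer, DataOutputStream> files, String prefix, int keyLength)
      throws IOException {
    DataOutputStream s = files.get(keyLength);
    if (s == null) {
      s = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(prefix + keyLength + ".dat")));
      files.put(keyLength, s);
    }
    return s;
  }
}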




During the indexation, the metadata is saved to a file, the data files are kept as they are and the index files are optimized for lookups. All the files are then merged into a single file.

The metadata helps us find the position of the data (keys or values) corresponding to a given key length. Since the merge simply stacks the temporary files one after another, an offset is enough to get all the information back.




We can see that before and after the indexation, the format of the index files is the same; only the positions of the key/value_offset pairs change. The new positions depend on the hash of the key. Every key/offset pair takes the same amount of space, called a slot. In a group, all keys have the same length but the offset sizes can differ (remember we are dealing with serialized values: with VLQ, an int can take from 1 up to 5 bytes). PalDB keeps track of the longest offset and sizes every slot accordingly (the dark blue in the picture represents the unused space when an offset is shorter). The number of slots and the slot size are also stored in the metadata.

Since keys are now ordered according to their hash, finding the position of the corresponding slot during a read becomes really fast and easy.
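
In other words, with the slot count, the slot size and the start of the group taken from the metadata, the candidate slot for a key can be derived directly from its hash. A tiny sketch of my understanding (not PalDB's code):

class SlotSketch {
  static long slotPosition(long keyHash, int slotCount, int slotSize, long indexStart) {
    long slotIndex = (keyHash & 0x7FFFFFFFFFFFFFFFL) % slotCount;  // non-negative slot index
    return indexStart + slotIndex * slotSize;                      // each slot holds a key and its value offset
  }
}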

Decomposing a GET

  • Serialize the key
  • Find the key's group (using the metadata and the key length)
  • Compute the hash of the key
  • Get its slot position
  • Read the value's offset
  • Get the serialized value
  • Deserialize and return the value (the whole path is sketched below)
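
Tying it all together, here is how I picture the read path. Everything below is a sketch of my understanding: the helper methods (serialize, hash, readIndexBytes, readPackedOffset, readValue) are placeholders for the pieces described earlier, the metadata fields would exist once per key length, and linear probing on hash collisions is my assumption, not something described above.

import java.util.Arrays;

abstract class ReaderSketch {
  abstract byte[] serialize(Object key);                     // see the serialization section
  abstract Object deserialize(byte[] bytes);
  abstract long hash(byte[] serializedKey);
  abstract byte[] readIndexBytes(long position, int length); // raw bytes from the index block
  abstract long readPackedOffset(long position);             // VLQ-packed value offset
  abstract byte[] readValue(long dataPosition);              // the prefix byte tells how much to read

  // metadata for one key length, loaded when the store is opened
  long indexStart, dataStart;
  int slotCount, slotSize;

  Object get(Object key) {
    byte[] serializedKey = serialize(key);                   // 1. serialize the key (2. picks this group)
    long hash = hash(serializedKey);                         // 3. compute the hash
    for (int probe = 0; probe < slotCount; probe++) {        // 4. find the slot, probing on collisions
      long slotIndex = Math.floorMod(hash + probe, (long) slotCount);
      long slotPos = indexStart + slotIndex * slotSize;
      byte[] candidate = readIndexBytes(slotPos, serializedKey.length);
      if (Arrays.equals(candidate, serializedKey)) {
        long offset = readPackedOffset(slotPos + serializedKey.length);  // 5. the value's offset
        return deserialize(readValue(dataStart + offset));   // 6-7. read and deserialize the value
      }
    }
    return null;                                             // key not in the store
  }
}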