Overview

What is OlegDB

OlegDB is a concurrent, pretty fast K/V hashtable with a Go frontend. It uses the Murmur3 hashing algorithm to hash and index keys. We chose Go for the server because it makes it easy to rapidly create a performant HTTP frontend and has all the tools in core to prevent race conditions.

In addition to this, liboleg is the C library that powers everything. liboleg exports a relatively simple API for use in other applications. We build the main database off of this library.

Installation

Installing OlegDB is pretty simple. You only need a POSIX-compliant system, make, gcc or clang (that's all we test) and Go. You'll also need the source code for OlegDB.

Once you have your fanciful medley of computer science tools, you're ready to dive into a lengthy and complex process of program compilation. Sound foreboding? Have no fear, people have been doing this for at least a quarter of a century.

I'm going to assume you've extracted the source tarball into a folder called ~/src/olegdb and that you haven't cd'd into it yet. Let's smash some electrons together:

$ cd ~/src/olegdb
$ make
$ sudo make install

If you really wanted to, you could specify a different installation directory. The default is /usr/local. You can do this by setting the PREFIX variable before compilation:

$ sudo make PREFIX=/usr/ install

Actually running OlegDB and getting it to do stuff after this point is trivial. If your installation prefix is in your PATH variable, you should be able to run something like the following:

$ olegdb -config /path/to/json/config

OlegDB ships with a default configuration file, olegdb.conf.sample, which will get you up and running.

Getting Started

Communicating with OlegDB is done via a pretty simple REST interface. You POST to create/update records, GET to retrieve them, DELETE to delete, and HEAD to get back some information about them. Probably.

For example, to store the value Raphael into the database named turtles under the key red you could use something like the following:

$ curl -X POST -d 'Raphael' http://localhost:8080/turtles/red

Retrieving data is just as simple:

$ curl http://localhost:8080/turtles/red

Deleting keys can be done by using DELETE:

$ curl -X DELETE http://localhost:8080/turtles/red

OlegDB supports lazy key expiration. You can specify an expiration date by setting the X-OlegDB-use-by header to a *UTC* POSIX timestamp.

$ curl -v -X POST \
-H "X-OlegDB-use-by: $(date +%s)" \
-d '{turtle: "Johnny", age: 34}' http://localhost:8080/turtles/Johnny
> POST /turtles/Johnny HTTP/1.1
> User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0
> Host: localhost:8080
> Accept: */*
> X-OlegDB-use-by: 1394323192
> Content-Type: application/octet-stream
> Content-Length: 27
>
* upload completely sent off: 27out of 27 bytes
< HTTP/1.1 200 OK
< Server: OlegDB/fresh_cuts_n_jams
< Content-Type: text/plain
< Connection: close
< Content-Length: 7
<
無駄

And then when we try to get it back out again:

$ curl -v http://localhost:8080/turtles/Johnny
> GET /turtles/Johnny HTTP/1.1
> User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0
> Host: localhost:8080
> Accept: */*
>
< HTTP/1.1 404 Not Found
< Status: 404 Not Found
< Server: OlegDB/fresh_cuts_n_jams
< Content-Length: 26
< Connection: close
< Content-Type: text/plain
<
These aren't your ghosts.

As you can hopefully tell, the POST succeeds and a 200 OK is returned. We used the shell command date +%s, which returns the current time as a POSIX timestamp, so the key is set to expire the moment it is stored. Immediately trying to access the key again therefore results in a 404, because the key has already expired.
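To make a key live longer than "right now", the header value needs to be a timestamp in the future. The following is a minimal sketch in C of building such a header line; the function name sketch_format_use_by and the one-hour TTL in the usage note are assumptions for illustration and are not part of OlegDB.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper (not part of OlegDB): format an X-OlegDB-use-by
 * header line for a key that should expire ttl_seconds after `now`.
 * Returns the number of characters written, or -1 if buf is too small. */
int sketch_format_use_by(char *buf, size_t buflen, time_t now, long ttl_seconds) {
    assert(buf != NULL);
    /* POSIX timestamps count seconds since the epoch in UTC. */
    long long expires = (long long)now + ttl_seconds;
    int n = snprintf(buf, buflen, "X-OlegDB-use-by: %lld", expires);
    return (n < 0 || (size_t)n >= buflen) ? -1 : n;
}
```

Calling sketch_format_use_by(buf, sizeof buf, time(NULL), 3600) would produce a header that asks for the key to expire in roughly an hour.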

If you want to retrieve the expiration date of a key, you can do so by sending HEAD:

$ curl -v -X HEAD http://localhost:8080/turtles/Johnny
> HEAD /turtles/Johnny HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8080
> Accept: */*
>
< HTTP/1.1 200 OK
* Server OlegDB/fresh_cuts_n_jams is not blacklisted
< Server: OlegDB/fresh_cuts_n_jams
< Content-Length: 0
< Content-Type: application/octet-stream
< Expires: 1395368972
<

Cursor Iteration

In 0.1.2, we added the ability to iterate through keys inserted into the database via the frontend. It's a pretty simple interface and follows the rest of the current URL idioms.

Each cursor operand is of the form /database/key/operand. For some operands (_first and _last) the key is omitted and the operand takes its place, giving /database/operand. Using them is trivial.

Using any of the operands will return both the value of the key you requested (_next will return the next value, _prev will return the previous value, etc.) and the HTTP header X-Olegdb-Key followed by the key paired to the value you just retrieved. For example, say we have two keys in the database, aaa and bbb. To begin with, I can request the first key in the database:

$ curl -i localhost:8080/oleg/_first
HTTP/1.1 200 OK
X-Olegdb-Key: aaa
Date: Sun, 28 Sep 2014 07:23:39 GMT
Content-Length: 21
Content-Type: text/plain; charset=utf-8

I am the value of aaa

As you can see, the key aaa is the first one in the tree of ordered keys. If you're paying attention, you've also noticed that I've omitted the parameter between the database specifier and the cursor operand _first. This is because the key is not used in this command. It will, however, be used in the next:

$ curl -i localhost:8080/oleg/aaa/_next
HTTP/1.1 200 OK
X-Olegdb-Key: bbb
Date: Sun, 28 Sep 2014 07:24:16 GMT
Content-Length: 21
Content-Type: text/plain; charset=utf-8

I am the value of bbb

Logically, the key bbb follows the key aaa. Nice. In our request, we asked Oleg for the key after (_next) the key aaa. The value of the next key was returned, and we can see the header X-Olegdb-Key is set to the key that corresponds to that value. Let's see what happens if we try to get the next key, knowing that we only have two keys (aaa and bbb) in our database:

$ curl -i localhost:8080/oleg/bbb/_next
HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
Date: Sun, 28 Sep 2014 07:24:26 GMT
Content-Length: 17

No records found

We get a 404 status code and a message to match. This informs us that we cannot iterate any further and that we have reached the end of the list.

In addition to these two commands, you can use _last to find the last key in the database and _prev to iterate backwards. The usage of these commands is identical to those above:

$ curl -i localhost:8080/oleg/_last
HTTP/1.1 200 OK
X-Olegdb-Key: bbb
Date: Sun, 28 Sep 2014 07:24:50 GMT
Content-Length: 21
Content-Type: text/plain; charset=utf-8

I am the value of bbb

$ curl -i localhost:8080/oleg/bbb/_prev
HTTP/1.1 200 OK
X-Olegdb-Key: aaa
Date: Sun, 28 Sep 2014 07:25:06 GMT
Content-Length: 21
Content-Type: text/plain; charset=utf-8

I am the value of aaa

$ curl -i localhost:8080/oleg/aaa/_prev
HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
Date: Sun, 28 Sep 2014 07:25:17 GMT
Content-Length: 17

No records found
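The cursor semantics shown in these transcripts can be modelled in a few lines of C. This is only a sketch of the behaviour as seen from the outside: it walks a sorted array rather than OlegDB's splay tree, every name in it (sketch_keyspace, sketch_first and so on) is hypothetical, and a NULL return stands in for the 404 response. How the real server treats a cursor key that does not exist is not covered here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the ordered key set: a sorted array,
 * i.e. what an in-order walk of the key tree would yield. */
typedef struct {
    const char *const *keys;
    size_t             n;
} sketch_keyspace;

/* _first and _last take no key. NULL plays the role of a 404. */
const char *sketch_first(const sketch_keyspace *ks) {
    return ks->n ? ks->keys[0] : NULL;
}

const char *sketch_last(const sketch_keyspace *ks) {
    return ks->n ? ks->keys[ks->n - 1] : NULL;
}

/* _next: smallest key strictly greater than `key`, or NULL at the end. */
const char *sketch_next(const sketch_keyspace *ks, const char *key) {
    assert(ks != NULL && key != NULL);
    for (size_t i = 0; i < ks->n; i++)
        if (strcmp(ks->keys[i], key) > 0)
            return ks->keys[i];
    return NULL;
}

/* _prev: largest key strictly less than `key`, or NULL at the start. */
const char *sketch_prev(const sketch_keyspace *ks, const char *key) {
    assert(ks != NULL && key != NULL);
    for (size_t i = ks->n; i > 0; i--)
        if (strcmp(ks->keys[i - 1], key) < 0)
            return ks->keys[i - 1];
    return NULL;
}
```

With the two keys aaa and bbb, sketch_next on aaa yields bbb and sketch_next on bbb yields NULL, matching the 200 and 404 responses above.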

Prefix Matching

In addition to cursor iteration, 0.1.2 added the ability to return the keys that match a given prefix. Use of this feature follows the same URL layout as its predecessors, mainly via the use of the _match qualifier.

For example, say I have three keys in the database, test_a, test_b and test_c. I can easily find these keys in one operation by using the _match operand. To demonstrate:

$ curl -i localhost:8080/oleg/test/_match
HTTP/1.1 200 OK
Date: Sun, 28 Sep 2014 07:26:35 GMT
Content-Length: 20
Content-Type: text/plain; charset=utf-8

test_a
test_b
test_c

This returns a list of all the keys separated by \n. Also of note is the X-Olegdb-Num-Matches header which specifies the number of keys that matched the given prefix.

If no matches are present, a 404 is returned.

Similar to prefix matching, you can also just dump the entire keyspace using _all. Keep in mind, however, that this can be an expensive operation.

$ curl -i localhost:38080/waifu/_all | head
HTTP/1.1 200 OK
Content-Length: 27863
X-Olegdb-Num-Matches: 401
Date: Sun, 11 Jan 2015 21:26:08 GMT
Content-Type: text/plain; charset=utf-8

alias50B224D2C7987CE4F51E9258707758841771C82E9A0D3395C849426F6E93B8A85FE94AB42A00845C
alias70170858147E2B26DD5370D9F97113E0D7FDA993A707D5B0304272E93BA9A031372339E4C8F94AA2
alias70170858147E2B26DD5370D9F97113E0D7FDA993A707D5B0304272E93BA9A031383CEF2534DF870A
alias70170858147E2B26DD5370D9F97113E0D7FDA993A707D5B0304272E93BA9A031717594C273021004
...

Technical Internals

Hash Table

At its core, OlegDB is just a hashtable. On a good day, this means operations are O(1). Since we use linked lists to handle collisions, the worst-case scenario for operations is O(n). This usually doesn't happen.

Rehashing happens when we run out of space. To handle this, we currently allocate a new block of memory, rehash all keys and move everything over. This is a blocking operation. If you know you're going to have a lot of keys and want to avoid this, you can tweak the HASH_MALLOC parameter before compilation. This controls the default amount of space that OlegDB will allocate.
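For readers who want to see why chained collisions turn an O(1) lookup into an O(n) one, here is a minimal sketch of separate chaining in C. It is not liboleg's code: the toy_* names and the 8-slot table are hypothetical, values are plain ints, and FNV-1a stands in for Murmur3 purely to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_SLOTS 8  /* hypothetical; tiny on purpose so chains form */

typedef struct toy_bucket {
    const char        *key;
    int                value;
    struct toy_bucket *next;  /* next bucket in this slot's chain, or NULL */
} toy_bucket;

/* FNV-1a, used here in place of Murmur3 (an assumption for brevity). */
static uint32_t toy_hash(const char *key, size_t klen) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < klen; i++) {
        h ^= (unsigned char)key[i];
        h *= 16777619u;
    }
    return h;
}

/* Push a caller-owned bucket onto the front of its slot's chain. */
void toy_insert(toy_bucket *table[TOY_SLOTS], toy_bucket *b) {
    assert(table != NULL && b != NULL);
    size_t slot = toy_hash(b->key, strlen(b->key)) % TOY_SLOTS;
    b->next = table[slot];
    table[slot] = b;
}

/* Hash to a slot, then walk its chain comparing keys. A short chain is
 * effectively O(1); if every key lands in one slot the walk is O(n). */
toy_bucket *toy_find(toy_bucket *table[TOY_SLOTS], const char *key) {
    assert(table != NULL && key != NULL);
    size_t slot = toy_hash(key, strlen(key)) % TOY_SLOTS;
    for (toy_bucket *b = table[slot]; b != NULL; b = b->next)
        if (strcmp(b->key, key) == 0)
            return b;
    return NULL;
}
```

Inserting more keys than there are slots guarantees at least one chain longer than one bucket, which is the situation a larger initial table is meant to postpone.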

Splay Trees

In addition to the hash table, OlegDB also keeps track of currently inserted nodes via a splay tree. Going over the intricacies of splay trees is a little outside the scope of this documentation, but we do use it for several things and for several reasons.

In their simplified form, splay trees are just a specialized form of a self-balancing binary search tree. This means that searching for any given key in the tree is an O(log n) operation and can be done relatively quickly.

In addition to being a binary tree, a splay tree has the property of moving recently inserted keys to the top of the tree. This is known as a splaying operation. While some splay tree implementations splay on read, write and deletion, OlegDB only splays keys to the top of the tree upon insertion and deletion. We figured that, since the splay tree is at most a secondary structure in the Oleg ecosystem, we wanted its impact to be minimal.

With splay trees installed, we can now iterate through the tree in a timely and efficient manner, and what's more, in a user-decided order.

Binary trees have several modes of traversal that can be useful in a database context. Traversing the tree in-order gives the user the ability to retrieve keys alphabetically, while traversing in a pre-ordered fashion will show the user when the keys were inserted.

Besides key-traversal, splay trees are used for prefix matching. Since binary trees are inherently sorted, we can iterate through one much faster than we could a list.
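The reason sorted order helps with prefix matching is that all keys sharing a prefix sit next to each other, so a search can jump to the first candidate and stop at the first non-match. The sketch below illustrates that idea in C using a sorted array and a binary search rather than a splay tree; it is an illustration of the principle, not OlegDB's implementation, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Index of the first key that is >= prefix in a sorted array. */
static size_t sketch_lower_bound(const char *const *keys, size_t n,
                                 const char *prefix) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(keys[mid], prefix) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Count the keys that start with `prefix`. On return, *first holds the
 * index of the first match (meaningful only when the count is > 0).
 * Matches are contiguous because the array is sorted. */
size_t sketch_prefix_match(const char *const *keys, size_t n,
                           const char *prefix, size_t *first) {
    assert(keys != NULL && prefix != NULL && first != NULL);
    size_t plen = strlen(prefix);
    size_t start = sketch_lower_bound(keys, n, prefix);
    size_t count = 0;
    while (start + count < n &&
           strncmp(keys[start + count], prefix, plen) == 0)
        count++;
    *first = start;
    return count;
}
```

An unsorted list would have to compare the prefix against every key; here only the matching run and one extra key are touched after the initial search.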

Splay trees can be turned on/off by changing how you open a database. See ol_feature_flags for a complete list of toggleable parameters.

LZ4 Compression

OlegDB uses the super-fast LZ4 compression algorithm to keep values on disk at a smaller size while maintaining a low impact on insertion/deletion.

This is a toggleable feature. See ol_feature_flags for more detail.

AOL File

The Append Only Log file is how Oleg keeps track of state outside of its values file. Every time a change occurs to OlegDB, that command is written to the AOL file. This is what allows OlegDB to be persistent.

Every now and then the AOL file needs to be squished (compacted) to remove old and expired data. The AOL file is designed to be human readable (mostly) so you can tell at a glance what's going on with your database.

Values File

The values file augments the AOL file in persisting state to the disk. The values file is basically all of your data, more or less aligned in four megabyte blocks.

Starting in 0.1.2, this is how we store data on disk. Previously all values were stored in the AOL file. Now the values file is mmap()'d into RAM on database startup, allowing you to hold datasets bigger than memory.
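As a rough illustration of the mmap() approach, the following C sketch grows a file to 4 MB, maps it, and copies a value in at a byte offset. It assumes a POSIX system; the sketch_* names and SKETCH_VALUES_SIZE are hypothetical, and liboleg's real layout, growth and error handling are not shown.

```c
#define _POSIX_C_SOURCE 200809L  /* for ftruncate */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_VALUES_SIZE 4194304  /* mirrors VALUES_DEFAULT_SIZE (4 MB) */

/* Grow the file behind fd to `size` bytes and map it read/write.
 * Returns the mapping, or NULL on failure. */
unsigned char *sketch_map_values(int fd, size_t size) {
    assert(fd >= 0 && size > 0);
    if (ftruncate(fd, (off_t)size) != 0)
        return NULL;
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : (unsigned char *)p;
}

/* Copy a value into the mapping at `offset`. With MAP_SHARED the kernel
 * writes the page back to the file; no explicit write() is needed.
 * Returns 0 on success, 1 if the value would not fit. */
int sketch_store(unsigned char *values, size_t map_size, size_t offset,
                 const unsigned char *data, size_t dsize) {
    assert(values != NULL && data != NULL);
    if (offset > map_size || dsize > map_size - offset)
        return 1;
    memcpy(values + offset, data, dsize);
    return 0;
}
```

Reading a value back is then just pointer arithmetic on the mapping, which is presumably why a bucket only needs to remember an offset and a size.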

liboleg

Macros

VERSION

#define VERSION "0.1.5"

The current version of OlegDB.

KEY_SIZE

#define KEY_SIZE 250

The hardcoded upper bound for key lengths.

HASH_MALLOC

#define HASH_MALLOC 65536

The size, in bytes, to allocate when initially creating the database. ol_bucket pointers are stored here.

PATH_LENGTH

#define PATH_LENGTH 256

The maximum length of a database's path.

DB_NAME_SIZE

#define DB_NAME_SIZE 64

Database maximum name length.

DEVILS_SEED

#define DEVILS_SEED 666

The seed to feed into the murmur3 algorithm.

VALUES_FILENAME

#define VALUES_FILENAME "val"

The file extension used for the values file on disk.

VALUES_DEFAULT_SIZE

#define VALUES_DEFAULT_SIZE 4194304

The default size of the values file on disk. 4 MB by default.

AOL_FILENAME_ALLOC

#define AOL_FILENAME_ALLOC 512

The number of bytes we allocate for the filename of the AOL file.

AOL_FILENAME

#define AOL_FILENAME "aol"

The file extension used for the AOL file.

Type Definitions

ol_key_array

typedef char **ol_key_array;

This is shorthand for a pointer to an array of keys, the same kind of key stored in an ol_bucket's key[KEY_SIZE].

Enums

ol_feature_flags

typedef enum {
    OL_F_APPENDONLY                 = 1 << 0,
    OL_F_SPLAYTREE                  = 1 << 1,
    OL_F_LZ4                        = 1 << 2,
    OL_F_AOL_FFLUSH                 = 1 << 3,
    OL_F_DISABLE_TX                 = 1 << 4
} ol_feature_flags;

Feature flags tell the database what it should be doing.

OL_F_APPENDONLY: Enable the append only log. This is a write-only logfile for simple persistence.

OL_F_SPLAYTREE: Whether or not to enable the splay tree in the server. This can have a performance impact.

OL_F_LZ4: Enable LZ4 compression.

OL_F_AOL_FFLUSH: Make sure AOL data is REAAAALLY written to disk. This will run fflush after every AOL write. Otherwise, fsync only.

OL_F_DISABLE_TX: Disable transactions on this database.

ol_state_flags

typedef enum {
    OL_S_STARTUP        = 0,
    OL_S_AOKAY          = 1,
    OL_S_COMMITTING     = 2
} ol_state_flags;

State flags describe what the database is currently doing.

OL_S_STARTUP: Startup state. The DB is starting, duh.

OL_S_AOKAY: The normal operating state, the database is a-okay.

OL_S_COMMITTING: The database is committing a transaction. It doesn't want to do anything else.

Structures

ol_bucket

typedef struct ol_bucket {
    char                key[KEY_SIZE]; /* The key used to reference the data */
    size_t              klen;
    size_t              data_offset;
    size_t              data_size;
    size_t              original_size;
    struct ol_bucket    *next; /* The next ol_bucket in this chain, if any */
    struct tm           *expiration;
    ol_splay_tree_node  *node;
    int                 tx_id;
} ol_bucket;

This is the object stored in the database's hashtable. Contains references to value, key, etc.

key[KEY_SIZE]: The key used for this bucket.

klen: Length of the key.

data_offset: Location of this key's value (data) in the values file on disk.

data_size: Length of the value (data) in bytes. This is the size of the data stored in memory.

original_size: Length of the value (data) in bytes. This is the original length of the data we received, non-compressed.

next: Collisions are resolved via linked list. This contains the pointer to the next object in the chain, or NULL.

expiration: The time at which this key will expire, stored as a pointer to a struct tm.

*node: A pointer to this object's node in the splay tree.

tx_id: If this record is a part of a transaction, then this will be a non-negative integer.

ol_meta

typedef struct ol_meta {
    time_t      created;
    int         key_collisions;
} ol_meta;

Structure used to record meta-information about the database.

created: When the database was created.

key_collisions: The number of keys that have collided over the lifetime of this database.

ol_database

typedef struct ol_database {
    void            (*get_db_file_name)(const struct ol_database *db,const char *p,char*);
    void            (*enable)(int, int*);
    void            (*disable)(int, int*);
    bool            (*is_enabled)(const int, const int*);
    char            name[DB_NAME_SIZE];
    char            path[PATH_LENGTH];
    char            aol_file[AOL_FILENAME_ALLOC];
    FILE            *aolfd;
    int             feature_set;
    short int       state;
    int             rcrd_cnt;
    size_t          cur_ht_size;
    ol_bucket       **hashes;
    unsigned char   *values;
    int             valuesfd;
    size_t          val_size;
    ol_splay_tree   *tree;
    ol_splay_tree   *cur_transactions;
    ol_meta         *meta;
} ol_database;

The object representing a database. This is used in almost every ol_* function to store state and your data.

get_db_file_name: A function pointer that returns the path to the location of the db file to reduce code duplication. Used for writing and reading of dump files.

enable: Helper function to enable a feature for the database instance passed in. See ol_feature_flags

disable: Helper function to disable a database feature. See ol_feature_flags

is_enabled: Helper function that checks whether or not a feature flag is enabled.

name: The name of the database.

path[PATH_LENGTH]: Path to the database's working directory.

aol_file: Path and filename of the append only log.

aolfd: Pointer of FILE type to append only log.

feature_set: Bitmask holding enabled/disabled status of various features. See ol_feature_flags.

state: Current state of the database. See ol_state_flags.

rcrd_cnt: Number of records in the database.

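key_collisions: Number of key collisions this database has had since initialization. Stored in the ol_meta struct that meta points to.

created: Timestamp of when the database was initialized. Stored in the ol_meta struct that meta points to.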

cur_ht_size: The current amount, in bytes, of space allocated for storing ol_bucket objects.

**hashes: The actual hashtable. Stores ol_bucket instances.

*values: This is where values for hashes are stored. This is a pointer to an mmap()'d region of memory.

valuesfd: The file descriptor of the values file.

val_size: The size of the sum total of records in the db, in bytes. It is not the size of the file on disk.

*tree: A pointer to the splay tree holding the ordered list of keys.

*cur_transactions: The current open/uncommitted transactions. Represented with a splay tree.

*meta: A pointer to a struct holding extra meta information. See ol_meta for more information.

Functions

ol_open

ol_database *ol_open(const char *path, const char *name, int features);

Opens a database for use.

*path: The directory where the database will be stored.

*name: The name of the database. This is used to create the dumpfile, and keep track of the database.

features: Features to enable when the database is initialized. You can logically OR multiple features together.

Returns: A new database object. NULL on failure.

ol_close

int ol_close(ol_database *database);

Closes a database cleanly and frees memory.

*database: The database to close.

Returns: 0 on success, 1 if not everything could be freed.

ol_unjar

int ol_unjar(ol_database *db, const char *key, size_t klen, unsigned char **data, size_t *dsize);

This function retrieves a value from the database. data must be freed after calling this function! It also writes the size of the data to dsize. Pass dsize as NULL if you don't care.

*db: Database to retrieve value from.

*key: The key to use.

klen: The length of the key to use.

data: This parameter will be filled out with the data found in the DB. Passing NULL will check if a key exists.

*dsize: Optional parameter that will be filled out with the size of the data, if NULL is not passed in.

Returns: 0 on success, 1 on failure or if the key was not found.

ol_jar

int ol_jar(ol_database *db, const char *key, size_t klen, const unsigned char *value, size_t vsize);

This is OlegDB's canonical 'set' function. Put a value into the mayo (the database). It's easy to piss in a bucket, it's not easy to piss in 19 jars.

*db: Database to set the value to.

*key: The key to use.

klen: The length of the key.

*value: The value to insert.

vsize: The size of the value in bytes.

Returns: 0 on success.

ol_expiration_time

struct tm *ol_expiration_time(ol_database *db, const char *key, size_t klen);

Retrieves the expiration time for a given key from the database.

*db: Database to retrieve the expiration time from.

*key: The key to use.

klen: The length of the key.

Returns: Stored struct tm * representing the time that this key will expire, or NULL if not found.

ol_scoop

int ol_scoop(ol_database *db, const char *key, size_t klen);

Removes an object from the database. Get that crap out of the mayo jar.

*db: Database to remove the value from.

*key: The key to use.

klen: The length of the key.

Returns: 0 on success, and 1 or 2 if the object could not be deleted.

ol_uptime

int ol_uptime(ol_database *db);

Gets the time, in seconds, that a database has been up.

*db: Database to retrieve value from.

Returns: Uptime in seconds since database initialization.

ol_spoil

int ol_spoil(ol_database *db, const char *key, size_t klen, struct tm *expiration_date);

Sets the expiration value of a key. Will fail if no ol_bucket under the chosen key exists.

*db: Database to set the value to.

*key: The key to use.

klen: The length of the key.

expiration_date: The UTC time to set the expiration to.

Returns: 0 upon success, 1 if otherwise.
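The REST interface takes expiration as a POSIX timestamp, while ol_spoil takes a struct tm in UTC. A caller holding a timestamp therefore needs to convert it first. The following is a small sketch of that conversion using POSIX gmtime_r; the function name sketch_expiry_from_epoch is hypothetical and the 0/1 return values simply imitate the convention used by the functions documented here.

```c
#define _POSIX_C_SOURCE 200809L  /* for gmtime_r */
#include <assert.h>
#include <time.h>

/* Hypothetical helper: fill `out` with the UTC broken-down time for a
 * POSIX timestamp. Returns 0 on success, 1 on failure. */
int sketch_expiry_from_epoch(time_t when, struct tm *out) {
    assert(out != NULL);
    return gmtime_r(&when, out) != NULL ? 0 : 1;
}
```

The filled-in struct tm could then be passed as the expiration_date argument. Remember that tm_year counts years since 1900 and tm_mon counts months from 0.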

ol_ht_bucket_max

int ol_ht_bucket_max(size_t ht_size);

Does some sizeof witchery to return the maximum current size of the database. This is mostly an internal function, exposed to reduce code duplication.

*ht_size: The size you want to divide by sizeof(ol_bucket).

Returns: The maximum possible bucket slots for db.

ol_prefix_match

int ol_prefix_match(ol_database *db, const char *prefix, size_t plen, ol_key_array *data);

Returns the keys that match a given prefix.

*db: Database to retrieve values from.

*prefix: The prefix to attempt matches on.

plen: The length of the prefix.

*data: A pointer to an ol_key_array object where the list of matching keys will be stored. Both the list and its items must be freed after use.

Returns: -1 on failure, otherwise the number of keys in the database that matched the prefix.

ol_key_dump

int ol_key_dump(ol_database *db, ol_key_array *data);

Like ol_prefix_match, except that it takes no prefix and just dumps the entire tree.

*db: Database to retrieve values from.

*data: A pointer to an ol_key_array object where the list of keys will be stored. Both the list and its items must be freed after use.

Returns: -1 on failure, otherwise the number of keys in the database.

ol_exists

int ol_exists(ol_database *db, const char *key, size_t klen);

Returns whether the given key exists in the database.

*db: Database the key should be in.

*key: The key to check.

klen: The length of the key.

Returns: 0 if the key exists, 1 otherwise.

ol_get_bucket

ol_bucket *ol_get_bucket(const ol_database *db, const char *key, const size_t klen,
                         char (*_key)[KEY_SIZE], size_t *_klen);

Utility function to retrieve an ol_bucket object from the database for a given key.

*db: Database the bucket should be in.

*key: The key to check.

klen: The length of the key.

**_key: The truncated key will be filled in at this address.

*_klen: The length of the truncated key.

Returns: The bucket if it exists, otherwise NULL.

ol_squish

int ol_squish(ol_database *db);

Compacts both the aol file (if enabled) and the values file. This is a blocking operation.

*db: The database to compact.

Returns: 0 if successful, 1 if otherwise.

ol_cas

int ol_cas(ol_database *db, const char *key, const size_t klen,
                            unsigned char *value, size_t vsize,
                            const unsigned char *ovalue, const size_t ovsize);

ol_jar operation that atomically compares-and-swaps old data for new data.

*db: The database to operate on.

*key: The key to check.

klen: The length of the key.

*value: The value to insert.

vsize: The size of the value in bytes.

*ovalue: The old value to compare against.

ovsize: The size of the old value in bytes.

Returns: 0 on success.