Please send comments, corrections and angry letters to Quinlan Pfiffer.
OlegDB
liboleg
Overview ¶
What is OlegDB ¶
OlegDB is a concurrent, pretty fast K/V hashtable with a Go frontend. It uses the Murmur3 hashing algorithm to hash and index keys. We chose Go for the server because it is easy to rapidly create an HTTP frontend that is performant and has all the tools in core to prevent race conditions.
In addition to this, liboleg is the C library that powers everything. liboleg exports a relatively simple API for use in other applications. We build the main database off of this library.
Installation ¶
Installing OlegDB is pretty simple. You only need a POSIX-compliant system, make, gcc/clang (that's all we test) and Go. You'll also need the source code for OlegDB.
Once you have your fanciful medley of computer science tools, you're ready to dive into a lengthy and complex process of program compilation. Sound foreboding? Have no fear, people have been doing this for at least a quarter of a century.
I'm going to assume you've extracted the source tarball into a folder called ~/src/olegdb and that you haven't cd'd into it yet. Let's smash some electrons together:
$ cd ~/src/olegdb
$ make
$ sudo make install
If you really wanted to, you could specify a different installation directory. The default is /usr/local. You can do this by setting the PREFIX variable before compilation:
$ sudo make PREFIX=/usr/ install
Actually running OlegDB and getting it to do stuff after this point is trivial. If your installation prefix is in your PATH variable, you should just be able to run something like the following:
$ olegdb -config /path/to/json/config
OlegDB ships with a default configuration file, olegdb.conf.sample, which will get you up and running.
Getting Started ¶
Communicating with OlegDB is done via a pretty simple REST interface. You POST to create/update records, GET to retrieve them, DELETE to delete, and HEAD to get back some information about them. Probably.
For example, to store the value Raphael into the database named turtles under the key red you could use something like the following:
$ curl -X POST -d 'Raphael' http://localhost:8080/turtles/red
Retrieving data is just as simple:
$ curl http://localhost:8080/turtles/red
Deleting keys can be done by using DELETE:
$ curl -X DELETE http://localhost:8080/turtles/red
OlegDB supports lazy key expiration. You can specify an expiration date by setting the X-OlegDB-use-by header to a UTC POSIX timestamp.
$ curl -v -X POST \
-H "X-OlegDB-use-by: $(date +%s)" \
-d '{turtle: "Johnny", age: 34}' http://localhost:8080/turtles/Johnny
> POST /turtles/Johnny HTTP/1.1
> User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0
> Host: localhost:8080
> Accept: */*
> X-OlegDB-use-by: 1394323192
> Content-Type: application/octet-stream
> Content-Length: 27
>
* upload completely sent off: 27out of 27 bytes
< HTTP/1.1 200 OK
< Server: OlegDB/fresh_cuts_n_jams
< Content-Type: text/plain
< Connection: close
< Content-Length: 7
<
無駄
And then when we try to get it back out again:
$ curl -v http://localhost:8080/turtles/Johnny
> GET /turtles/Johnny HTTP/1.1
> User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0
> Host: localhost:8080
> Accept: */*
>
< HTTP/1.1 404 Not Found
< Status: 404 Not Found
< Server: OlegDB/fresh_cuts_n_jams
< Content-Length: 26
< Connection: close
< Content-Type: text/plain
<
These aren't your ghosts.
As you can hopefully tell, the POST succeeds and a 200 OK is returned. We used the shell command date +%s, which returns the current time as a POSIX timestamp, so the key expires the moment it is stored. Immediately trying to access the key again then results in a 404, because the key expired.
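For a client written in C, the same header can be built from a time-to-live. This is only a sketch: format_use_by and is_expired are hypothetical helpers, not part of OlegDB, and the exact comparison the server uses when deciding a key has expired is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: write the X-OlegDB-use-by header for a key that
 * should expire ttl_seconds after `now`. Returns the number of characters
 * written, or -1 if the buffer is too small. */
int format_use_by(char *buf, size_t len, time_t now, long ttl_seconds)
{
    long long expires = (long long)now + ttl_seconds;
    int n;
    assert(buf != NULL);
    n = snprintf(buf, len, "X-OlegDB-use-by: %lld", expires);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Lazy expiration: nothing is deleted on a timer. A key is simply treated
 * as missing once its timestamp has passed. Treating "equal" as expired
 * is an assumption of this sketch. */
int is_expired(time_t expires, time_t now)
{
    return now >= expires;
}
```

Passing a ttl_seconds of 0 reproduces the date +%s example above: the key is already expired by the time you ask for it.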
If you want to retrieve the expiration date of a key, you can do so by sending HEAD:
$ curl -v -X HEAD http://localhost:8080/turtles/Johnny
> HEAD /turtles/Johnny HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8080
> Accept: */*
>
< HTTP/1.1 200 OK
* Server OlegDB/fresh_cuts_n_jams is not blacklisted
< Server: OlegDB/fresh_cuts_n_jams
< Content-Length: 0
< Content-Type: application/octet-stream
< Expires: 1395368972
<
Cursor Iteration ¶
In 0.1.2, we added the ability to iterate through keys inserted into the database via the frontend. It's a pretty simple interface and follows the rest of the current URL idioms.
Each cursor operand is of the form /database/key/operand. For some operands (_last and _first) the key is left out and the operand takes its place. Using them is trivial.
Using any of the operands will return both the value of the key you requested (_next will return the next value, _prev will return the previous value, etc.) and the HTTP header X-Olegdb-Key followed by the key paired to the value you just retrieved. For example, say we have two keys in the database, aaa and bbb. To begin with, I can request the first key in the database:
$ curl -i localhost:8080/oleg/_first
HTTP/1.1 200 OK
X-Olegdb-Key: aaa
Date: Sun, 28 Sep 2014 07:23:39 GMT
Content-Length: 21
Content-Type: text/plain; charset=utf-8
I am the value of aaa
As you can see, the key aaa is the first one in the tree of ordered keys. If you're paying attention, you've also noticed that I've omitted the parameter between the database specifier and the cursor operand _first. This is because the key is not used in this command. It will, however, be used in the next:
$ curl -i localhost:8080/oleg/aaa/_next
HTTP/1.1 200 OK
X-Olegdb-Key: bbb
Date: Sun, 28 Sep 2014 07:24:16 GMT
Content-Length: 21
Content-Type: text/plain; charset=utf-8
I am the value of bbb
Logically, the key bbb follows the key aaa. Nice. In our request, we asked Oleg for the key after (_next) the key aaa. The value of the next key was returned, and we can see the header X-Olegdb-Key is set to the key that corresponds to that value. Let's see what happens if we try to get the next key, knowing that we only have two keys (aaa and bbb) in our database:
$ curl -i localhost:8080/oleg/bbb/_next
HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
Date: Sun, 28 Sep 2014 07:24:26 GMT
Content-Length: 17
No records found
We get a 404 status code and a message to match. This informs us that we cannot iterate any further and that we have reached the end of the list.
In addition to these two commands, you can use _last to find the last key in the database and _prev to iterate backwards. The usage of these commands is identical to those above:
$ curl -i localhost:8080/oleg/_last
HTTP/1.1 200 OK
X-Olegdb-Key: bbb
Date: Sun, 28 Sep 2014 07:24:50 GMT
Content-Length: 21
Content-Type: text/plain; charset=utf-8
I am the value of bbb
$ curl -i localhost:8080/oleg/bbb/_prev
HTTP/1.1 200 OK
X-Olegdb-Key: aaa
Date: Sun, 28 Sep 2014 07:25:06 GMT
Content-Length: 21
Content-Type: text/plain; charset=utf-8
I am the value of aaa
$ curl -i localhost:8080/oleg/aaa/_prev
HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
Date: Sun, 28 Sep 2014 07:25:17 GMT
Content-Length: 17
No records found
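If you are driving the cursor interface from C, the request paths can be assembled with a small helper. This is a sketch only: cursor_path is a hypothetical name, and it assumes database and key names need no URL escaping.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: build the request path for a cursor operand.
 * Pass key == NULL for _first and _last, which take no key.
 * Returns the path length, or -1 if the buffer is too small. */
int cursor_path(char *buf, size_t len, const char *db,
                const char *key, const char *operand)
{
    int n;
    assert(buf != NULL && db != NULL && operand != NULL);
    if (key == NULL)
        n = snprintf(buf, len, "/%s/%s", db, operand);
    else
        n = snprintf(buf, len, "/%s/%s/%s", db, key, operand);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

A full walk starts at the _first path, then keeps requesting key/_next with whatever X-Olegdb-Key came back, until the server answers 404.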
Prefix Matching ¶
In addition to cursor iteration, 0.1.2 added the ability to return the keys that match a given prefix. Use of this feature follows the same URL layout as its predecessors, mainly via the use of the _match qualifier.
For example, say I have three keys in the database, test_a, test_b and test_c. I can easily find these keys in one operation by using the _match operand. To demonstrate:
$ curl -i localhost:8080/oleg/test/_match
HTTP/1.1 200 OK
Date: Sun, 28 Sep 2014 07:26:35 GMT
Content-Length: 20
Content-Type: text/plain; charset=utf-8
test_a
test_b
test_c
This returns a list of all the matching keys separated by \n. Also of note is the X-Olegdb-Num-Matches header, which specifies the number of keys that matched the given prefix.
If no matches are present, a 404 is returned.
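To make the matching rule concrete, here is a rough sketch of prefix matching as a linear scan over an array of keys. It is not how the server does it (the server walks its tree of ordered keys), and match_prefix is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Store pointers to every key that starts with `prefix` in `out`
 * (at most `max` of them) and return the total number of matches.
 * A return of 0 corresponds to the 404 case. */
size_t match_prefix(const char *const *keys, size_t nkeys,
                    const char *prefix, const char **out, size_t max)
{
    size_t plen, i, found = 0;
    assert(keys != NULL && prefix != NULL && out != NULL);
    plen = strlen(prefix);
    for (i = 0; i < nkeys; i++) {
        if (strncmp(keys[i], prefix, plen) == 0) {
            if (found < max)
                out[found] = keys[i];
            found++;
        }
    }
    return found;
}
```

The return value plays the role of X-Olegdb-Num-Matches, and joining the matched keys with \n gives the response body shown above.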
Similar to prefix matching, you can also just dump the entire keyspace using _all. Keep in mind, however, that this can be an expensive operation.
$ curl -i localhost:38080/waifu/_all | head
HTTP/1.1 200 OK
Content-Length: 27863
X-Olegdb-Num-Matches: 401
Date: Sun, 11 Jan 2015 21:26:08 GMT
Content-Type: text/plain; charset=utf-8
alias50B224D2C7987CE4F51E9258707758841771C82E9A0D3395C849426F6E93B8A85FE94AB42A00845C
alias70170858147E2B26DD5370D9F97113E0D7FDA993A707D5B0304272E93BA9A031372339E4C8F94AA2
alias70170858147E2B26DD5370D9F97113E0D7FDA993A707D5B0304272E93BA9A031383CEF2534DF870A
alias70170858147E2B26DD5370D9F97113E0D7FDA993A707D5B0304272E93BA9A031717594C273021004
...
Technical Internals ¶
Hash Table ¶
At its core, OlegDB is just a hashtable. On a good day, this means operations are O(1). Since we use linked lists to handle collisions, the worst-case scenario for operations is O(n). This usually doesn't happen.
Rehashing happens when we run out of space. To handle this, we currently allocate a new block of memory, rehash all keys and move everything over. This is a blocking operation. If you know you're going to have a lot of keys and want to avoid this, you can tweak the HASH_MALLOC parameter before compilation. This controls the default amount of space that OlegDB will allocate.
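The two ideas above, chaining on collision and a stop-the-world rehash, fit in a short sketch. This is not liboleg's code: FNV-1a stands in for Murmur3, values are plain ints, the growth trigger (one key per slot) is an assumption, and every name here is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define KEY_SIZE 250

typedef struct entry {
    char key[KEY_SIZE];
    int value;
    struct entry *next;      /* next entry in this slot's collision chain */
} entry;

typedef struct {
    entry **slots;
    size_t nslots;           /* number of chains */
    size_t count;            /* number of stored keys */
} table;

/* FNV-1a, a stand-in for Murmur3 to keep the sketch short. */
static uint32_t hash_key(const char *key)
{
    uint32_t h = 2166136261u;
    for (; *key; key++) {
        h ^= (unsigned char)*key;
        h *= 16777619u;
    }
    return h;
}

int table_init(table *t, size_t nslots)
{
    assert(t != NULL && nslots > 0);
    t->slots = calloc(nslots, sizeof *t->slots);
    t->nslots = nslots;
    t->count = 0;
    return t->slots ? 0 : 1;
}

/* Rehash: allocate a larger slot array and relink every entry into it.
 * Nothing else can use the table while this runs. */
static int table_grow(table *t)
{
    size_t i, nslots = t->nslots * 2;
    entry **slots = calloc(nslots, sizeof *slots);
    if (slots == NULL)
        return 1;
    for (i = 0; i < t->nslots; i++) {
        entry *e = t->slots[i];
        while (e != NULL) {
            entry *next = e->next;
            size_t idx = hash_key(e->key) % nslots;
            e->next = slots[idx];
            slots[idx] = e;
            e = next;
        }
    }
    free(t->slots);
    t->slots = slots;
    t->nslots = nslots;
    return 0;
}

static entry *find(const table *t, const char *key)
{
    entry *e = t->slots[hash_key(key) % t->nslots];
    while (e != NULL && strcmp(e->key, key) != 0)
        e = e->next;         /* walk the chain: O(n) in the worst case */
    return e;
}

int table_put(table *t, const char *key, int value)
{
    size_t idx;
    entry *e;
    if (strlen(key) >= KEY_SIZE)
        return 1;
    e = find(t, key);
    if (e != NULL) {         /* existing key: update in place */
        e->value = value;
        return 0;
    }
    if (t->count >= t->nslots && table_grow(t) != 0)
        return 1;
    e = malloc(sizeof *e);
    if (e == NULL)
        return 1;
    strcpy(e->key, key);
    e->value = value;
    idx = hash_key(key) % t->nslots;
    e->next = t->slots[idx]; /* push onto the front of the chain */
    t->slots[idx] = e;
    t->count++;
    return 0;
}

int table_get(const table *t, const char *key, int *value)
{
    const entry *e = find(t, key);
    if (e == NULL)
        return 1;
    *value = e->value;
    return 0;
}
```

Starting with a larger slot count in table_init is the sketch's equivalent of raising HASH_MALLOC: fewer rehashes, at the cost of more memory up front. Cleanup is left out for brevity.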
Splay Trees ¶
In addition to the hash table, OlegDB also keeps track of currently inserted nodes via a splay tree. Going over the intricacies of splay trees is a little outside the scope of this documentation, but we do use it for several things and for several reasons.
In their simplified form, splay trees are just a specialized form of a self-balancing binary search tree. This means that searching for any given key in the tree is an O(log n) operation and can be done relatively quickly.
In addition to being a binary tree, a splay tree has the property of moving recently inserted keys to the top of the tree. This is known as a splaying operation. While some splay tree implementations splay on read, write and deletion, OlegDB only splays keys to the top of the tree upon insertion and deletion. We figured that, since the splay tree is at most a secondary structure in the Oleg ecosystem, we wanted its impact to be minimal.
With splay trees installed, we can now iterate through the tree in a timely and efficient manner, and what's more, in a user-decided order.
Binary trees have several modes of traversal that can be useful in a database context. Traversing the tree in-order gives the user the ability to retrieve keys alphabetically, while traversing in a pre-ordered fashion will show the user when the keys were inserted.
Besides key-traversal, splay trees are used for prefix matching. Since binary trees are inherently sorted, we can iterate through one much faster than we could a list.
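The reason an ordered tree gives you sorted keys for free is the in-order walk. Below is a small sketch using a plain, unbalanced binary search tree. It does no splaying at all, and the names are invented for the example rather than taken from liboleg.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct node {
    const char *key;              /* borrowed pointer, not copied */
    struct node *left, *right;
} node;

/* Plain BST insert: no splaying, no rebalancing. Returns the subtree
 * root. Duplicate keys are ignored. */
node *tree_insert(node *root, const char *key)
{
    int cmp;
    assert(key != NULL);
    if (root == NULL) {
        node *n = malloc(sizeof *n);
        if (n != NULL) {
            n->key = key;
            n->left = n->right = NULL;
        }
        return n;
    }
    cmp = strcmp(key, root->key);
    if (cmp < 0)
        root->left = tree_insert(root->left, key);
    else if (cmp > 0)
        root->right = tree_insert(root->right, key);
    return root;
}

/* In-order walk: left subtree, this node, right subtree. Writes keys to
 * `out` in sorted order starting at index `pos`; returns the next index. */
size_t tree_in_order(const node *root, const char **out, size_t pos)
{
    if (root == NULL)
        return pos;
    pos = tree_in_order(root->left, out, pos);
    out[pos++] = root->key;
    return tree_in_order(root->right, out, pos);
}

void tree_free(node *root)
{
    if (root == NULL)
        return;
    tree_free(root->left);
    tree_free(root->right);
    free(root);
}
```

Whatever order the keys go in, tree_in_order hands them back alphabetically, which is exactly what cursor iteration and prefix matching lean on.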
Splay trees can be turned on/off by changing how you open a database. See ol_feature_flags for a complete list of toggleable parameters.
LZ4 Compression ¶
OlegDB uses the super-fast LZ4 compression algorithm to keep values on disk small while maintaining a low impact on insertion/deletion.
This is a toggleable feature. See ol_feature_flags for more detail.
AOL File ¶
The Append Only Log file is how Oleg keeps track of state outside of its values files. Every time a change occurs to OlegDB, that command is written to the AOL file. This is what allows OlegDB to be persistent.
Every now and then the AOL file needs to be squished (compacted) to remove old and expired data. The AOL file is designed to be human readable (mostly) so you can tell at a glance what's going on with your database.
Values File ¶
The values file augments the AOL file in persisting state to the disk. The values file is basically all of your data, more or less aligned in four megabyte blocks.
Starting in 0.1.2, this is how we store data on disk. Previously all values were stored in the AOL file. Now the values file is mmap()'d into RAM on database startup, allowing you to hold datasets bigger than memory.
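For readers who haven't used mmap() before, this is roughly what mapping a values file looks like on a POSIX system. It is a sketch under assumptions, not liboleg's implementation: map_values and unmap_values are hypothetical names, and the file is simply grown to a fixed size before mapping.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define VALUES_DEFAULT_SIZE 4194304   /* 4 MB, same value as liboleg's macro */

/* Open (or create) a values file, grow it to `size` bytes and map it.
 * Returns the mapping, or NULL on failure. *fd_out gets the descriptor. */
unsigned char *map_values(const char *path, size_t size, int *fd_out)
{
    void *mem;
    int fd;
    assert(path != NULL && fd_out != NULL);
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }
    /* MAP_SHARED so stores to the mapping end up in the file itself. */
    mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return mem;
}

/* Flush dirty pages, unmap and close. Returns 0 on success. */
int unmap_values(unsigned char *values, size_t size, int fd)
{
    int rc = msync(values, size, MS_SYNC);
    rc |= munmap(values, size);
    rc |= close(fd);
    return rc ? 1 : 0;
}
```

Because the kernel pages the file in and out on demand, a value is read by indexing into the returned pointer at its offset, and the mapped file can be larger than physical RAM.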
liboleg
Macros ¶
VERSION ¶
#define VERSION "0.1.5"
The current version of OlegDB.
KEY_SIZE ¶
#define KEY_SIZE 250
The hardcoded upper bound for key lengths.
HASH_MALLOC ¶
#define HASH_MALLOC 65536
The size, in bytes, to allocate when initially creating the database. ol_bucket pointers are stored here.
PATH_LENGTH ¶
#define PATH_LENGTH 256
The maximum length of a database's path.
DB_NAME_SIZE ¶
#define DB_NAME_SIZE 64
Database maximum name length.
DEVILS_SEED ¶
#define DEVILS_SEED 666
The seed to feed into the murmur3 algorithm.
VALUES_FILENAME ¶
#define VALUES_FILENAME "val"
The file extension used for the values file on disk.
VALUES_DEFAULT_SIZE ¶
#define VALUES_DEFAULT_SIZE 4194304
The default size of the values file on disk. 4 MB by default.
AOL_FILENAME_ALLOC ¶
#define AOL_FILENAME_ALLOC 512
The number of bytes we allocate for the filename of the AOL file.
AOL_FILENAME ¶
#define AOL_FILENAME "aol"
The file extension used for the AOL file.
Type Definitions ¶
ol_key_array ¶
typedef char **ol_key_array;
This is shorthand for a pointer to an array of keys, the same kind of key stored in an ol_bucket's key[KEY_SIZE].
Enums ¶
ol_feature_flags ¶
typedef enum {
    OL_F_APPENDONLY = 1 << 0,
    OL_F_SPLAYTREE = 1 << 1,
    OL_F_LZ4 = 1 << 2,
    OL_F_AOL_FFLUSH = 1 << 3,
    OL_F_DISABLE_TX = 1 << 4
} ol_feature_flags;
Feature flags tell the database what it should be doing.
OL_F_APPENDONLY: Enable the append only log. This is a write-only logfile for simple persistence.
OL_F_SPLAYTREE: Whether or not to enable the splay tree in the server. This can have a performance impact.
OL_F_LZ4: Enable LZ4 compression.
OL_F_AOL_FFLUSH: Make sure AOL data is REAAAALLY written to disk. This will run fflush after every AOL write. Otherwise, fsync only.
OL_F_DISABLE_TX: Disable transactions on this database.
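Since each flag is a distinct bit, a feature set is just the flags OR'd together. The helpers below sketch how enabling, disabling and testing a flag could work on such a bitmask. Their signatures mirror the function pointers in ol_database, but the bodies are an assumption, not code lifted from liboleg.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    OL_F_APPENDONLY = 1 << 0,
    OL_F_SPLAYTREE  = 1 << 1,
    OL_F_LZ4        = 1 << 2,
    OL_F_AOL_FFLUSH = 1 << 3,
    OL_F_DISABLE_TX = 1 << 4
} ol_feature_flags;

/* Set the bit for `feature` in the bitmask. */
void enable(int feature, int *feature_set)
{
    assert(feature_set != NULL);
    *feature_set |= feature;
}

/* Clear the bit for `feature`, leaving the others untouched. */
void disable(int feature, int *feature_set)
{
    assert(feature_set != NULL);
    *feature_set &= ~feature;
}

/* True if the bit for `feature` is set. */
bool is_enabled(const int feature, const int *feature_set)
{
    assert(feature_set != NULL);
    return (*feature_set & feature) != 0;
}
```

For instance, OL_F_APPENDONLY | OL_F_SPLAYTREE is the value 3, and is the kind of thing you would pass as the features argument when opening a database.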
ol_state_flags ¶
typedef enum {
    OL_S_STARTUP = 0,
    OL_S_AOKAY = 1,
    OL_S_COMMITTING = 2
} ol_state_flags;
State flags tell the database what it should be doing.
OL_S_STARTUP: Startup state. The DB is starting, duh.
OL_S_AOKAY: The normal operating state, the database is a-okay
OL_S_COMMITTING: The database is committing a transaction. It doesn't want to do anything else.
Structures ¶
ol_bucket ¶
typedef struct ol_bucket {
    char key[KEY_SIZE]; /* The key used to reference the data */
    size_t klen;
    size_t data_offset;
    size_t data_size;
    size_t original_size;
    struct ol_bucket *next; /* The next ol_bucket in this chain, if any */
    struct tm *expiration;
    ol_splay_tree_node *node;
    int tx_id;
} ol_bucket;
This is the object stored in the database's hashtable. Contains references to value, key, etc.
key[KEY_SIZE]: The key used for this bucket.
klen: Length of the key.
data_offset: Location of this key's value (data) in the values file on disk.
data_size: Length of the value (data) in bytes. This is the size of the data stored in memory.
original_size: Length of the value (data) in bytes. This is the original length of the data we received, non-compressed.
next: Collisions are resolved via linked list. This contains the pointer to the next object in the chain, or NULL.
expiration: The POSIX timestamp when this key will expire.
*node: A pointer to this object's node in the splay tree.
tx_id: If this record is a part of a transaction, then this will be a non-negative integer.
ol_meta ¶
typedef struct ol_meta {
    time_t created;
    int key_collisions;
} ol_meta;
Structure used to record meta-information about the database.
created: When the database was created.
key_collisions: The number of keys that have collided over the lifetime of this database.
ol_database ¶
typedef struct ol_database {
    void (*get_db_file_name)(const struct ol_database *db, const char *p, char*);
    void (*enable)(int, int*);
    void (*disable)(int, int*);
    bool (*is_enabled)(const int, const int*);
    char name[DB_NAME_SIZE];
    char path[PATH_LENGTH];
    char aol_file[AOL_FILENAME_ALLOC];
    FILE *aolfd;
    int feature_set;
    short int state;
    int rcrd_cnt;
    size_t cur_ht_size;
    ol_bucket **hashes;
    unsigned char *values;
    int valuesfd;
    size_t val_size;
    ol_splay_tree *tree;
    ol_splay_tree *cur_transactions;
    ol_meta *meta;
} ol_database;
The object representing a database. This is used in almost every ol_* function to store state and your data.
get_db_file_name: A function pointer that returns the path to the location of the db file to reduce code duplication. Used for writing and reading of dump files.
enable: Helper function to enable a feature for the database instance passed in. See ol_feature_flags
disable: Helper function to disable a database feature. See ol_feature_flags
is_enabled: Helper function that checks whether or not a feature flag is enabled.
name: The name of the database.
path[PATH_LENGTH]: Path to the database's working directory.
aol_file: Path and filename of the append only log.
aolfd: Pointer of FILE type to append only log.
feature_set: Bitmask holding enabled/disabled status of various features. See ol_feature_flags.
state: Current state of the database. See ol_state_flags.
rcrd_cnt: Number of records in the database.
key_collisions: Number of key collisions this database has had since initialization. Stored in meta; see ol_meta.
created: Timestamp of when the database was initialized. Stored in meta; see ol_meta.
cur_ht_size: The current amount, in bytes, of space allocated for storing ol_bucket objects.
**hashes: The actual hashtable. Stores ol_bucket instances.
*values: This is where values for hashes are stored. This is a pointer to an mmap()'d region of memory.
valuesfd: The file descriptor of the values file.
val_size: The size of the sum total of records in the db, in bytes. It is not the size of the file on disk.
*tree: A pointer to the splay tree holding the ordered list of keys.
*cur_transactions: The current open/uncommitted transactions. Represented with a splay tree.
*meta: A pointer to a struct holding extra meta information. See ol_meta for more information.
Functions ¶
ol_open ¶
ol_database *ol_open(const char *path, const char *name, int features);
Opens a database for use.
*path: The directory where the database will be stored.
*name: The name of the database. This is used to create the dumpfile, and keep track of the database.
features: Features to enable when the database is initialized. You can logically OR multiple features together.
Returns: A new database object. NULL on failure.
ol_close ¶
int ol_close(ol_database *database);
Closes a database cleanly and frees memory.
*database: The database to close.
Returns: 0 on success, 1 if not everything could be freed.
ol_unjar ¶
int ol_unjar(ol_database *db, const char *key, size_t klen, unsigned char **data, size_t *dsize);
This function retrieves a value from the database. data must be freed after calling this function! It also writes the size of the data to dsize. Pass dsize as NULL if you don't care.
*db: Database to retrieve value from.
*key: The key to use.
klen: The length of the key to use.
data: This parameter will be filled out with the data found in the DB. Passing NULL will check if a key exists.
*dsize: Optional parameter that will be filled out with the size of the data, if NULL is not passed in.
Returns: 0 on success, 1 on failure or if the key was not found.
ol_jar ¶
int ol_jar(ol_database *db, const char *key, size_t klen, const unsigned char *value, size_t vsize);
This is OlegDB's canonical 'set' function. Put a value into the mayo (the database). It's easy to piss in a bucket, it's not easy to piss in 19 jars.
*db: Database to set the value to.
*key: The key to use.
klen: The length of the key.
*value: The value to insert.
vsize: The size of the value in bytes.
Returns: 0 on success.
ol_expiration_time ¶
struct tm *ol_expiration_time(ol_database *db, const char *key, size_t klen);
Retrieves the expiration time for a given key from the database.
*db: Database to retrieve the value from.
*key: The key to use.
klen: The length of the key.
Returns: Stored struct tm * representing the time that this key will expire, or NULL if not found.
ol_scoop ¶
int ol_scoop(ol_database *db, const char *key, size_t klen);
Removes an object from the database. Get that crap out of the mayo jar.
*db: Database to remove the value from.
*key: The key to use.
klen: The length of the key.
Returns: 0 on success, and 1 or 2 if the object could not be deleted.
ol_uptime ¶
int ol_uptime(ol_database *db);
Gets the time, in seconds, that a database has been up.
*db: Database to retrieve value from.
Returns: Uptime in seconds since database initialization.
ol_spoil ¶
int ol_spoil(ol_database *db, const char *key, size_t klen, struct tm *expiration_date);
Sets the expiration value of a key. Will fail if no ol_bucket under the chosen key exists.
*db: Database to set the value to.
*key: The key to use.
klen: The length of the key.
expiration_date: The UTC time to set the expiration to.
Returns: 0 upon success, 1 if otherwise.
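The expiration_date argument is a broken-down UTC time, so a caller holding a POSIX timestamp or a time-to-live has to convert first. A minimal sketch of that conversion using POSIX gmtime_r follows. expiration_from_ttl is a hypothetical helper, not part of liboleg.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Fill `out` with the UTC broken-down time ttl_seconds after `now`.
 * Returns 0 on success, 1 on failure. */
int expiration_from_ttl(time_t now, long ttl_seconds, struct tm *out)
{
    time_t when = now + (time_t)ttl_seconds;
    assert(out != NULL);
    return gmtime_r(&when, out) != NULL ? 0 : 1;
}
```

The resulting struct tm is the sort of value you would then hand to ol_spoil as expiration_date. Using gmtime_r rather than localtime_r keeps the result in UTC regardless of the machine's timezone.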
ol_ht_bucket_max ¶
int ol_ht_bucket_max(size_t ht_size);
Does some sizeof witchery to return the maximum current size of the database. This is mostly an internal function, exposed to reduce code duplication.
ht_size: The size you want to divide by sizeof(ol_bucket).
Returns: The maximum possible bucket slots for db.
ol_prefix_match ¶
int ol_prefix_match(ol_database *db, const char *prefix, size_t plen, ol_key_array *data);
Returns values of keys that match a given prefix.
*db: Database to retrieve values from.
*prefix: The prefix to attempt matches on.
plen: The length of the prefix.
*data: A pointer to an ol_key_array object where the list of matching keys will be stored. Both the list and its items must be freed after use.
Returns: -1 on failure, or a positive integer representing the number of matching keys in the database.
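The "free both the list and its items" rule is easy to get wrong, so here is a sketch of the ownership layout and the matching cleanup. copy_keys only stands in for a filled-in result so the example is self-contained, and both function names are hypothetical. The typedef is the same one liboleg uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef char **ol_key_array;   /* same typedef as liboleg */

/* Stand-in for a filled result: heap-allocates the list and a copy of
 * each key. Returns NULL on allocation failure. */
ol_key_array copy_keys(const char *const *keys, int count)
{
    int i;
    ol_key_array out;
    assert(keys != NULL && count >= 0);
    out = malloc(sizeof *out * (size_t)(count > 0 ? count : 1));
    if (out == NULL)
        return NULL;
    for (i = 0; i < count; i++) {
        size_t len = strlen(keys[i]) + 1;
        out[i] = malloc(len);
        if (out[i] == NULL) {
            while (i-- > 0)      /* undo the copies made so far */
                free(out[i]);
            free(out);
            return NULL;
        }
        memcpy(out[i], keys[i], len);
    }
    return out;
}

/* Free every item first, then the list itself. */
void free_key_array(ol_key_array keys, int count)
{
    int i;
    if (keys == NULL)
        return;
    for (i = 0; i < count; i++)
        free(keys[i]);
    free(keys);
}
```

With the real library, count would be the non-negative return value of the matching call.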
ol_key_dump ¶
int ol_key_dump(ol_database *db, ol_key_array *data);
Like ol_prefix_match, except that it takes no prefix and just dumps the entire tree.
*db: Database to retrieve values from.
*data: A pointer to an ol_key_array object where the list of keys will be stored. Both the list and its items must be freed after use.
Returns: -1 on failure, or a positive integer representing the number of keys in the database.
ol_exists ¶
int ol_exists(ol_database *db, const char *key, size_t klen);
Returns whether the given key exists in the database.
*db: Database the key should be in.
*key: The key to check.
klen: The length of the key.
Returns: 0 if the key exists, 1 otherwise.
ol_get_bucket ¶
ol_bucket *ol_get_bucket(const ol_database *db, const char *key, const size_t klen,
char (*_key)[KEY_SIZE], size_t *_klen);
Utility function to retrieve an ol_bucket object from the database for a given key.
*db: Database the bucket should be in.
*key: The key to check.
klen: The length of the key.
**_key: The truncated key will be filled in at this address.
*_klen: The length of the truncated key.
Returns: The bucket if it exists, otherwise NULL.
ol_squish ¶
int ol_squish(ol_database *db);
Compacts both the AOL file (if enabled) and the values file. This is a blocking operation.
*db: The database to compact.
Returns: 0 if successful, 1 if otherwise.
ol_cas ¶
int ol_cas(ol_database *db, const char *key, const size_t klen,
unsigned char *value, size_t vsize,
const unsigned char *ovalue, const size_t ovsize);
ol_jar operation that atomically compares-and-swaps old data for new data.
*db: The database to operate on.
*key: The key to check.
klen: The length of the key.
*value: The value to insert.
vsize: The size of the value in bytes.
*ovalue: The old value to compare against.
ovsize: The size of the old value in bytes.
Returns: 0 on success.
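Compare-and-swap means "write the new value only if the old value is still what I think it is". The sketch below shows just that rule on an in-memory slot. It is single-threaded and therefore not atomic, the slot type and slot_cas name are invented, and returning 1 on a failed comparison is an assumption, since only the success value is documented above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned char *data;   /* heap-allocated current value */
    size_t size;
} slot;

/* Replace the slot's contents with `value` only if they currently equal
 * `ovalue`. Returns 0 on success, 1 if the comparison or allocation
 * fails. A real implementation must make the compare and the swap one
 * atomic step; this sketch does not. */
int slot_cas(slot *s, const unsigned char *value, size_t vsize,
             const unsigned char *ovalue, size_t ovsize)
{
    unsigned char *copy;
    assert(s != NULL && value != NULL && ovalue != NULL);
    if (s->size != ovsize)
        return 1;
    if (ovsize != 0 && memcmp(s->data, ovalue, ovsize) != 0)
        return 1;                      /* someone changed it first */
    copy = malloc(vsize ? vsize : 1);
    if (copy == NULL)
        return 1;
    memcpy(copy, value, vsize);
    free(s->data);
    s->data = copy;
    s->size = vsize;
    return 0;
}
```

The usual pattern is a retry loop: read the current value, compute the new one, attempt the swap, and start over if the swap reports that the value changed underneath you.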