TrailDB is a portable C library for querying series of related events. It groups existing events into per-entity time series and produces an immutable, highly compressed database. It is designed to complement existing relational databases and key-value stores, and it targets OLAP workloads such as analyzing usage patterns, predicting user behavior, and detecting anomalies. A key design decision is that a database is immutable once produced. Immutability in turn enables the second key feature, data compression: TrailDB exploits the similarity between consecutive events in a time series to achieve high compression. Together, these two features give TrailDB good performance on OLAP workloads.
TrailDB was developed at AdRoll to handle user-level analytics. AdRoll's developers saw a growing number of requests for user-level analytics but found it hard to understand and analyze large volumes of user-level data with advanced SQL queries. They therefore built TrailDB to query series of events at the user level. TrailDB 0.5 was the first version and was announced as open source on May 24, 2016. TrailDB 0.6 was released on May 15, 2017; it introduced **indexes** and a new optimized query tool, **trck**.
TrailDB does not support checkpoints because each database is immutable once produced.
TrailDB combines three compression techniques. First, events within a trail are always sorted by time, so TrailDB uses delta encoding to compress the 64-bit timestamps. Second, events are grouped by UUID, which usually represents a logical entity such as an online shopping customer, so the events within a trail tend to be predictable and TrailDB encodes only the changes in behavior from one event to the next. This is similar to, but not exactly the same as, run-length encoding. Third, Huffman coding, a prefix-code method, is used to encode the skewed, low-entropy distributions of values.
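The first two techniques can be illustrated with a minimal Python sketch. It is not TrailDB's actual on-disk format: the function names are hypothetical, events are plain dictionaries, and every event is assumed to carry a value for every field.

```python
from typing import Dict, List


def delta_encode(timestamps: List[int]) -> List[int]:
    """Store the first timestamp as-is, then only the gap to the previous one."""
    deltas = []
    previous = 0
    for ts in timestamps:
        deltas.append(ts - previous)
        previous = ts
    return deltas


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild absolute timestamps with a running sum."""
    timestamps = []
    current = 0
    for delta in deltas:
        current += delta
        timestamps.append(current)
    return timestamps


def edge_encode(events: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """For each event, keep only the fields whose value differs from the previous event."""
    encoded = []
    previous: Dict[str, str] = {}
    for fields in events:
        changed = {name: value for name, value in fields.items()
                   if previous.get(name) != value}
        encoded.append(changed)
        previous = fields
    return encoded


def edge_decode(encoded: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Replay the recorded changes on top of the previous event's values."""
    decoded = []
    current: Dict[str, str] = {}
    for changed in encoded:
        current = {**current, **changed}
        decoded.append(current)
    return decoded


# Illustrative trail: one shopper, timestamps already sorted (made-up data).
timestamps = [1490000000, 1490000003, 1490000003, 1490000060]
events = [
    {"action": "view", "device": "mobile"},
    {"action": "view", "device": "mobile"},
    {"action": "cart", "device": "mobile"},
    {"action": "buy", "device": "mobile"},
]
small_deltas = delta_encode(timestamps)   # [1490000000, 3, 0, 57]
changes_only = edge_encode(events)        # the repeated "device" value is stored once
```

Because sorted timestamps yield small non-negative gaps and repeated field values vanish from the change list, the remaining values are few and skewed, which is the kind of distribution an entropy coder such as Huffman coding handles well.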
Because each TrailDB is an immutable file, modifications are not allowed. A single process produces the database, and no reads can be issued before creation is finalized. Concurrency control is therefore unnecessary in TrailDB.
TrailDB adopts a specialized relational data model. A traditional relational record consists of a key and a set of attributes. A TrailDB record consists of a key called the **UUID** and a list of objects, each holding values for a set of pre-defined fields. TrailDB defines a **trail**, which is uniquely identified by a **UUID**. Each **trail** contains an ordered list of **events**, each identified and ordered by its **timestamp**. Each **event** contains values for the pre-defined set of **fields**, which are similar to attributes in the traditional relational model. This model groups the related events that belong to one **UUID** (for example, one online shopping user) together in time order. As a result, consecutive events are predictable, which lets TrailDB apply several compression methods to achieve a high compression rate and fast extraction.
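The data model described above can be sketched in a few lines of Python. This is only an illustration: the class and function names are hypothetical, UUIDs are shown as short strings for readability, and a missing field value is assumed to be stored as an empty string.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class Event:
    timestamp: int
    values: Tuple[str, ...]  # one value per pre-defined field, in field order


@dataclass
class Trail:
    uuid: str
    events: List[Event] = field(default_factory=list)


def build_trails(
    fields: Sequence[str],
    rows: Iterable[Tuple[str, int, Dict[str, str]]],
) -> Dict[str, Trail]:
    """Group (uuid, timestamp, {field: value}) rows into trails sorted by timestamp."""
    trails: Dict[str, Trail] = {}
    for uuid, timestamp, row in rows:
        # Assumption: a field absent from the row becomes an empty string.
        values = tuple(row.get(name, "") for name in fields)
        trails.setdefault(uuid, Trail(uuid)).events.append(Event(timestamp, values))
    for trail in trails.values():
        trail.events.sort(key=lambda event: event.timestamp)
    return trails


# Rows may arrive in any order; each trail ends up ordered by time.
rows = [
    ("u1", 20, {"action": "buy"}),
    ("u2", 5, {"action": "view", "device": "web"}),
    ("u1", 10, {"action": "view", "device": "mobile"}),
]
trails = build_trails(["action", "device"], rows)
```

Keeping one ordered event list per UUID is what places similar events next to each other, so the grouping step itself is what makes the later compression effective.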
In TrailDB, each database consists of a collection of **trails**, each identified by a unique UUID. A database does not contain multiple tables, and there are no constraints across databases, so foreign keys are not supported.
Indexes were introduced in TrailDB 0.6. TrailDB uses an inverted index that maps each **item**, uniquely identified by a **field** and a value in that **field**, to the list of ids of the pages that contain it. The index file contains a **HEADER** and a **FIELD SECTION**. For a lookup, TrailDB first reads the **HEADER** to find the offset at which the field's data begins in the **FIELD SECTION**. It then locates the corresponding **item** and extracts the ids of the pages containing it.
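The item-to-pages mapping can be sketched in memory as follows. This is a hedged illustration rather than the real index file layout: the function names are hypothetical, and a "page" is assumed here to be a fixed-size group of consecutive trails.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Item = Tuple[str, str]  # (field, value)


def build_index(trails: List[List[Dict[str, str]]], page_size: int) -> Dict[Item, List[int]]:
    """Map every (field, value) item to the sorted ids of the pages that contain it."""
    index = defaultdict(set)
    for trail_id, events in enumerate(trails):
        page_id = trail_id // page_size  # assumed paging scheme
        for event in events:
            for item in event.items():
                index[item].add(page_id)
    return {item: sorted(pages) for item, pages in index.items()}


def candidate_pages(index: Dict[Item, List[int]], field: str, value: str) -> List[int]:
    """Return the pages worth scanning for an item; all other pages can be skipped."""
    return index.get((field, value), [])


# Four trails, two trails per page (made-up data).
trails = [
    [{"action": "view"}],
    [{"action": "view"}],
    [{"action": "view"}, {"action": "buy"}],
    [{"action": "view"}],
]
index = build_index(trails, page_size=2)
```

A query for a rare item such as `("action", "buy")` then only needs to decode the pages listed for it instead of scanning every trail.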
Only one process creates a database, and no other process can access it during creation. Once produced, the database is a read-only immutable file: anyone can read it, but nobody can write to it. From this point of view, it is equivalent to the serializable isolation level.
TrailDB offers APIs for a join operation over multiple **trails**. Because the **events** within each **trail** are already sorted by timestamp, TrailDB performs the merge step of merge sort over the **trails** to produce a single merged **trail** with a sorted list of **events**.
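The idea of merging already-sorted trails can be sketched with Python's standard `heapq.merge`. This illustrates the technique only, not TrailDB's implementation; `merge_trails` is a hypothetical helper and events are modeled as `(timestamp, payload)` tuples.

```python
import heapq
from typing import List, Tuple

Event = Tuple[int, str]  # (timestamp, payload)


def merge_trails(*trails: List[Event]) -> List[Event]:
    """K-way merge of trails that are each already sorted by timestamp."""
    # heapq.merge never re-sorts its inputs; it only compares the current
    # head of each trail, so the work stays proportional to the output size.
    return list(heapq.merge(*trails, key=lambda event: event[0]))


web = [(1, "view"), (5, "buy")]
mobile = [(2, "click"), (5, "refund"), (9, "view")]
merged = merge_trails(web, mobile)
```

When two events share a timestamp, `heapq.merge` keeps the one from the earlier input first, so the result is deterministic for a given argument order.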
TrailDB does not support logging, and only one process creates the database. There is no recovery handler if that process crashes during creation, so users must restart the build from the very beginning. However, TrailDB allows existing TrailDBs to be merged into a new immutable database, which is the suggested approach when the number of input events is very large.
TrailDB offers custom APIs that let users query **events** with cursors. A query can emit one **event** at a time with a single cursor, or multiple **events** with multiple cursors. Dedicated "next" functions move the cursor(s) to the next event(s) in the trail(s). The current version offers three of them: [tdb\_cursor\_next](http://traildb.io/docs/api/#tdb_cursor_next), [tdb\_multi\_cursor\_next](http://traildb.io/docs/api/#tdb_multi_cursor_next), and [tdb\_multi\_cursor\_next\_batch](http://traildb.io/docs/api/#tdb_multi_cursor_next_batch).
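The cursor pattern can be mimicked in plain Python to show how the three "next" operations relate. The `Cursor` and `MultiCursor` classes below are hypothetical stand-ins, not the TrailDB C API or its Python binding, and returning `None` at the end of a trail is an assumption of this sketch.

```python
from typing import List, Optional, Tuple

Event = Tuple[int, str]  # (timestamp, payload)


class Cursor:
    """Walks one trail, emitting one event per call to next()."""

    def __init__(self, events: List[Event]) -> None:
        self._events = events
        self._pos = 0

    def peek(self) -> Optional[Event]:
        return self._events[self._pos] if self._pos < len(self._events) else None

    def next(self) -> Optional[Event]:
        event = self.peek()
        if event is not None:
            self._pos += 1
        return event


class MultiCursor:
    """Emits events from several cursors in global timestamp order."""

    def __init__(self, cursors: List[Cursor]) -> None:
        self._cursors = cursors

    def next(self) -> Optional[Event]:
        best: Optional[Cursor] = None
        for cursor in self._cursors:
            event = cursor.peek()
            # Pick the cursor whose pending event has the smallest timestamp.
            if event is not None and (best is None or event[0] < best.peek()[0]):
                best = cursor
        return best.next() if best is not None else None

    def next_batch(self, max_events: int) -> List[Event]:
        """Return up to max_events events; a shorter list means the trails are exhausted."""
        batch: List[Event] = []
        while len(batch) < max_events:
            event = self.next()
            if event is None:
                break
            batch.append(event)
        return batch


multi = MultiCursor([
    Cursor([(1, "a1"), (4, "a2")]),
    Cursor([(2, "b1"), (3, "b2"), (9, "b3")]),
])
first = multi.next()             # (1, "a1")
batch = multi.next_batch(3)      # the next three events in timestamp order
```

Fetching events in batches amortizes the per-call overhead, which matters most when the caller sits behind a language binding.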
TrailDB offers its own custom APIs instead of a standard SQL interface. It is designed around a specific organization of user-level events that is not a natural fit for SQL. Instead, it offers query interfaces in several programming languages: C, Go, Python, R, Haskell, and D. TrailDB also provides two dedicated query languages. One is **trck**, a domain-specific language for aggregating metrics over the events that share a **UUID**. The other is **Reel**, a small general-purpose query language for TrailDB.
Because each TrailDB is a read-only immutable file, it does not support stored procedures.
TrailDB adopts an embedded database architecture. Each database it creates is an immutable file, so anyone can keep a copy and access it through the custom API or the dedicated query languages from a standalone application. The database needs no administrator.
TrailDB does not support views. However, because each database is an immutable file, users can create "views" by extracting data from existing TrailDBs into another immutable database.
Commercial, Open Source
C, D, Go, Haskell, Python, R