Heroic

Heroic is an open-source time-series DBMS built at Spotify.

History

Heroic grew out of Spotify's need for a database that could store data at greater scale for its developers. The design is somewhat derivative, as it builds on and integrates features from Elasticsearch, Cassandra and Bigtable. Heroic has gone through multiple releases, the most significant of which are documented on its website. The first GitHub release occurred on April 9th, 2014. Among the more recent releases, the one on April 30th, 2019 added the “dynamic Elasticsearch lookup” feature and updated the metrics that Heroic uses to quantify semantics so that they rely on counters.

With regard to milestones, milestone 1.0 of Heroic enabled use with Java 11 and added analytics features that the Heroic developers can use to track how the database is deployed and used. Another major milestone, “advanced query usability,” let users issue more complex queries from Java and now Python as well, and provided “better support for querying with a non-JSON DSL.” New query features such as aggregations were also made available to users in this milestone. Lastly, one of the more recent milestones, “metric storage improvements,” is described by the documentation as including better integration with Bigtable tools, a better retention policy, and the inclusion of time series.

The database is thus currently used largely as a metrics storage and computation tool, and it supports two storage mechanisms: Google Cloud Bigtable and Apache Cassandra (its initial primary storage mechanism).

Data Model

Key/Value

Heroic uses a key/value data model in which each key is comprised of a “unique set of tags and resource identifiers” that corresponds to a single series. A tag is a piece of data that is indexed and retained within the database, and each tag is stored together with its corresponding time series. Tags are therefore used in complex queries for both filtering and aggregation, as described in the GitHub documentation. A resource identifier, by contrast, is data that is not indexed, although it is still stored with its corresponding time series. The purpose of resource identifiers is to ensure that data which changes constantly can still be stored and accessed alongside its time series. As an example from the GitHub documentation, if the hostname field changes often, hostname would be kept as a resource identifier rather than a tag, so that the time-series data is maintained without retaining the field in the index. As such, resource identifiers are used for querying based on aggregations.
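The distinction between tags and resource identifiers can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the names SeriesKey, MetricStore, filter_by_tag and sum_by_resource are hypothetical and are not part of Heroic's API. Tags are part of the indexed series key, while resource identifiers are stored with each point and consulted only at aggregation time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeriesKey:
    """Hypothetical stand-in for a series key: a name plus indexed tags."""
    key: str
    tags: frozenset  # set of (tag name, tag value) pairs


def make_series(key, tags):
    """Build a hashable series key from a plain dict of tags."""
    return SeriesKey(key, frozenset(tags.items()))


class MetricStore:
    """Toy store: tags identify a series, resource identifiers ride with points."""

    def __init__(self):
        # SeriesKey -> list of (timestamp, value, resource identifiers)
        self._points = {}

    def write(self, series, timestamp, value, resource=None):
        point = (timestamp, value, dict(resource or {}))
        self._points.setdefault(series, []).append(point)

    def filter_by_tag(self, name, value):
        """Filtering only looks at tags, which are the indexed part."""
        return [s for s in self._points if (name, value) in s.tags]

    def sum_by_resource(self, series, resource_name):
        """Aggregation may group by a resource identifier stored with the points."""
        totals = {}
        for _, value, resource in self._points.get(series, []):
            group = resource.get(resource_name)
            totals[group] = totals.get(group, 0) + value
        return totals


store = MetricStore()
cpu = make_series("cpu", {"role": "db"})
store.write(cpu, 1, 2.0, {"hostname": "a"})
store.write(cpu, 2, 3.0, {"hostname": "b"})
store.write(cpu, 3, 4.0, {"hostname": "a"})
```

In this sketch a changing hostname does not create a new series, because hostname is not part of the key; it can still be used to group values during aggregation.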

Stored Procedures

Not Supported

Cassandra, the primary storage mechanism for Heroic, does not have stored procedures. Instead, logic is placed on the application side: a client or application-level program is written through which users can request to "load and store data" held in Cassandra.

Indexes

Inverted Index (Full Text)

Heroic uses Elasticsearch to index all of its data. The indexing structure of Heroic therefore mirrors that of Elasticsearch, which is an inverted index. When documents are indexed, each one is scanned for the unique words it contains, and the index stores every unique word together with all the instances in which that word was used. A search can then look up words directly rather than scanning every document. This also enables more contextual searches (i.e. searches which return the matching documents as well), and results in faster queries overall.
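The idea behind an inverted index can be shown in a few lines. The following is a minimal sketch, not how Elasticsearch or Heroic implement it: the function names and the whitespace tokenization are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each unique lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


documents = {
    1: "cpu usage database",
    2: "memory usage database",
    3: "cpu usage frontend",
}
index = build_inverted_index(documents)
matches = search(index, "cpu usage")
```

Each query term is resolved with a dictionary lookup, so the cost of a search depends on the number of query terms and matching documents rather than on the total number of documents.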

Storage Model

N-ary Storage Model (Row/Record)

As discussed above, Cassandra is Heroic’s primary storage mechanism, so Heroic also takes on the storage model of Cassandra, implying that Heroic has an n-ary storage model as well. In an n-ary storage model, related data is stored in tables with “n” columns, and the n attribute values of each record are kept together as a single row.
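What "row by row" means can be seen by flattening a small table. This is a generic illustration of the n-ary storage model, not Cassandra's on-disk format; the helper names to_row_layout and read_record are hypothetical.

```python
def to_row_layout(records):
    """Flatten records so that the n attributes of each record sit next to each other."""
    return [value for record in records for value in record]


def read_record(flat, index, n):
    """Read back the record at position `index` from a flat layout with n attributes."""
    return tuple(flat[index * n:(index + 1) * n])


records = [("cpu", 1, 0.5), ("mem", 2, 0.7)]
flat = to_row_layout(records)
second = read_record(flat, 1, 3)
```

Because all attributes of a record are adjacent, reading or writing one whole record touches a single contiguous region.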

Storage Organization

Log-structured

Likewise, the storage organization follows Cassandra’s, which is log-structured, that is, it uses a log-structured merge tree. By definition, a log-structured merge (LSM) tree is a key-value structure that performs well for files into which large quantities of data are inserted. An LSM tree can also be built from multiple component structures that prioritize different storage media, as with the two-level LSM tree, where one structure holds data in memory and the other holds data on disk, such that data can flow between the two structures. The data in an LSM tree is organized into runs, where each run is sorted by key. In Cassandra, one key can map to multiple values which correspond to multiple data rows, so a search of the tree has to retrieve all corresponding values.
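The two-level arrangement described above can be sketched as follows. This is a simplified model under stated assumptions, not Cassandra's implementation: the class name TwoLevelLSM and the memtable_limit parameter are hypothetical, the "disk" level is a list of in-memory sorted runs, each key holds a single value, and deletes are not modeled.

```python
import bisect


class TwoLevelLSM:
    """Toy LSM tree: an in-memory table plus sorted runs standing in for disk."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}   # level 0: recent writes, held in memory
        self.runs = []       # level 1: sorted runs, oldest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memory table out as a new run sorted by key."""
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        """Check memory first, then runs from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            # (key,) sorts just before (key, value), so this finds the entry.
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        """Merge all runs into one, keeping the newest value for each key."""
        merged = {}
        for run in self.runs:
            merged.update(run)  # later (newer) runs overwrite older ones
        self.runs = [sorted(merged.items())] if merged else []


lsm = TwoLevelLSM(memtable_limit=2)
lsm.put("a", 1)
lsm.put("b", 2)   # reaches the limit and flushes a first run
lsm.put("a", 3)   # newer value for "a" stays in memory for now
```

Writes only ever touch the memory table or append a new run, which is why this organization suits insert-heavy workloads; reads pay for that by possibly consulting several runs until compaction merges them.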

Query Interface

Custom API

Heroic has both its own query language (HQL) and a custom API that users can use. Compared with JSON queries, HQL was intended to be easier to use, and to be designed such that any complex request that can be expressed in JSON can also be expressed in HQL. As mentioned when describing the data model, HQL supports both aggregations and filtering over the corresponding time-series data, and enables custom filtering as well. With regard to the API itself, Heroic’s GitHub documentation lists many HTTP endpoints, which accept “GET, POST, PUT, and DELETE” requests. The documentation also describes various types, that is, ways in which the user can work with the API, ranging from metric collection, to querying date ranges, to statistics.

Storage Architecture

Disk-oriented

Because Heroic uses Cassandra as its primary form of storage, we assume that Heroic’s storage architecture is modeled on Cassandra’s as well. Cassandra is a disk-oriented database: its data is organized in columns, and the columns themselves are stored on disk. Each column on disk corresponds to a different data feature, and the entries of a column represent the different data points stored. Additionally, according to the Heroic documentation, Bigtable can also be used as a storage mechanism for Heroic, and like Apache Cassandra it has a disk-oriented storage architecture, as it uses a distributed file system.