Heroic

Heroic is an open-source time-series DBMS built at Spotify to compute and analyze metrics more effectively across large quantities of data. According to its GitHub documentation, one of Heroic’s primary features is that data can be stored in the database for long periods of time. The documentation also notes that the database is designed to adapt as the quantity of data grows. Heroic uses the concept of federated clusters, which lets a user query be processed by multiple clusters, with the results merged and returned through a single system. Heroic also uses Elasticsearch for indexing and querying, which makes complex queries faster and provides search suggestions for users. Finally, once the metrics are computed, Heroic stores them in Google Cloud Bigtable or Apache Cassandra, the latter of which is especially useful for Heroic’s data retention capabilities.
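The federated query idea can be illustrated with a minimal sketch. This is not Heroic's code: the function name `federated_query`, the cluster names, and the point format are all assumptions made for illustration. Each cluster is modeled as a plain function that answers a query with a list of (timestamp, value) points, and the results are merged into one time-ordered list.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[int, float]                  # (timestamp, value)
ClusterQuery = Callable[[str], List[Point]]


def federated_query(clusters: Dict[str, ClusterQuery], query: str) -> List[Point]:
    """Send the same query to every cluster and merge the answers.

    Hypothetical helper: a real federation layer would also handle
    failures, timeouts, and aggregation across clusters.
    """
    merged: List[Point] = []
    for run_query in clusters.values():
        merged.extend(run_query(query))
    # Return a single result ordered by timestamp.
    merged.sort(key=lambda point: point[0])
    return merged


# Two stand-in clusters, each holding part of the data.
clusters = {
    "eu": lambda q: [(1, 0.5), (3, 0.7)],
    "us": lambda q: [(2, 0.6)],
}
result = federated_query(clusters, "cpu-usage")
```

The caller sees one merged result and does not need to know which cluster held which points.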

History

Heroic grew out of the need of its developers, the company Spotify, for a database that could store data more scalably. The database is somewhat derivative, as it builds on and integrates features from Elasticsearch, Cassandra and Bigtable. Heroic has gone through multiple releases, the most significant of which are documented on its website. While the first GitHub commit occurred on June 3rd, 2015, we will discuss some of Heroic’s more recent releases here. For example, the April 30th, 2019 release added an Elasticsearch lookup feature and a feature allowing users to quantify data semantics. Another recent release moved Heroic to Java 11 and incorporated analytics that allow the developers to track how the database is being deployed and used. A following release enabled users to write more complex queries from Java rather than only in a JSON-like language. Across these releases, Heroic has expanded how users can query, for example via aggregations, and allows users to compute and store better metrics using Bigtable, its time-series structure and improved data retention. Thus the database is currently used largely as a metrics storage and computation tool, and now includes both Google Cloud Bigtable and Apache Cassandra (its initial primary storage mechanism) as its two storage mechanisms.

Storage Model

N-ary Storage Model (Row/Record) Custom

As discussed above, Cassandra is Heroic’s primary storage mechanism, so Heroic takes on Cassandra’s storage model, which implies that Heroic has an n-ary storage model as well. In an n-ary storage model, all related data is stored in tables where each table has “n” columns, which defines the n-ary relationship. However, because Heroic also uses Bigtable as a storage mechanism, we also list a custom storage model: according to the Bigtable article, Bigtable data is organized into tablets, where each tablet holds a range of rows and the tablets together make up one table.

Data Model

Key/Value

Heroic uses a key/value data model, where each key corresponds to a single series and is represented by unique tags and resource identifiers. Tags are indexed data that is retained within the database for long periods of time. Each tag is stored alongside its corresponding time series. Tags can also be used in complex queries that request both filtering and aggregations, as described in the GitHub documentation. A resource identifier, on the other hand, is data that is not indexed, yet is still stored alongside its corresponding time series. The purpose of resource identifiers is to ensure that data which is constantly changing can still be stored and accessed based on its time series without having to delete significant quantities of data every time its value changes. As the GitHub documentation illustrates, if the hostname field were to change often, we would keep hostname as a resource identifier rather than a tag. Unlike tags, however, resource identifiers can only be accessed in queries through aggregations.
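The difference between tags and resource identifiers can be sketched with a toy in-memory model. This is not Heroic's schema: the names `SeriesKey`, `Series`, `store` and `write` are hypothetical, and the layout (tags inside the indexed key, resource identifiers kept next to each point) is an assumption chosen to mirror the hostname example above.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class SeriesKey:
    """Indexed identity of a series: a key name plus its tags."""
    key: str
    tags: FrozenSet[Tuple[str, str]]


@dataclass
class Series:
    # Each point: (timestamp, value, resource identifiers).
    # Resource identifiers ride along with the data but are not indexed.
    points: List[Tuple[int, float, Dict[str, str]]] = field(default_factory=list)


store: Dict[SeriesKey, Series] = {}


def write(key: str, tags: Dict[str, str], resource: Dict[str, str],
          timestamp: int, value: float) -> SeriesKey:
    """Append one point; only key and tags decide which series it joins."""
    series_key = SeriesKey(key, frozenset(tags.items()))
    store.setdefault(series_key, Series()).points.append(
        (timestamp, value, dict(resource))
    )
    return series_key


# The hostname changes between writes, but it is a resource identifier,
# so both points land in the same series.
a = write("cpu-usage", {"role": "db"}, {"hostname": "host-1"}, 1, 0.4)
b = write("cpu-usage", {"role": "db"}, {"hostname": "host-2"}, 2, 0.6)
```

Had hostname been a tag in this model, every hostname change would have created a new indexed series, which is the situation resource identifiers are meant to avoid.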

Stored Procedures

Not Supported

Cassandra, the primary storage mechanism for Heroic, does not have stored procedures. Instead, users have to write application-level programs through which they can access and manipulate the data. According to the Bigtable article, Bigtable does not support stored procedures either.

Query Interface

Custom API

Heroic offers both its own query language (HQL) and a custom API. HQL was intended to be easier to use than JSON while still matching the complexity and structure of JSON-based queries. As mentioned when describing the data model, HQL supports both aggregations and custom filtering based on the data’s corresponding time series. As for the API itself, Heroic’s GitHub documentation describes many HTTP endpoints that accept requests such as GET and POST. The API documentation also lists various types, that is, ways in which the user can use the API, which range from metric collection to querying date ranges to statistics.

Storage Architecture

Disk-oriented

Because Heroic uses Cassandra as its primary form of storage, we will assume that Heroic’s storage architecture is modeled on Cassandra’s. Cassandra is a disk-oriented database: its data is stored in columns, and the columns themselves are stored on disk. Each column on disk corresponds to a different data feature, and each row made up of these features represents one data tuple or point. Additionally, according to the Heroic documentation, Bigtable can also be used as a storage mechanism for Heroic and, like Apache Cassandra, has a disk-oriented storage architecture, as it is based on a distributed file system.

Storage Organization

Log-structured

Likewise, the storage organization follows Cassandra’s: Apache Cassandra is log-structured, as it uses a log-structured merge tree. A log-structured merge (LSM) tree is a key-value structure optimized for workloads in which large quantities of data are inserted. An LSM tree can be composed of multiple data structures, each of which serves a different storage level. For example, in a two-level LSM tree, one structure holds data in memory and the other holds data on disk, and data can flow between the two structures. Any data in the LSM tree that is stored at the disk level is organized into runs, where each run is sorted by key. In Cassandra, one key can map to multiple values, where each value is a data row. Furthermore, as described above with regard to the Bigtable tablet storage model, each tablet is also structured as an LSM tree, so the Bigtable side of Heroic’s storage is also log-structured.
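The two-level LSM tree described above can be sketched in a few lines. This is a teaching model only, not Cassandra's or Bigtable's implementation: the class name `TinyLSM` and the `memtable_limit` parameter are hypothetical, the "disk" level is simulated with in-memory lists, and compaction of runs is left out.

```python
from bisect import bisect_left
from typing import Dict, List, Optional, Tuple


class TinyLSM:
    """Two-level LSM sketch: an in-memory table plus sorted runs."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable: Dict[str, str] = {}            # memory level
        self.runs: List[List[Tuple[str, str]]] = []   # "disk" level, newest last
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        # Writes only touch memory, which is what makes inserts cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # Move the memory level to the disk level as one run sorted by key.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key: str) -> Optional[str]:
        if key in self.memtable:
            return self.memtable[key]
        # Newer runs shadow older ones, so search from newest to oldest.
        for run in reversed(self.runs):
            # (key,) sorts just before any (key, value) pair, so this
            # binary search finds the first entry whose key is >= key.
            i = bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("b", "1")
db.put("a", "2")   # second write reaches the limit and triggers a flush
db.put("a", "3")   # newer value stays in memory and shadows the run
```

Reads check memory first and then the runs from newest to oldest, so the most recent write for a key is always the one returned.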

Indexes

Inverted Index (Full Text)

Heroic uses Elasticsearch to index all of its data. Thus, the indexing structure of Heroic mirrors that of Elasticsearch and is an inverted index. An inverted index stores each unique word found in the indexed data together with all the instances in which that word was used, so a query can look up its words directly instead of scanning every instance. This also enables more contextual searches (i.e. searches which provide the resulting documents as well), and results in faster queries overall.
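A minimal sketch of an inverted index follows. It is not how Elasticsearch is implemented: the function names `build_inverted_index` and `search` are hypothetical, tokenization is plain whitespace splitting, and the sample documents are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Set


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each unique word to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return ids of documents that contain every word in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "cpu usage db",
    2: "memory usage db",
    3: "cpu usage web",
}
index = build_inverted_index(docs)
matches = search(index, "cpu usage")
```

The query is answered by intersecting two small sets of document ids rather than rereading every document, which is where the speedup comes from.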