Heroic

Heroic is an open-source time-series DBMS built at Spotify to compute and analyze metrics across large quantities of data more effectively. According to the GitHub documentation, one of the primary features of Heroic is that data can be stored in the database for long periods of time. The documentation also notes that the database is designed to scale as the quantity of data increases. Heroic uses the concept of federated clusters, which allow a user query to be processed by multiple clusters, with the results merged and returned by a single system. Heroic relies on Elasticsearch for indexing and querying in order to answer complex queries more quickly and to provide search suggestions for users. Lastly, once the metrics are computed, Heroic stores them in either Google Cloud Bigtable or Apache Cassandra, the latter of which is especially useful when it comes to Heroic’s long-term data storage capabilities.

History

Heroic came out of Spotify's need for a database that could store metrics at a larger scale for its own developers. The database is somewhat derivative, as it builds on and integrates features from Elasticsearch, Cassandra and BigTable. Heroic has gone through multiple releases, the most significant of which are documented on its website. While the first GitHub commit occurred on June 3rd, 2015, we will discuss some of Heroic’s more recent releases here. For example, the release of April 30th, 2019 added an Elasticsearch lookup feature and enabled a feature allowing users to quantify data semantics. Another recent release incorporated analytics that allow the developers to track how the database is being deployed and used. A following release enabled users to issue more complex queries from both Java and Python, rather than only through JSON-based queries. Through newer releases, Heroic has updated how users can conduct queries, such as via aggregations, and has allowed users to better compute and store metrics using BigTable. The database is thus currently used largely as a metrics storage and computation tool, and it now supports two storage mechanisms: Apache Cassandra (its primary storage mechanism) and Google Cloud Bigtable.

Indexes

Inverted Index (Full Text)

Heroic uses Elasticsearch to index all of its data. Thus, the indexing structure of Heroic mirrors that of Elasticsearch, which is an inverted index. An inverted index stores each unique word alongside a list of all the documents in which that word appears, so a search only has to look up the words in the query rather than scan every document. This enables more contextual searches (i.e. searches which return the matching documents as well) and results in faster queries overall.
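
As a rough illustration of the idea, the following minimal Python sketch builds an inverted index over a few made-up documents and answers a multi-word query by intersecting posting lists. The function names and sample documents are hypothetical and are not taken from Heroic or Elasticsearch, whose actual index is far more elaborate.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical documents, standing in for indexed series metadata.
docs = {
    1: "cpu usage host a",
    2: "memory usage host b",
    3: "cpu temperature host b",
}
index = build_inverted_index(docs)
matches = search(index, "cpu host")
```

Because each term is looked up directly, the cost of a query depends on the length of the posting lists involved rather than on the total number of documents.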

Stored Procedures

Not Supported

Cassandra, the primary storage mechanism for Heroic, does not have stored procedures. Rather, users have to write application-level programs through which they can access and manipulate the data. According to the BigTable article, BigTable does not support stored procedures either.

Query Compilation

Code Generation

Because Elasticsearch conducts the querying process for Heroic DB, we will assume that query compilation is done via Elasticsearch's mechanism as well. As such, we classify Heroic DB as using code generation for query compilation. In addition, thanks to Heroic's user-friendly HQL language, users can still receive results formatted like those of JSON-based queries (as described below) even when they are not writing JSON, which is noted in the History section as one of Heroic's advancements.

Compression

Prefix Compression

Although Heroic itself does not implement its own compression, data stored through Heroic can be compressed by BigTable. BigTable uses two custom compression algorithms, BMDiff and Zippy, and the keys are compressed via prefix compression, where each key consists of the row and column location as well as the value's timestamp. Meanwhile, the value consists of the compressed value's corresponding column name as well as its BMDiff result. After producing this pair, the BigTable compression scheme runs the Zippy algorithm to further compress the values (now key/value pairs) and minimize repeated entries.
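
To make the notion of prefix compression concrete, here is a minimal Python sketch of generic front coding over sorted keys: each key is stored as the length of the prefix it shares with the previous key plus the remaining suffix. This is an illustration of the general technique only. It is not BMDiff or Zippy, and the key layout used in the example is an assumption.

```python
def prefix_compress(sorted_keys):
    """Encode each key as (shared_prefix_length, suffix) relative to the previous key."""
    encoded = []
    prev = ""
    for key in sorted_keys:
        shared = 0
        limit = min(len(prev), len(key))
        while shared < limit and prev[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded


def prefix_decompress(encoded):
    """Rebuild the original keys from (shared_prefix_length, suffix) pairs."""
    keys = []
    prev = ""
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


# Hypothetical "row:column:timestamp" keys, already in sorted order.
keys = [
    "host-a:cpu:1000",
    "host-a:cpu:1060",
    "host-a:mem:1000",
    "host-b:cpu:1000",
]
encoded = prefix_compress(keys)
restored = prefix_decompress(encoded)
```

Sorted time-series keys tend to share long prefixes, which is why storing only the differing suffix saves space.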

Query Execution

Materialized Model

Because Heroic DB's queries are conducted through Elasticsearch (despite Heroic having its own query language, as discussed below), the query execution model also mirrors that of Elasticsearch. Elasticsearch uses a materialized model: its data is organized into shards, and a query has to search the relevant shards, which are determined by the data's "index" and "type" fields. Elasticsearch then searches each of those shards for matching documents and combines the per-shard results to produce a sorted output page.
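
The shard-by-shard execution described above can be sketched as a scatter-gather step in Python. Each shard fully materializes its own sorted list of hits, and a coordinator merges those lists into one result page. The shard layout, the "score" field and the function names are assumptions made for this illustration and do not reflect Elasticsearch's internal API.

```python
import heapq
from itertools import islice


def query_shard(shard, predicate):
    """Materialize one shard's matching (score, doc_id) pairs, best score first."""
    hits = [(doc["score"], doc["id"]) for doc in shard if predicate(doc)]
    hits.sort(reverse=True)
    return hits


def scatter_gather(shards, predicate, page_size):
    """Query every shard, then merge the sorted per-shard lists into one page."""
    per_shard = [query_shard(shard, predicate) for shard in shards]
    merged = heapq.merge(*per_shard, reverse=True)
    return [doc_id for _, doc_id in islice(merged, page_size)]


# Two hypothetical shards holding documents with a precomputed score.
shards = [
    [{"id": "a", "score": 0.9, "key": "cpu"}, {"id": "b", "score": 0.2, "key": "mem"}],
    [{"id": "c", "score": 0.7, "key": "cpu"}, {"id": "d", "score": 0.95, "key": "cpu"}],
]
page = scatter_gather(shards, lambda doc: doc["key"] == "cpu", page_size=2)
```

Each per-shard list is produced in full before the merge starts, which is the defining trait of a materialized execution model.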

Storage Architecture

Disk-oriented

Because Heroic uses Cassandra as its primary form of storage, we will assume that Heroic’s storage architecture is modeled on Cassandra’s. Cassandra is a disk-oriented database: data in Cassandra is organized into columns, and the columns themselves are stored on disk. Each column corresponds to a different data feature, and each row made up of these features represents one data tuple or point. Additionally, according to the Heroic documentation, BigTable can also be used as a storage mechanism for Heroic. Like Apache Cassandra, BigTable has a disk-oriented storage architecture, as it is built on a distributed file system.

Logging

Physiological Logging

Heroic uses SLF4J so that users can plug in whichever logging framework they wish. By default, Heroic DB uses the log4j framework as its SLF4J backend. The log4j framework supports asynchronous logging, in which logs are flushed to disk in batches rather than written immediately one by one. The documentation for Apache Log4j 2 states that every log event is recorded. However, the logs do not include a full before and after image of each change that was applied, and thus we infer that Heroic DB uses physiological logging to store all the changes without an extensive amount of detail per log entry.

Storage Model

N-ary Storage Model (Row/Record) Custom

As discussed above, Cassandra is Heroic’s primary storage mechanism, so Heroic also takes on Cassandra's storage model, which implies that Heroic has an n-ary storage model. An n-ary storage model means that all related data is stored in tables where each table has “n” columns, thus defining the n-ary relationship. However, because Heroic can also use BigTable as a storage mechanism, we additionally list a custom storage model: according to the BigTable article, BigTable data is organized into units called tablets, such that the rows belonging to the different tablets together make up one table.

Data Model

Key/Value

Heroic uses a key/value data model, where each key corresponds to a single series and is represented by unique tags and resource identifiers. Tags are indexed data that will be retained within the database for long periods of time. Each tag is stored alongside its corresponding time series. Tags can also be used in complex queries involving both filtering and aggregations, as described by the GitHub documentation. A resource identifier, on the other hand, is data that is not indexed, yet is still stored alongside its corresponding time series. The purpose of resource identifiers is to ensure that data which is constantly changing can still be stored and accessed with its time series without having to delete significant data every time its value changes. As the GitHub documentation exemplifies, if the hostname field were to change often, we would keep hostname as a resource identifier rather than as a tag. Unlike tags, however, resource identifiers can only be used in the aggregation part of a query.
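
The distinction between tags and resource identifiers can be sketched in Python as follows. Tags are part of the series identity and can be filtered on, while resource identifiers travel with each stored point and do not affect which series it belongs to. The class, method and field names here are hypothetical and are not Heroic's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Series:
    """Identity of one time series: a key plus its indexed tags."""
    key: str
    tags: tuple  # sorted (name, value) pairs


def make_series(key, tags):
    return Series(key, tuple(sorted(tags.items())))


class MetricStore:
    def __init__(self):
        # Series -> list of (timestamp, value, resource identifiers)
        self.points = {}

    def write(self, series, timestamp, value, resource=None):
        """Append a point; resource identifiers are stored but not indexed."""
        self.points.setdefault(series, []).append((timestamp, value, dict(resource or {})))

    def find_series(self, **tag_filter):
        """Filter series by tags only; resource identifiers are not searchable."""
        wanted = tag_filter.items()
        return [s for s in self.points if all(pair in s.tags for pair in wanted)]


store = MetricStore()
series = make_series("cpu-usage", {"role": "db", "site": "lon"})
# The hostname changes between writes, yet both points land in the same series.
store.write(series, 1000, 0.42, resource={"hostname": "db-1"})
store.write(series, 1060, 0.47, resource={"hostname": "db-7"})
```

Keeping the frequently changing hostname out of the series identity avoids creating a new indexed series every time its value changes.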

Storage Organization

Log-structured

Likewise, the storage organization models that of Cassandra, which is log-structured because it uses a log-structured merge tree. By definition, a log-structured merge (LSM) tree is a key-value structure that is best suited to workloads in which large quantities of data are inserted. An LSM tree can be composed of multiple data structures, each of which corresponds to a different storage level. For example, in a two-level LSM tree, one structure holds data in memory and the other holds data on disk, and data can flow between the two structures. Any data in the LSM tree that is stored at the disk level is organized into runs sorted by key. For Cassandra, one key can map to multiple values, where each value is a data row. Furthermore, as described above with regard to the BigTable tablet storage model, each tablet is also structured as an LSM tree, so the BigTable side of Heroic's storage is log-structured as well.
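
A two-level LSM tree of the kind described above can be sketched in a few lines of Python. Writes go to an in-memory table, which is flushed as a sorted run once it fills up, and reads check the memory level first and then the runs from newest to oldest. This is a toy model under simplifying assumptions (runs are kept in Python lists rather than files, and there is no write-ahead log); the class name and the memtable limit are hypothetical, and it is not how Cassandra or BigTable is implemented.

```python
import bisect


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}  # in-memory level
        self.runs = []      # "disk" level: sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as a new run sorted by key."""
        if self.memtable:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        """Look in memory first, then in each run from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        """Merge all runs into one, keeping the newest value for each key."""
        merged = {}
        for run in reversed(self.runs):  # oldest first, so newer values overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)  # memtable is full: flushed as the first run
db.put("a", 3)
db.put("c", 4)  # flushed as a second, newer run
```

Inserts never modify an existing run, which is what makes this organization well suited to write-heavy metric workloads; stale values are only discarded during compaction.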

Isolation Levels

Not Supported

Heroic DB does not support transactions, nor does it track changes made to the data, regardless of whether the data has been committed or whether any conflicts exist.

Query Interface

Custom API

Heroic offers both its own query language (HQL) and a custom API. HQL was intended to be easier to use than JSON while still matching the complexity and structure of JSON-based queries. As mentioned when describing the data model, HQL supports both data aggregations and custom filtering based on the data’s corresponding time series. As for the API itself, Heroic’s GitHub documentation lists many HTTP endpoints that accept requests such as GET and POST. The API documentation also describes various types, that is, ways in which the user can utilize the API, ranging from metric collection, to querying date ranges, to statistics.

Joins

Not Supported

Given that the storage backends for Heroic, Cassandra and Cloud Bigtable, do not support joins, Heroic does not support joins either, as any join would have to be carried out by the underlying storage systems.