Heroic is an open-source time-series DBMS built at Spotify to more effectively compute and analyze data-metrics across large quantities of data. According to the GitHub documentation, one of the primary features of Heroic DB is that data can be stored in the database for long periods of time. As also noted in the documentation, the database is designed to adapt to and handle increasing quantities of data. Heroic also uses the concept of Federated clusters, which enable the database to process user queries via multiple clusters such that the results are merged and outputted via a single system. Heroic also utilizes Elasticsearch DB for the indexing and querying processes in order to output complex queries more quickly and provide search suggestions for users. And lastly, once the metrics are computed, Heroic interacts with both Google Cloud Bigtable and Apache Cassandra for storage purposes, the latter of which is especially useful when it comes to Heroic’s long-term data storage capabilities.
Heroic came out of the need to develop a database that could store more scalable quantities of data for its developers, i.e. the company Spotify. The database is somewhat derivative as it is based off of and integrates features from Elasticsearch DB, Cassandra and BigTable. Although it is still in development mode, and not completely safely ready for public use, Heroic has gone through numerous revisions since June 3rd, 2015 when the first GitHub Commit occurred. More recently on April 30th, 2019 is when Heroic evolved to include the Elasticsearch lookup feature and enable users to quantify their data’s semantics. Over time,Heroic also grew to incorporate analytics which allowed the developers to track how the database is being deployed and used. Striving to be a user-friendly database, Heroic added a feature which could conduct more complex queries across both java and python and not just using JSON type languages. Thus, in its current status, Heroic is a metrics storage and computation tool which includes both Cassandra DB and now also Google Cloud Bigtable for storage purposes as well as Elasticsearch DB for indexing. Thus, the database has shaped the way users can conduct queries (such as via aggregations), and using BigTable, also enables them to better compute and store metrics.
Although HeroicDB itself does not have compression enabled, users can conduct compression for Heroic through BigTable as BigTable is used as one of the storage systems for Heroic DB. BigTable uses two compression algorithms called BMDiff and Zippy. For each (key, value) pair corresponding to each data point, the key is generated by compressing the data into a tuple consisting of the row, column location of the data as well as its corresponding timestamp. The value corresponding to this compressed key is generated through the BMDiff algorithm, where data points are organized with respect to their columns and BMDiff is run for each respective column type. After getting this pair, the BigTable then runs the Zippy algorithm in order to further compress the initial data points (now key value pairs) and minimize repeated entries.
Heroic uses a key/value data model, where each key corresponds to a single series, and is represented by unique tags and resource identifiers. Tags are indexable data that will be retained within the database for long periods of time. When each tag is stored, it’s also stored alongside its corresponding time-series. Tags can also be accessed via complex queries requesting both filtering and aggregations, as described by the GitHub Documentation. On the other hand, a Resource Identifier is data that is not indexed, yet is still stored alongside its corresponding-time series. The purpose of resource identifiers itself is to ensure that data which is constantly changing can still be stored and accessed based on its time-series without having to delete significant data every time its value changes. As the GitHub documentation exemplifies, if the hostname field were to change often, rather than retaining the field, we would keep hostname as a Resource Identifier and not a tag. Unlike tags however, resource identifiers can only be accessed in queries based off of aggregations.
The Elasticsearch DB is used by Heroic to Index all of its data. Thus, the indexing structure of Heroic mirrors that of Elasticsearch DB, and is an inverted index. The benefits of this type of index is that upon conducting the search for a query, it looks through all possible locations to find all instances of the words in that query, and stores each unique word alongside all the instances in which that word was used. This enables more contextual searches (i.e. searches which provide the resulting documents as well), and results in faster queries overall.
Heroic uses the SLF4J technique to enable whichever logging framework users wish to use. Heroic DB specifically uses the Apache log4j framework, which enables users to choose amongst asynchronous and synchronous logging, where asynchronous logging flushes logs to disk in batches rather than immediately every time a log occurs. The documentation for apache log4j 2 states that each and every log event is documented. However, the logs itself do not include a full before and after image of the change which was applied, and thus we glean that Heroic DB uses physiological logging to store all the changes without having an extensive amount of detail per log.
Because Elasticsearch conducts the searching process for Heroic DB queries, we will assume that the query compilation is done via Elasticsearch's compilation mechanism as well. As such, we claim that Heroic DB also has a code generation query compilation mechanism. However, due to Heroic's user-friendly HQL language, users can receive outputs formatted similar to JSON results by making JSON-format like queries (as described in the query interface section) even when not using JSON compatible languages due to one of Heroic’s more recent release advancements.
Because Heroic DB's search queries are conducted through Elasticsearch, Heroic’s Query Execution Model will also mirror that of Elasticsearch. Elasticsearch uses a Materialized model, where the Elasticsearch data itself is organized into shards such that any query will have to search all possible shards and check each one for matches, combining all of the final results to produce a sorted output page. This materialized view strategy for Heroic is also verified as according to Heroic documentation, Heroic DB incorporates the concept of sharding to make data partitions.
Heroic has both a unique Query Language (HQL) as well as a Custom API which users can use. In comparison to JSON, HQL was intended to be an easier to use language that still matches the complexity and structure of JSON-based queries. As mentioned when describing the indexing model, HQL supports both data aggregations and custom filtering based on the data’s corresponding time series. With regards to the actual API, Heroic’s GitHub Documentation entails many endpoints similar to that of HTTP requests, such as GET and POST. The API also mentions various types, that is, ways in which the user can utilize the API, ranging from metric collection, to querying date ranges, to statistics.
Because Heroic uses Cassandra as one of its storage forms, we will assume that Heroic’s Storage Architecture is modeled off of Cassandra’s. Cassandra is a disk-oriented database, as data in Cassandra is stored in columns such that the columns itself are stored on disk. Each column on disk corresponds to a different data feature, and each row made up by these features represents one data tuple or point. Additionally, according to the Heroic documentation, BigTable can also be used as a storage mechanism for Heroic, and similar to Apache Cassandra, also has a disk-oriented storage architecture as it is based on a distributed file system.
N-ary Storage Model (Row/Record) Custom
Similar to what was discussed before regarding Cassandra being one of Heroic’s storage mechanisms, Heroic also takes on the storage model of Cassandra implying that Heroic has an n-nary storage model as well. An n-nary storage model means that all related data is stored in tables where the table has “n” columns defining the n-nary relationship. However, because Heroic also uses BigTable as a storage mechanism, we also opted to include that Heroic also utilizes Custom Storage Model, as according to the BigTable article, BigTable data is organized via a concept called tablets such that the rows corresponding to different tablets make up one table.
Likewise, the storage organization also models that of Cassandra’s, where Apache Cassandra has a log-structured model as it utilizes a log structured merge tree. By definition, a log-structured merge tree (LSM) tree is a key-value based tree that best used for inserts in files where large quantities of data are inserted. Additionally, LSM trees can be comprised of multiple data structures building up the tree, where each data structure prioritizes a different storage level. For example, with the two-level LSM tree, one structure has data from memory and the other has data from disk such that data can still flow across the two structures. Any data in the LSM tree that is stored at the disk level is sorted into runs sorted by its corresponding keys. For Cassandra, one key can map to multiple values, where each value is a data row. Furthermore, as described above with regards to the BigTable tablet storage model, because each tablet is also structured in the form of a LSM tree, the BigTable Storage aspect of Heroic's storage is also log-structured.
Cassandra, one of the storage models for Heroic, does not have stored procedures. Rather, users have to develop application-based programs through which they can access and manipulate the database data. As per the BigTable article as well, stored procedures are not supported by BigTable either.
Once again, as Cassandra and Cloud BigTable make up Heroic DB's data storage mechanisms, we claim that Heroic DB's system architecture also mirrors that of the aforementioned databases. As mentioned in the Cassandra writeup, Cassandra utilizes a shared-nothing architecture as the database is divided into shards, each of which each is responsible for its respective data. However, the datasets corresponding to each shard itself are not unique as if this were the case, a failure in any single node would result in unrecoverable data. Additionally, from the GitHub documentation, we know that Heroic DB incorporates the concept of federated clusters and divides the database such that each of the database's shards have ownership over its respective data (where the data itself is replicated across multiple nodes).
https://spotify.github.io/heroic/
https://github.com/spotify/heroic
https://spotify.github.io/heroic/#!/docs/overview
Spotify
2015
2019
Cassandra, Cloud BigTable, Elasticsearch, ksqlDB