Heroic

Revision #24 from 11/24/2019 4:02 p.m.

Heroic is an open-source time-series DBMS built at Spotify to compute and analyze metrics more effectively across large quantities of data. According to the GitHub documentation, one of the primary features of Heroic is that data can be stored in the database for long periods of time. The documentation also notes that the database has a “scalable architecture,” which is why it is able to analyze big data more effectively. Heroic achieves this through federated clusters, which allow a user query to be processed by multiple clusters, with the results merged behind a single interface. Heroic also uses Elasticsearch for indexing and querying, in order to answer more complex queries quickly and provide search suggestions for users. Lastly, once the metrics are computed, Heroic interacts with both Google Cloud Bigtable and Apache Cassandra for storage, the latter of which is especially useful for Heroic’s data retention capabilities.
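The federation idea can be illustrated with a minimal sketch. It is not Heroic's implementation: the names `federated_query`, `cluster_eu`, and `cluster_us` are hypothetical, and each cluster is modeled as a plain function that returns points per series. The sketch only shows the general pattern of sending one query to several clusters and merging the answers into a single result.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[int, float]                 # (timestamp, value)
ClusterResult = Dict[str, List[Point]]    # series name -> points
Cluster = Callable[[str], ClusterResult]  # hypothetical per-cluster query function


def federated_query(clusters: List[Cluster], query: str) -> ClusterResult:
    """Send the same query to every cluster and merge the results per series."""
    merged: ClusterResult = {}
    for run_query in clusters:
        for series, points in run_query(query).items():
            merged.setdefault(series, []).extend(points)
    # Present each merged series in timestamp order, as one interface would.
    for points in merged.values():
        points.sort()
    return merged


# Two stand-in clusters (assumptions for illustration only).
def cluster_eu(query: str) -> ClusterResult:
    return {"cpu-usage": [(20, 0.7), (10, 0.5)]}


def cluster_us(query: str) -> ClusterResult:
    return {"cpu-usage": [(15, 0.6)], "disk-usage": [(10, 0.2)]}


result = federated_query([cluster_eu, cluster_us], "cpu-usage or disk-usage")
```

Here `result` holds one entry per series, with the points from both stand-in clusters combined and ordered by timestamp.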

History

Heroic grew out of Spotify's need for a database that could store its metrics data at greater scale. The database is somewhat derivative, as it is based on and integrates features from Elasticsearch, Cassandra and BigTable. Heroic has gone through multiple releases, the most significant of which are documented on its website. While the first GitHub release occurred on April 9th, 2014, we will discuss some of Heroic’s more recent releases here. For example, the April 30th, 2019 release added the “dynamic Elasticsearch lookup” feature and enabled a feature allowing users to quantify data semantics. Another recent release moved Heroic to Java 11 and incorporated analytics that allow the developers to track how the database is being deployed and used. Furthermore, the “advanced query usability” release enabled users to express complex queries in Java as well, rather than only through a JSON-type language. Across these releases, Heroic has extended how users can query, for example via aggregations, and allows users to compute and store better metrics using BigTable, its time-series structure and improved data retention. Thus the database is currently used largely as a metrics storage and computation tool, and now includes both Google Cloud Bigtable and Apache Cassandra (its initial primary storage mechanism) as its two storage mechanisms.

Stored Procedures

Not Supported

Cassandra, the primary storage backend for Heroic, does not have stored procedures. Instead, logic is placed on the application side: a client or application-level program handles users' requests to "load and store data" held in Cassandra. According to the BigTable article, BigTable does not support stored procedures either.

Query Interface

Custom API

Heroic has both its own query language (HQL) and a custom API. HQL was intended to be easier to use than JSON queries, and was designed so that any complex request that can be expressed in JSON can also be expressed in HQL. HQL supports both data aggregations and custom filtering over the corresponding time series. As for the API itself, Heroic’s GitHub documentation describes many HTTP endpoints, using methods including “GET, POST, PUT, and DELETE.” The API documentation also defines various types, that is, structures through which the user interacts with the API, ranging from metric collections to query date ranges to statistics.

Storage Organization

Log-structured

The storage organization of Heroic likewise follows Cassandra’s: it is log-structured, that is, it uses a log-structured merge tree. A log-structured merge (LSM) tree is a key-value structure that performs well for workloads in which large quantities of data are inserted. An LSM tree can be built from multiple data structures that prioritize different storage media, as in the two-level LSM tree, where one structure holds data in memory and the other holds data on disk, and data flows from the first to the second. The data in an LSM tree is organized into runs, where each run is sorted by key. In Cassandra, one key can map to multiple values corresponding to multiple data rows, so a search of the tree has to collect all of the corresponding values. Furthermore, as described in the Storage Model section, each BigTable tablet is also structured as an LSM tree, so the BigTable side of Heroic's storage is log-structured as well.
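The two-level arrangement can be sketched in a few lines. This is a toy, in-memory illustration and not the storage engine of Cassandra or BigTable: the class name `TinyLSM` and its `memtable_limit` parameter are hypothetical, and the sorted "runs" are ordinary lists standing in for files on disk.

```python
import bisect
from typing import Dict, List, Optional, Tuple

Run = List[Tuple[str, str]]  # one run: (key, value) pairs sorted by key


class TinyLSM:
    """Toy two-level LSM tree: an in-memory table plus sorted runs."""

    def __init__(self, memtable_limit: int = 4) -> None:
        self.memtable_limit = memtable_limit
        self.memtable: Dict[str, str] = {}  # level 0: recent writes, in memory
        self.runs: List[Run] = []           # level 1: sorted runs, oldest first

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        """Write the memtable out as a new run sorted by key."""
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key: str) -> Optional[str]:
        if key in self.memtable:
            return self.memtable[key]
        # Search newest run first so that later writes win.
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self) -> None:
        """Merge all runs into one, keeping the newest value per key."""
        merged: Dict[str, str] = {}
        for run in self.runs:  # oldest first, so newer values overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []
```

Writes only touch the in-memory table, which is why inserts are cheap; the cost is moved to reads, which may have to check several runs, and to compaction, which merges runs in the background in real systems.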

Foreign Keys

Not Supported

Indexes

Inverted Index (Full Text)

Heroic uses Elasticsearch to index all of its data. Thus, the indexing structure of Heroic mirrors that of Elasticsearch, and is an inverted index. An inverted index stores each unique word together with all of the documents in which that word occurs, so a query can look up its words directly instead of scanning every document. This also enables more contextual searches (i.e. searches which return the matching documents as well), and results in faster queries overall.
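A minimal sketch of an inverted index follows. It illustrates the general data structure only, not Elasticsearch's implementation: the function names `build_inverted_index` and `search` are hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict
from typing import Dict, Set


def build_inverted_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each unique word to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the ids of documents that contain every word in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents describing time series (for illustration only).
docs = {
    "s1": "cpu usage database stockholm",
    "s2": "cpu usage frontend london",
    "s3": "disk usage database london",
}
index = build_inverted_index(docs)
matches = search(index, "database usage")
```

Each query word costs one dictionary lookup, and the candidate sets are then intersected, so the search never has to read the full text of documents that do not match.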

Data Model

Key/Value

Heroic uses a key/value data model, where each key corresponds to a single series and is represented by unique tags and resource identifiers. Tags are indexable data that will be retained within the database for long periods of time. Each tag is stored alongside its corresponding time series. As described by the GitHub documentation, tags can also be used in complex queries that involve both filtering and aggregations. A resource identifier, on the other hand, is data that is not indexed, yet is still stored alongside its corresponding time series. The purpose of resource identifiers is to ensure that data which changes constantly can still be stored and accessed by time series, without having to delete significant quantities of data every time its value changes. As the GitHub documentation illustrates, if the hostname field were to change often, we would keep hostname as a resource identifier rather than as a tag. Unlike tags, however, resource identifiers are used only for querying based on aggregations.
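The distinction between tags and resource identifiers can be sketched as follows. This is an illustrative model under stated assumptions, not Heroic's schema or code: the class `MetricStore` and its methods are hypothetical. It assumes that a series is identified by its key and tags, that only tags are indexed, and that resource identifiers are stored with each point.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

SeriesId = Tuple[str, Tuple[Tuple[str, str], ...]]  # (key, sorted tag pairs)
StoredPoint = Tuple[int, float, Dict[str, str]]     # (timestamp, value, resource)


class MetricStore:
    """Toy store in which tags are indexed and resource identifiers are not."""

    def __init__(self) -> None:
        self.points: Dict[SeriesId, List[StoredPoint]] = defaultdict(list)
        self.tag_index: Dict[Tuple[str, str], Set[SeriesId]] = defaultdict(set)

    def write(self, key: str, tags: Dict[str, str], resource: Dict[str, str],
              timestamp: int, value: float) -> SeriesId:
        series: SeriesId = (key, tuple(sorted(tags.items())))
        for pair in tags.items():
            self.tag_index[pair].add(series)  # only tags enter the index
        # The resource identifier travels with the point, outside the index.
        self.points[series].append((timestamp, value, dict(resource)))
        return series

    def find(self, tag: str, value: str) -> Set[SeriesId]:
        """Look up series by an indexed tag."""
        return set(self.tag_index.get((tag, value), set()))


store = MetricStore()
tags = {"role": "database", "site": "lon"}
store.write("cpu-usage", tags, {"hostname": "host-a"}, 10, 0.5)
store.write("cpu-usage", tags, {"hostname": "host-b"}, 20, 0.6)
```

Because `hostname` is kept as a resource identifier in this sketch, the change from `host-a` to `host-b` adds a point to the same series instead of creating a new indexed series.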

Storage Model

N-ary Storage Model (Row/Record), Custom

Because Cassandra is Heroic’s primary storage mechanism, Heroic also takes on Cassandra's storage model, which implies that Heroic has an n-ary storage model. An n-ary storage model means that all related data is stored in tables where each table has “n” columns, thus defining the n-ary relationship. However, because Heroic also uses BigTable as a storage mechanism, we also opted to record that Heroic has a custom storage model: according to the BigTable article, BigTable data is organized via a concept called "tablets," such that rows in multiple tablets make up a table. BigTable also enables users to create "SSTables" from locality groups, which are essentially columns grouped together in a table.

Storage Architecture

Disk-oriented

Because Heroic uses Cassandra as its primary form of storage, we will assume that Heroic’s storage architecture is modeled on Cassandra’s as well. Cassandra is a disk-oriented database: data in Cassandra is organized into columns, and the columns themselves are stored on disk. Each column on disk corresponds to a different data feature, and the entries in a column represent the stored data points. Additionally, according to the Heroic documentation, BigTable can also be used as a storage mechanism for Heroic and, similar to Apache Cassandra, has a disk-oriented storage architecture, as it uses a distributed file system.