Beringei is an embedded in-memory storage engine for time series data developed by Facebook. It was developed by Facebook to handle the growing scale of their system health monitoring data, as they needed a faster solution to storing constant stream of time series data, as well as a fast way to query this data.
The original name of the system was called Gorilla. The architecture was described in a VLDB 2015 research paper. The project was then open-sourced in 2017 and then discontinued in 2018. The original solution to handling monitoring data was an HBase-backed TSDB (Time series database), but the primary issues revolved around slow aggregate queries, and slow writes.
Beringei's primary data structure is a TSmap, or a Timeseries Map. It uses a C++ vector of shared-pointers to the time series names, and a case-insensitive map of time series names to the time series. The shared pointers allow scans to copy the vector quickly, while the map allows constant time lookup of a specific time series.
Facebook has a custom developed query service that runs in its Operational Data Store, ODS, and within Beringei, queries are extremely low latency. They have also developed a number of analysis tools made possible by said low latency read queries.
Facebook developed a time-series correlation engine for Beringei. It allows for users to search multiple time series in an interactive manner, currently limited to 1 million at a time. It calculates the Pearson Product Moment Correlation Coefficient (PPMCC), cross comparing a single time series against a set of time series. This aid's the answering of questions of causation, such as "What happened around the time my service broke" (direct quote from VLDB research paper)
Due to the low latency read queries, data visualizations that involve large sets of data can be created quickly, and allows users to visualize scans across large datasets. This makes it easy to spot outliers and anomalies.
Aggregations Beringei periodically runs a background process that scans the database every two hours, generating new aggregate values. This helps reduce load on the HBase cluster.
There is a single read-write spin lock that protects the TSmap and shared-pointer vector in the memory, and then a 1-byte spin lock is used for each time series. There is low contention between reads and writes since each individual series has low write throughput.
Beringei shard's monitoring data based on unique string keys, so that each time-series dataset can be mapped to a single host. Scaling of the database is simple since it has a share-nothing architecture.
Beringei can handle single node failures, network cuts, and entire datacenter failures by writing each time series value synchronously to two different hosts in two separate geographic regions. When a fault is detected, all read queries in one region are mapped to the other region so there is no disruption from the user end.
Beringei uses a checkpoint file to mark when complete block files are fully flushed to disk. When a file is complete, the database will mark this in the checkpoint and delete the corresponding logs. In the case of a failure to flush, such as a system crash, the checkpoint file will not exist when the system restarts, indicating that the block file may be corrupt, and it will read from the log instead.
Beringei features a time series compression algorithm that on average compresses each time series by 12x, without loss of resolution as some lossy compression algorithms would do, and works with double precision floating point values that are present in monitoring time series data.
The algorithm was derived from the floating point data compression scheme in scientific computation, that leverages XOR's with previous values to generate a delta encoding. Beringei does not do additional compression across the time series outside of this delta encoding. Each data point in the time series, the time stamp and the value, are represented by a pair of 64 bit values. Each timestamp/value is compressed separately, by using XOR's with the previous timestamp/value. The crux of the idea is that once we have stored the initial value, values following can XOR themselves with the previous value, and obtain the number of significant bits in difference, which is then encoded with a header that serializes the difference.
Time Stamp Compression
Beringei uses a "delta of deltas" compression for time stamps, as most time stamp data points are sent at a fixed interval. This allows the database to store the difference in time stamp, rather than the whole time stamp. It uses a simple schema for variable length encoding of time stamps after storing the initial time stamp, and the overall result is that 96% of time stamps can be compressed to a single bit.
Beringei compresses values in addition to time stamps. It uses the aformentioned XOR floating point compression with a variable length encoding scheme to store its values.
As Beringei does not offer ACID guarantees, there is no write-ahead-log. Instead, data is buffered up to 64kB and then flushed. This is typically 1-2 seconds of data, which was a designed tradeoff so that if the system crashed only a small amount of data would be lost.
Every two hours, the compressed block data is also copied to desk, as the compression makes it smaller than the log files. One complete block is produced for every two hours worth of data, structured as a set of 64kB data blocks, copied directly from memory, and a list of (time series ID, data block pointer) pairs.