TileDB is a storage engine for storing and accessing both dense and sparse multi-dimensional arrays. Its key idea is to store array elements in collections called fragments, each of which is either dense or sparse. Every fragment stores its data in data tiles. In a dense fragment, the capacity of a data tile is bounded by a fixed chunk size. In a sparse fragment, it is bounded by a fixed number of elements. TileDB also supports parallel I/O and is completely multi-threaded.
TileDB is designed to store many different types of data, such as genomic data, machine learning model parameters, imaging data, and LiDAR data.
TileDB was invented at the Intel Science and Technology Center for Big Data. The research center was a collaboration between Intel Labs and MIT. The research project was published in a VLDB 2017 paper. TileDB, Inc. was founded in February 2017 to further develop and maintain the DBMS.
Dictionary Encoding, Delta Encoding, Run-Length Encoding
TileDB supports the following compressors: bzip2, dictionary, double-delta, gzip, LZ4, RLE, and Zstandard. It also supports a few data filters that usually function as compressors, such as the bit width reduction filter, float scaling filter, positive delta filter, and WebP filter. The custom compressors and filters work as follows:
- The double delta compressor is similar to the one in Facebook's Gorilla system. However, TileDB's compressor uses a fixed bit-size instead of a variable bit-size.
- The dictionary encoding filter is a lossless compressor that builds a dictionary of all the unique strings in the input data and stores indexes into the dictionary instead of the strings themselves.
- The bit width reduction filter takes input data of an unsigned integer type and compresses it to a smaller bit width if possible.
- The float scaling filter is a lossy compressor that takes input data of a floating point type. Given a scale factor, an offset, and a byte width, the filter computes round((input_data[i] - offset) / scale), casts the result to an integer type of the specified byte width, and stores that instead.
- The positive delta filter is a delta encoding filter that stores only positive deltas. If it encounters a negative delta, the filter returns an error.
- The WebP filter takes raw colorspace values and converts them to WebP format. This filter supports lossy compression of imaging data.
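The arithmetic behind several of these filters can be illustrated with a minimal pure-Python sketch. It is not TileDB's implementation and does not use the TileDB API: all function names are hypothetical, the float scaling sketch assumes a signed integer target and raises on overflow (the source does not say how out-of-range values are handled), and the double delta sketch omits the fixed-size bit packing.

```python
from typing import List, Sequence, Tuple


def float_scale_encode(values: Sequence[float], scale: float, offset: float,
                       byte_width: int) -> List[int]:
    """Lossy step: round((v - offset) / scale), checked against byte_width.

    Assumption: the target is a signed integer and overflow is an error.
    Note that Python's round() rounds halves to the nearest even integer.
    """
    lo = -(1 << (8 * byte_width - 1))
    hi = (1 << (8 * byte_width - 1)) - 1
    encoded = []
    for v in values:
        q = round((v - offset) / scale)
        if not lo <= q <= hi:
            raise OverflowError(f"{q} does not fit in {byte_width} byte(s)")
        encoded.append(q)
    return encoded


def float_scale_decode(encoded: Sequence[int], scale: float,
                       offset: float) -> List[float]:
    """Approximate inverse: precision lost by rounding is not recovered."""
    return [q * scale + offset for q in encoded]


def positive_delta_encode(values: Sequence[int]) -> List[int]:
    """Keep the first value, then successive differences; reject negatives."""
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        delta = cur - prev
        if delta < 0:
            raise ValueError("negative delta: input is not non-decreasing")
        out.append(delta)
    return out


def double_delta_encode(values: Sequence[int]) -> List[int]:
    """First value, first delta, then the differences between deltas."""
    if len(values) < 2:
        return list(values)
    prev_delta = values[1] - values[0]
    out = [values[0], prev_delta]
    for prev, cur in zip(values[1:], values[2:]):
        delta = cur - prev
        out.append(delta - prev_delta)
        prev_delta = delta
    return out


def dictionary_encode(strings: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Return (unique strings in first-seen order, index of each input)."""
    dictionary: List[str] = []
    index_of = {}
    indexes = []
    for s in strings:
        if s not in index_of:
            index_of[s] = len(dictionary)
            dictionary.append(s)
        indexes.append(index_of[s])
    return dictionary, indexes
```

For regularly spaced input such as timestamps, the double delta output is mostly small values near zero, which is what makes a narrow fixed bit-size workable.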
TileDB does not provide transactional support, as it is a storage engine. It guarantees only that individual reads and writes are atomic. Users can build a transaction manager on top of TileDB for concurrency control. TileDB also supports data versioning, which is not MVCC but can provide some of the functionality of MVCC.
TileDB's data model supports the storage of both dense and sparse arrays.
The data model of TileDB arrays allows it to support any number of dimensions. For dense arrays, the dimension types must be uniform, and they all must be either integer types, datetime types, or time types, which are all internally stored as integer types. TileDB only supports integer type dimensions for dense arrays to allow coordinates to be implicitly defined. For sparse arrays, the dimension types in a domain can be heterogeneous (e.g. they can be float or string), and coordinates are explicitly stored in memory. A set of dimensions for an array is called a domain.
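The difference between implicit and explicit coordinates can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it uses a plain row-major layout and ignores tiling, and the names `dense_cell_position` and `sparse_lookup` are hypothetical rather than part of TileDB.

```python
from typing import List, Optional, Sequence, Tuple


def dense_cell_position(coords: Sequence[int],
                        domain: Sequence[Tuple[int, int]]) -> int:
    """Row-major offset of a cell, given inclusive (lo, hi) bounds per dimension.

    Because the dimensions are integers, the position follows from arithmetic
    alone, so a dense array never needs to store its coordinates.
    """
    position = 0
    for c, (lo, hi) in zip(coords, domain):
        if not lo <= c <= hi:
            raise IndexError(f"coordinate {c} outside [{lo}, {hi}]")
        position = position * (hi - lo + 1) + (c - lo)
    return position


# Dense 4x4 array: values only, coordinates are implied by the position.
domain = [(1, 4), (1, 4)]
dense_values: List[int] = [0] * 16
dense_values[dense_cell_position((2, 3), domain)] = 7

# Sparse array with heterogeneous dimensions (float, string): there is no
# arithmetic mapping, so each non-empty cell carries its coordinates.
sparse_coords = [(1.5, "a"), (2.25, "b")]
sparse_values = [10, 20]


def sparse_lookup(coords: Tuple[float, str]) -> Optional[int]:
    """Linear scan over the explicitly stored coordinates."""
    for stored, value in zip(sparse_coords, sparse_values):
        if stored == coords:
            return value
    return None
```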
An array element is defined by a unique set of dimension values or coordinates, and it is called a cell. In dense arrays, all cells must store exactly one value. In sparse arrays, cells can be empty, store one value, or store multiple values. Each logical cell contains the data from the defined attributes in the array schema. Attributes can have heterogeneous types for both sparse and dense arrays.
TileDB uses an R-tree as an index to implement sparse array slicing. On write, TileDB builds an R-tree index on the non-empty cells of the sparse array. To do this, it groups the coordinates of the non-empty cells into minimum bounding rectangles, then recursively groups these rectangles into a tree structure. On read, TileDB determines which minimum bounding rectangles overlap the query coordinates. Then, it uses parallel processing to collect these rectangles, decompress them, individually check the coordinates of the data collected, and retrieve the attribute data that matches the query.
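The write and read paths described above can be sketched as follows. This is a simplified illustration, not TileDB's code: it assumes two integer dimensions, sorts cells in row-major order before grouping, and leaves out parallelism and decompression. The `capacity` and `fanout` parameters and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[int, int]
Rect = Tuple[int, int, int, int]  # (min_x, min_y, max_x, max_y), inclusive
Cell = Tuple[Point, str]          # (coordinates, attribute value)


@dataclass
class Node:
    mbr: Rect
    children: List["Node"] = field(default_factory=list)  # empty for leaves
    cells: List[Cell] = field(default_factory=list)       # leaf payload only


def bounding_rect(rects: List[Rect]) -> Rect:
    """Smallest rectangle that covers every rectangle in rects."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def overlaps(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def build(cells: List[Cell], capacity: int = 2, fanout: int = 2) -> Node:
    """Write path: group cells into MBRs, then group MBRs level by level."""
    if not cells:
        raise ValueError("cannot index an empty set of cells")
    ordered = sorted(cells, key=lambda cell: cell[0])
    level = []
    for i in range(0, len(ordered), capacity):
        chunk = ordered[i:i + capacity]
        mbr = bounding_rect([(x, y, x, y) for (x, y), _ in chunk])
        level.append(Node(mbr=mbr, cells=chunk))
    while len(level) > 1:
        groups = [level[i:i + fanout] for i in range(0, len(level), fanout)]
        level = [Node(mbr=bounding_rect([n.mbr for n in g]), children=g)
                 for g in groups]
    return level[0]


def query(node: Node, rect: Rect) -> List[Cell]:
    """Read path: prune by MBR overlap, then check each coordinate."""
    if not overlaps(node.mbr, rect):
        return []
    if not node.children:
        return [(p, v) for p, v in node.cells
                if overlaps((p[0], p[1], p[0], p[1]), rect)]
    results: List[Cell] = []
    for child in node.children:
        results.extend(query(child, rect))
    return results


cells = [((1, 1), "a"), ((1, 2), "b"), ((3, 4), "c"),
         ((4, 4), "d"), ((8, 8), "e")]
root = build(cells)
matches = query(root, (1, 2, 3, 4))
```

The per-coordinate check in the leaves is needed because a minimum bounding rectangle can overlap the query while some of the cells inside it fall outside the queried range.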
Decomposition Storage Model (Columnar)
TileDB uses a columnar format: the values of each attribute in an array are stored separately from the values of the other attributes.
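As a rough illustration of what a columnar layout means for attributes, the sketch below splits logical cells into one list per attribute. The attribute names `temp` and `label` and the helper `to_columns` are hypothetical, and in-memory lists stand in for whatever on-disk buffers the engine actually uses.

```python
from typing import Any, Dict, List, Sequence


def to_columns(cells: Sequence[Dict[str, Any]],
               attributes: Sequence[str]) -> Dict[str, List[Any]]:
    """Decompose row-like cells into one value list per attribute."""
    return {name: [cell[name] for cell in cells] for name in attributes}


# Three logical cells, each with two attributes of different types.
cells = [
    {"temp": 21.5, "label": "a"},
    {"temp": 19.0, "label": "b"},
    {"temp": 23.25, "label": "a"},
]
columns = to_columns(cells, ["temp", "label"])

# A query that needs only one attribute touches only that column.
max_temp = max(columns["temp"])
```

Keeping values of the same type together is also what lets type-specific filters, such as dictionary encoding for strings, be applied per attribute.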