Kylin

Kylin is an open source distributed data analytics engine built on top of SparkSQL/Hive. It offers a SQL interface for OLAP on large datasets. Kylin is not a replacement for a massively parallel processing engine like Hive or Presto. It runs on top of these systems as a query accelerator.

Kylin pre-calculates a set of data cubes, stores them in HBase, and answers incoming queries by looking up the results directly in those cubes. If a query cannot be answered from the data cubes, it is executed by the underlying processing engine (such as Hive). The data cubes are built when the dataset is imported. The user is responsible for specifying which data cubes should be built.
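The pre-calculate-then-look-up flow can be illustrated with a minimal Python sketch. It is not Kylin's implementation: the rows, the function names `build_cuboids` and `answer`, and the in-memory dictionaries standing in for HBase and Hive are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows; in Kylin these would come from the imported dataset.
ROWS = [
    {"year": 1994, "city": "Beijing", "sales": 60},
    {"year": 1994, "city": "Beijing", "sales": 40},
    {"year": 1994, "city": "Shanghai", "sales": 70},
    {"year": 1995, "city": "Beijing", "sales": 30},
]


def aggregate(rows, dims, measure):
    """Compute sum(measure) grouped by the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)


def build_cuboids(rows, dimensions, measure):
    """Pre-aggregate one cuboid per subset of the chosen dimensions."""
    cuboids = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cuboids[dims] = aggregate(rows, dims, measure)
    return cuboids


def answer(cuboids, rows, group_by, measure):
    """Serve a query from a cuboid if one exists, else scan the raw rows."""
    dims = tuple(group_by)  # assumed to follow the declared dimension order
    if dims in cuboids:
        return cuboids[dims], "cube"
    return aggregate(rows, dims, measure), "scan"


CUBOIDS = build_cuboids(ROWS, ("year", "city"), "sales")
```

With two dimensions the sketch builds four cuboids (no dimension, year, city, and year plus city). A query grouped by `("year",)` is answered from the pre-built cuboid, while a query against a cube that was built without the requested dimension falls through to the scan path, mirroring the fallback to the underlying engine.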

History

The Kylin project was started in 2013 at eBay's R&D in Shanghai, China. It was open sourced on GitHub as "KylinOLAP" in Oct 2014.

In Nov 2014, Kylin joined the Apache Software Foundation incubator.

In Dec 2015, Apache Kylin became a Top Level Project.

Compression

Dictionary Encoding

Kylin applies dictionary encoding to all dimension values in data cubes. Kylin's dictionary is order-preserving and supports mapping both from keys to values and vice versa. The dictionary is implemented as a radix tree. Each node in the radix tree also contains the size of its subtree to support mapping values back to keys.
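The idea of an order-preserving dictionary backed by a tree with subtree sizes can be sketched as follows. This is a simplified illustration, not Kylin's code: it uses a plain character trie rather than a compressed radix tree, and the class and method names (`TrieNode`, `TrieDictionary`, `encode`, `decode`) are hypothetical. The integer assigned to each string is its rank in sorted order, so comparisons on the encoded integers agree with comparisons on the original strings.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # char -> TrieNode
        self.is_value = False  # True if a dictionary entry ends at this node
        self.size = 0          # number of entries stored in this subtree


class TrieDictionary:
    """Order-preserving dictionary over a fixed set of strings."""

    def __init__(self, values):
        self.root = TrieNode()
        for value in set(values):  # duplicates are stored once
            node = self.root
            node.size += 1
            for ch in value:
                node = node.children.setdefault(ch, TrieNode())
                node.size += 1
            node.is_value = True

    def encode(self, value):
        """Return the rank of value among all entries in sorted order."""
        node, rank = self.root, 0
        for ch in value:
            if node.is_value:
                rank += 1  # a proper prefix sorts before the longer string
            for other in sorted(node.children):
                if other >= ch:
                    break
                rank += node.children[other].size  # skip a whole subtree
            node = node.children[ch]  # raises KeyError for unknown strings
        if not node.is_value:
            raise KeyError(value)
        return rank

    def decode(self, code):
        """Return the entry whose rank is code, using the subtree sizes."""
        if not 0 <= code < self.root.size:
            raise KeyError(code)
        node, chars = self.root, []
        while True:
            if node.is_value:
                if code == 0:
                    return "".join(chars)
                code -= 1
            for ch in sorted(node.children):
                child = node.children[ch]
                if code < child.size:
                    chars.append(ch)
                    node = child
                    break
                code -= child.size
```

The subtree sizes are what make the reverse mapping cheap: `decode` can skip an entire branch by subtracting its size instead of enumerating the entries inside it.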

Kylin can also use the general-purpose compression algorithms available in HBase and Hive.

Data Model

Key/Value

Data cubes are stored as HBase tables. Given a set of dimension columns, Kylin pre-aggregates all possible combinations of their attributes using MapReduce jobs, then encodes the dimensions with dictionary encoding. Finally, Kylin encodes all data cubes as Rowkeys in HBase. The format of a Rowkey is cuboid id + attribute. For example, assume a data cube on year and city with cuboid id 00000001, a row with year=1994, city=Beijing, sum(sales)=100, and a dictionary that maps 1994 to 0 and Beijing to 1. The HBase table will then contain an entry with Rowkey=00000001+01 and value=100.
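The example above can be reproduced with a short Python sketch. The function name `encode_rowkey`, the use of strings for the key, and the one-digit-per-dimension layout are assumptions chosen to match the example; actual HBase rowkeys are byte arrays.

```python
def encode_rowkey(cuboid_id, row, dimensions, dictionaries):
    """Concatenate the cuboid id with the dictionary codes of each dimension."""
    codes = "".join(str(dictionaries[dim][row[dim]]) for dim in dimensions)
    return cuboid_id + codes


# Dictionary from the example: 1994 -> 0, Beijing -> 1.
DICTIONARIES = {"year": {1994: 0}, "city": {"Beijing": 1}}
ROW = {"year": 1994, "city": "Beijing"}

ROWKEY = encode_rowkey("00000001", ROW, ("year", "city"), DICTIONARIES)

# A plain dict standing in for the HBase table: rowkey -> sum(sales).
HBASE_TABLE = {ROWKEY: 100}
```

Because the cuboid id is the key prefix, all rows of one cuboid sit next to each other in HBase's sorted key space, so a query that maps to a single cuboid can be served by a scan over one contiguous key range.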

Foreign Keys

Supported

Kylin supports star schema and snowflake schema. A user needs to specify fact tables and lookup tables before building cubes. Kylin pre-joins the tables when building data cubes.
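A minimal sketch of the pre-join step, assuming one fact table and one lookup table linked by a foreign key. The table contents, the column names, and the `pre_join` helper are hypothetical, and the sketch uses an inner join for simplicity.

```python
# Hypothetical star schema: one fact table and one lookup (dimension) table.
FACT_SALES = [
    {"city_id": 1, "year": 1994, "sales": 100},
    {"city_id": 2, "year": 1994, "sales": 70},
    {"city_id": 9, "year": 1995, "sales": 5},  # no matching lookup row
]
LOOKUP_CITY = [
    {"city_id": 1, "city": "Beijing"},
    {"city_id": 2, "city": "Shanghai"},
]


def pre_join(fact_rows, lookup_rows, key):
    """Flatten fact and lookup rows into one wide table via the foreign key."""
    index = {row[key]: row for row in lookup_rows}
    flat = []
    for fact in fact_rows:
        match = index.get(fact[key])
        if match is not None:  # inner join: unmatched fact rows are dropped
            flat.append({**fact, **match})
    return flat


FLAT_TABLE = pre_join(FACT_SALES, LOOKUP_CITY, "city_id")
```

The resulting flat table carries the lookup attributes (here `city`) alongside the measures, which is the shape cube building needs in order to aggregate by those attributes.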

Joins

Hash Join, Sort-Merge Join

During the cube building phase, Kylin uses Hive to pre-join the fact table and lookup tables.

At query time, table joins are supported through Apache Calcite. Calcite decomposes the join operator into several single-table lookup operators, each of which is completed by Kylin.

Query Compilation

Code Generation

Apache Calcite does code generation for SQL queries.

Query Execution

Tuple-at-a-Time Model

Kylin relies on HBase to execute queries. HBase is a tuple-at-a-time execution engine.

Query Interface

SQL

Kylin supports a subset of Apache Calcite's supported queries. Since Kylin is a pure OLAP engine, it only supports SELECT queries. INSERT, UPDATE and DELETE are not supported.

Storage Architecture

Disk-oriented

Kylin stores the data cubes in HBase, and stores metadata in HBase or MySQL (MySQL metastore is still under test).

Storage Model

Decomposition Storage Model (Columnar)

Kylin stores its data in HBase, which is a column-family system.

System Architecture

Shared-Disk

Kylin stores data cubes in HBase, which in turn stores its data in HDFS. The raw tables remain in the underlying system.

Revision #20 | Updated 01/03/2022 11:06 p.m.
