Kylin

Revision #9 from 03/18/2019 11:36 a.m.

Kylin is an open-source, distributed data analytics engine built on top of Hadoop/Spark. It offers a SQL interface for OLAP on large datasets.

Unlike massively parallel processing engines such as Hive and Presto, Kylin pre-calculates a set of data cubes, stores them in HBase, and answers queries by looking up results directly in those cubes. If a query cannot be answered from the data cubes, it is executed by the underlying processing engine. Kylin is therefore usually used as an accelerator for traditional parallel data processing engines.
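The lookup-then-fallback behaviour described above can be illustrated with a minimal sketch. It is not Kylin's implementation: the sample rows, the single materialized cube, and the function names (build_cube, scan, query_sum_sales) are all assumptions made for illustration, and an in-memory dict stands in for HBase.

```python
from collections import defaultdict

# Hypothetical raw fact rows: (year, city, sales).
COLUMNS = ("year", "city", "sales")
ROWS = [
    (1994, "Beijing", 60),
    (1994, "Beijing", 40),
    (1994, "Shanghai", 70),
    (1995, "Beijing", 30),
]


def build_cube(rows, dims):
    """Pre-aggregate sum(sales) grouped by the given dimension columns."""
    idx = [COLUMNS.index(d) for d in dims]
    sales_idx = COLUMNS.index("sales")
    cube = defaultdict(int)
    for row in rows:
        cube[tuple(row[i] for i in idx)] += row[sales_idx]
    return dict(cube)


# Only one cube is materialized in this sketch: sum(sales) by (year, city).
CUBES = {("year", "city"): build_cube(ROWS, ("year", "city"))}


def scan(rows, filters):
    """Fallback path: aggregate by scanning the raw rows."""
    total = 0
    for row in rows:
        record = dict(zip(COLUMNS, row))
        if all(record[k] == v for k, v in filters.items()):
            total += record["sales"]
    return total


def query_sum_sales(filters):
    """Answer sum(sales) for equality filters; return (result, path_used)."""
    dims = tuple(sorted(filters, key=COLUMNS.index))
    cube = CUBES.get(dims)
    if cube is not None:
        # Fast path: a single lookup in the pre-computed cube.
        return cube.get(tuple(filters[d] for d in dims), 0), "cube"
    # Slow path: no matching cube, so scan the underlying data.
    return scan(ROWS, filters), "scan"
```

A query filtering on both year and city is served from the cube, while a query filtering on city alone falls back to the scan in this sketch because no cube with exactly that dimension set was built.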

History

The Kylin project was started in 2013 at eBay's R&D in Shanghai, China. It was open-sourced on GitHub as "KylinOLAP" in Oct 2014.

In Nov 2014, Kylin joined the Apache Software Foundation incubator.

In Dec 2015, Apache Kylin became a Top Level Project.

Query Execution

Tuple-at-a-Time Model

Kylin uses Apache Calcite to parse SQL queries and to generate and optimize execution plans.

Indexes

Not Supported

Foreign Keys

Supported

Kylin supports star and snowflake schemas. Users must specify the fact tables and lookup tables before building cubes.

Data Model

Key/Value

Data cubes are essentially HBase tables. Given a set of dimension columns, Kylin pre-aggregates all possible combinations of their attributes with MapReduce jobs, then encodes the dimension values with dictionary encoding. Finally, Kylin encodes each data cube into Rowkeys in HBase. The format of a Rowkey is cuboid id + encoded attribute values. For example, assume a data cube on year and city with cuboid id 00000001, a row with year=1994, city=Beijing, sum(sales)=100, and a dictionary that maps 1994 to 0 and Beijing to 1. The HBase table will then contain an entry with Rowkey=00000001+01 and value=100.
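The encoding in this example can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names (all_cuboids, build_cuboid) are hypothetical, a Python dict stands in for the HBase table, each dictionary code is written as a single decimal digit, and the cuboid id and dictionaries are passed in rather than generated the way Kylin generates them.

```python
from collections import defaultdict
from itertools import combinations


def all_cuboids(dims):
    """Every non-empty combination of dimension columns, largest first."""
    return [
        combo
        for size in range(len(dims), 0, -1)
        for combo in combinations(dims, size)
    ]


def build_cuboid(rows, dims, measure, cuboid_id, dictionaries):
    """Pre-aggregate sum(measure) by dims and key each result by rowkey."""
    table = defaultdict(int)  # stands in for the HBase table
    for row in rows:
        # Replace each dimension value with its dictionary code.
        encoded = "".join(str(dictionaries[d][row[d]]) for d in dims)
        # The "+" only mirrors the notation used in the text.
        table[f"{cuboid_id}+{encoded}"] += row[measure]
    return dict(table)


# Two raw rows that together give sum(sales)=100 for (1994, Beijing).
rows = [
    {"year": 1994, "city": "Beijing", "sales": 60},
    {"year": 1994, "city": "Beijing", "sales": 40},
]
dictionaries = {"year": {1994: 0}, "city": {"Beijing": 1}}

print(all_cuboids(("year", "city")))
print(build_cuboid(rows, ("year", "city"), "sales", "00000001", dictionaries))
```

The first call lists the dimension combinations that would each be pre-aggregated; the second reproduces the single entry from the example, with Rowkey 00000001+01 and value 100.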

Kylin applies dictionary encoding to all dimension values in data cubes. Kylin's dictionary is order-preserving and supports lookups in both directions, from keys to values and from values to keys. The dictionary is implemented as a radix tree.
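The two properties named here, order preservation and bidirectional lookup, can be shown with a small sketch. Note that it deliberately swaps in a sorted list with binary search in place of a radix tree, so it illustrates the behaviour rather than Kylin's data structure; the class name and method names are hypothetical.

```python
from bisect import bisect_left


class OrderPreservingDictionary:
    """Maps values to integer IDs so that value order equals ID order."""

    def __init__(self, values):
        # Sorting the distinct values makes the list index an ordered ID.
        self._values = sorted(set(values))

    def id_of(self, value):
        """Value -> ID via binary search; raises KeyError if absent."""
        i = bisect_left(self._values, value)
        if i == len(self._values) or self._values[i] != value:
            raise KeyError(value)
        return i

    def value_of(self, id_):
        """ID -> value by direct indexing; raises KeyError if out of range."""
        if not 0 <= id_ < len(self._values):
            raise KeyError(id_)
        return self._values[id_]


cities = OrderPreservingDictionary(["Shanghai", "Beijing", "Chengdu", "Beijing"])
```

Because IDs follow the sort order of the values, a range predicate on the original values (for example, every city that sorts before "Chengdu") can be evaluated as a range predicate on the encoded IDs without decoding them.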

In addition, Kylin supports naive compression in HBase and Hive.