Kudu

Revision #3 from 12/12/2018 8:04 p.m.

Apache Kudu is an open source storage engine for structured data that is part of the Apache Hadoop ecosystem. Its primary goal is to let applications run fast analytics on rapidly changing data, and it was designed for fast OLAP query performance. As part of the Hadoop ecosystem, Kudu tables can be processed with frameworks such as Apache Spark, Apache Impala or MapReduce, and they can be joined with data held in other Hadoop storage systems such as HBase and HDFS. To build a Kudu application, developers can use the Java, C++ or Python client APIs, which support NoSQL-style access, or SQL-style frameworks such as Apache Impala.

History

Prior to Kudu, most data storage engines handled only one kind of structured data well: static or mutable. Storage engines for static data could not change individual records, while storage engines for mutable data had low throughput for sequential reads. As a result, developers typically used two storage engines, one for mutating their data and another for running analytics on it. Apache Kudu was designed to support both workloads and to provide high throughput for both sequential-access and random-access queries. Kudu was developed as an internal project at Cloudera and was first released publicly as open source in September 2015; version 1.0 followed in September 2016.

Foreign Keys

Not Supported

As of now, Kudu does not support foreign keys.

Storage Model

Decomposition Storage Model (Columnar)

Indexes

Not Supported

Currently, Kudu does not support any secondary indexes; the only index is the one on the primary key. As an alternative, Kudu can hash-partition or range-partition the data for quicker access.
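The difference between the two partitioning schemes can be illustrated with a minimal sketch. This is not Kudu's implementation: the function names `hash_bucket` and `range_bucket`, the use of CRC32 as the hash, and the split points are all assumptions chosen for illustration.

```python
import bisect
import zlib
from typing import Sequence


def hash_bucket(key: str, num_buckets: int) -> int:
    """Assign a key to one of num_buckets buckets using a stable hash.

    CRC32 is an illustrative choice (an assumption, not Kudu's hash);
    it is deterministic across runs, unlike Python's built-in hash().
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def range_bucket(key: int, split_points: Sequence[int]) -> int:
    """Assign a key to a range partition defined by sorted split points.

    With split points [100, 200]:
      partition 0 holds keys < 100
      partition 1 holds keys in [100, 200)
      partition 2 holds keys >= 200
    """
    return bisect.bisect_right(split_points, key)


# Hypothetical split points for an integer key column.
splits = [100, 200]
print(range_bucket(50, splits))   # 0
print(range_bucket(150, splits))  # 1
print(range_bucket(200, splits))  # 2
```

Hash partitioning spreads keys evenly and helps avoid write hot spots, while range partitioning keeps neighboring keys together, so a scan over a key range only has to touch the partitions that overlap it.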

Storage Architecture

Disk-oriented

Although an experimental version of Kudu can use persistent memory for its block cache, Kudu is primarily disk-oriented.

Checkpoints

Not Supported

Compression

Dictionary Encoding, Run-Length Encoding, Bit Packing / Mostly Encoding, Prefix Compression

Each column in a Kudu table can be encoded in one of several ways, depending on the column's type. By default, integer, float and double columns use bit packing (Kudu's bitshuffle encoding), boolean columns use run-length encoding, and string and binary columns use dictionary encoding. Apart from the encoding, Kudu does not compress columns by default, but it supports per-column compression with the LZ4, Snappy or zlib codecs.
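To make two of these encodings concrete, here is a toy sketch of run-length encoding and dictionary encoding over a single column of values. The function names are hypothetical and the in-memory representation is an assumption; Kudu's on-disk formats are more involved.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values):
    """Replace each value with a small integer code.

    Returns (dictionary, codes), where dictionary[i] is the value
    that code i stands for, in order of first appearance.
    """
    code_of = {}
    codes = []
    for value in values:
        # A value seen for the first time gets the next unused code.
        codes.append(code_of.setdefault(value, len(code_of)))
    return list(code_of), codes


# A boolean column with long runs suits run-length encoding.
print(run_length_encode([True, True, True, False, False]))
# [(True, 3), (False, 2)]

# A string column with few distinct values suits dictionary encoding.
print(dictionary_encode(["US", "DE", "US", "US"]))
# (['US', 'DE'], [0, 1, 0, 0])
```

Both encodings pay off when a column has few distinct values, which is common in columnar storage because all values of one column are stored together.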

Data Model

Relational

Kudu uses a relational data model. Unlike traditional relational databases, Kudu also partitions each table into tablets that are stored on individual servers. All rows within a tablet are ordered by primary key.
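Keeping rows ordered by primary key is what makes key-range scans inside a tablet cheap. The following is a minimal in-memory sketch of that idea only; the `Tablet` class and its methods are hypothetical and do not reflect Kudu's actual storage layout.

```python
import bisect


class Tablet:
    """Toy tablet that keeps its rows ordered by primary key."""

    def __init__(self):
        self._keys = []   # primary keys, always kept sorted
        self._rows = {}   # primary key -> row payload

    def insert(self, key, row):
        if key in self._rows:
            raise KeyError(f"duplicate primary key: {key!r}")
        bisect.insort(self._keys, key)  # insert at the sorted position
        self._rows[key] = row

    def scan(self, lower, upper):
        """Return rows with lower <= key < upper, in key order."""
        start = bisect.bisect_left(self._keys, lower)
        end = bisect.bisect_left(self._keys, upper)
        return [(k, self._rows[k]) for k in self._keys[start:end]]


tablet = Tablet()
for key in (30, 10, 20):
    tablet.insert(key, {"id": key})

print(tablet.scan(10, 30))
# [(10, {'id': 10}), (20, {'id': 20})]
```

Because the keys are sorted, the scan locates both ends of the range with a binary search and then reads one contiguous slice.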

Concurrency Control

Multi-version Concurrency Control (MVCC)

Kudu employs MVCC. It uses an optimistic concurrency model in which readers do not block writers and writers do not block readers. As a result, fewer lock acquisitions are needed during large table scans.
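The core idea of MVCC, that a reader sees a consistent snapshot while writers keep adding newer versions, can be sketched as follows. This is a single-threaded toy under stated assumptions: the `MvccStore` class, its logical clock, and its method names are hypothetical and are not Kudu's implementation.

```python
class MvccStore:
    """Toy multi-version store: each write adds a new timestamped version."""

    def __init__(self):
        self._clock = 0       # logical commit timestamp (an assumption)
        self._versions = {}   # key -> list of (commit_ts, value), ascending

    def write(self, key, value):
        """Append a new version instead of overwriting the old one."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self):
        """Return a timestamp that pins the reader's view of the data."""
        return self._clock

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts."""
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None


store = MvccStore()
store.write("row1", "v1")
snap = store.snapshot()            # a long scan starts here
store.write("row1", "v2")          # a later write does not disturb it

print(store.read("row1", snap))              # v1
print(store.read("row1", store.snapshot()))  # v2
```

The reader never needs a lock on `row1`: it simply ignores versions newer than its snapshot, which is why large scans can proceed alongside writes.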
