Apache Tajo was inspired by Google's paper "Dremel: Interactive Analysis of Web-Scale Datasets". Dremel is a distributed system that combines column-oriented storage with a column-oriented SQL query engine to process large amounts of data, and Tajo likewise provides a column-oriented data store and SQL query engine.

Apache Tajo is an open-source, distributed relational data warehouse system for big data that provides fault-tolerant analytical processing of large-scale datasets. It is compatible with Apache Hadoop and HDFS and supports standard SQL, including complex queries, joins, and aggregations.
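As a sketch of the kind of standard SQL Tajo accepts, consider the query below; the table and column names are hypothetical and purely illustrative:

```sql
-- Hypothetical tables stored on HDFS; Tajo runs standard joins and aggregations over them.
SELECT c.region,
       COUNT(*)      AS order_cnt,
       SUM(o.amount) AS total
FROM   orders o
JOIN   customers c ON o.customer_id = c.id
WHERE  o.order_date >= DATE '2014-01-01'
GROUP  BY c.region
ORDER  BY total DESC;
```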

Apache Tajo is designed to be scalable and can process massive datasets across tens of thousands of nodes. It supports various file formats, including CSV, TSV, ORC, and Parquet. Tajo forms a master-worker cluster: the master nodes manage the cluster and coordinate query execution, while the worker nodes perform the actual data processing.

Tajo also supports user-defined functions (UDFs), which allow users to extend Tajo with their own custom logic. Additionally, Tajo includes a web-based user interface and a command-line interface for managing and querying data. For query optimization, Tajo provides a cost-based optimizer and an extensible set of rewrite rules. A commercial solution with similar functionality is Cloudera's Impala.


2012: Started by Hyunsik Choi and Jihoon Son as a project of Korea University's DB Lab.

2013-03: Developers from Gruter, Korea University, LinkedIn, NASA, Hortonworks, and Intel joined, and the project was accepted into the Apache Incubator.

2014-03: Became Apache Top-Level Project (TLP)

2019-12: Released the latest stable version (Tajo 0.12.0)

2020-09: The project was marked as abandoned and retired to the Apache Attic.


Bit Packing / Mostly Encoding

Tajo provides compression according to the data format. Compression affects only the stored data and is enabled when a table is created.

text / json / rcfile / sequencefile data formats: Hadoop's compression codec classes are used for these formats. Well-known Hadoop codecs include BZip2Codec, DefaultCodec, GzipCodec, and PassthroughCodec.

parquet data format: snappy, gzip, and lzo are supported.

orc data format: snappy and zlib are supported.
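Because compression is chosen at table-creation time, it is specified through table properties in the DDL. A minimal sketch follows: the table names and columns are hypothetical, and the property keys are based on Tajo's documented `WITH (...)` table-property syntax, which may vary between versions:

```sql
-- Text format: select a Hadoop codec class via the compression.codec property.
CREATE TABLE logs_text (ts TIMESTAMP, msg TEXT)
  USING TEXT WITH ('text.delimiter' = '|',
                   'compression.codec' = 'org.apache.hadoop.io.compress.GzipCodec');

-- Parquet format: snappy, gzip, or lzo.
CREATE TABLE logs_parquet (ts TIMESTAMP, msg TEXT)
  USING PARQUET WITH ('parquet.compression' = 'SNAPPY');

-- ORC format: snappy or zlib.
CREATE TABLE logs_orc (ts TIMESTAMP, msg TEXT)
  USING ORC WITH ('orc.compression.kind' = 'snappy');
```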

Data Model


Storage Architecture


System Architecture


Tajo Logo


Source Code


Tech Docs





Hyunsik Choi and Jihoon Son, who were members of Korea University's DB Laboratory

Country of Origin


Start Year


End Year


Acquired By


Project Type

Open Source

Written in


Supported languages

Bash, Java, Perl, Python, Ruby

Compatible With


Operating Systems

All OS with Java VM


Apache v2