Tajo

View Current Viewing Revision #12 from 04/01/2023 8:05 p.m.

OLAP

Apache Tajo was developed inspired by Google's paper "Dremel: Interactive Analysis of Web-Scale Datasets". Dremel is a system that provides a distributed column-oriented storage and column-oriented SQL query engine used to process large amounts of data, and Tajo likewise provides a column-oriented data store and SQL query engine.

Apache Tajo is an open-source big data relational and distributed data warehouse system that provides fault-tolerant analytical processing on large-scale datasets. It is compatible with Apache Hadoop and HDFS and supports SQL standards including complex queries, joins, and aggregations.

Apache Tajo is designed to be scalable and can process massive data sets with tens of thousands of nodes. It supports various file formats, including CSV, TSV, ORC, and Parquet. The Tajo constructs a master-slave cluster with master nodes and worker nodes. The master nodes manage the cluster and coordinate query execution, while the worker nodes perform the actual data processing.

Tajo also supports user-defined functions (UDFs), which allow users to extend the functionality of Tajo with their custom logic. Additionally, Tajo includes a web-based user interface and a command-line interface for managing and querying data. For optimizations, Tajo provides a cost-based optimization model and an expandable rewrite rule. A commercial solution with similar functionality is Cloudera's Impala.

History

2012: Started by Hyunsik Choi and Jihoon Son as a project of Korea University's DB Lab.

2013-03: Developers from Gruter, Korea University, LinkedIn, Nasa, HortonWorks, and Intel participated and adopted it as an incubation project of the Apache Foundation.

2014-03: Became Apache Top-Level Project (TLP)

2019-12: Released latest stable version (Tajo 0.12.0)

2020-09: The project was marked as abandoned and deprecated to the Apache Foundation "attic".

Compression

Bit Packing / Mostly Encoding

Tajo provides compression according to the data format. Compression only affects the stored data format and it is enabled when a table is created.

text / json / rcfile / sequencefile data format: Classes supported by Hadoop are used for these formats. Hadoop's known compression classes include GZip2Codec, DefaultCodec, GzipCodec, and PassthroughCodec.

pargquet data format: snappy, gzip, and lzo are supported for the parquet data format.

orc data format: snappy and zlib are supported for the orc data format.

Data Model

Relational

Tajo's data model follows the relational data table. Data is organized into tables, where each table is uniquely named with rows and columns which represent a data attribute. The Tajo data model, which is compatible with SQL, allows data to be manipulated and queried using SQL. Tajo supports multiple data formats, such as TEXT, JSON, RCFile, ORC, Sequence, and Parquet files. In addition, database connection methods such as JDBC are supported for linking with external data sources such as HiveMetaStore.

Joins

Hash Join Broadcast Join

Tajo supports various join strategies used in shared-nothing databases (or Apache Hive). There are two types of Join: Broadcast Join and Reparition Join (hash and range). Tajo requires two phases and can mix various join algorithms. In First Phase, Tajo scans the data set and filters by selection push-down. The scanned result is hashed or range repartitioned. In the Second Phase, a hash join or a merge join in case of a range partition is executed.

If the larger table is sorted on a joining key, Tajo implements a decentralized join strategy. Smaller tables are repartitioned via range repartition first. Then, Tajo assigns the range partitions to nodes whose large table corresponds to the join the key range. As the last step, each node performs the merge join.

Parallel Execution

Intra-Operator (Horizontal)

Tajo parallelizes requests from clients in the form of a distributed system. Tajo supports distributed execution with a master-worker structure. TajoMaster serves multiple clients and assigns queries to the QueryMaster. When a query is assigned to the query master, it is reconstructed in the form of multiple TaskRunners, delivered to the nodes of the distributed system, and executed.

Query Execution

Vectorized Model

Tajo does distributed query execution.

Tajo implements a Distributed Query Execution Plan (DQEP). DQEP is a directed acyclic graph (DAG) of execution blocks.

Tajo provides a split tool to split an input data set into multiple fragments. In addition, Scanner and Appender interfaces are provided to users to access specialized data structures.

Tajo provides various row/columnar store file formats, such as CSVFIle, RowFile, RCFile, and Trevni. Tajo supports saving according to the file format by providing a wrapper for each file format.

Storage Model

Custom

System Architecture

Shared-Disk

The architecture of Tajo follows the master-worker model and employs Hadoop Yarn as a resource manager for large clusters. TajoMaster dedicated server for providing client service and coordinating QueryMasters. For each query, Tajo deploys one QueryMaster and several TaskRunners together. TaskRunner includes a local query engine that executes a directed acyclic graph (DAG) of physical operators.

Revision #12 | Updated 04/01/2023 8:05 p.m.