Tajo

View Current Viewing Revision #16 from 04/01/2023 9:24 p.m.

OLAP

Apache Tajo was developed inspired by Google's paper "Dremel: Interactive Analysis of Web-Scale Datasets". Dremel is a system that provides a distributed column-oriented storage and column-oriented SQL query engine used to process large amounts of data, and Tajo likewise provides a column-oriented data store and SQL query engine.

Apache Tajo is an open-source big data relational and distributed data warehouse system that provides fault-tolerant analytical processing on large-scale datasets. It is compatible with Apache Hadoop and HDFS and supports SQL standards including complex queries, joins, and aggregations.

Apache Tajo is designed to be scalable and can process massive data sets with tens of thousands of nodes. It supports various file formats, including CSV, TSV, ORC, and Parquet. The Tajo constructs a master-slave cluster with master nodes and worker nodes. The master nodes manage the cluster and coordinate query execution, while the worker nodes perform the actual data processing.

Tajo also supports user-defined functions (UDFs), which allow users to extend the functionality of Tajo with their custom logic. Additionally, Tajo includes a web-based user interface and a command-line interface for managing and querying data. For optimizations, Tajo provides a cost-based optimization model and an expandable rewrite rule. A commercial solution with similar functionality is Cloudera's Impala.

History

2012: Started by Hyunsik Choi and Jihoon Son as a project of Korea University's DB Lab.

2013-03: Developers from Gruter, Korea University, LinkedIn, Nasa, HortonWorks, and Intel participated and adopted it as an incubation project of the Apache Foundation.

2014-03: Became Apache Top-Level Project (TLP)

2019-12: Released latest stable version (Tajo 0.12.0)

2020-09: The project was marked as abandoned and deprecated to the Apache Foundation "attic".

Checkpoints

Not Supported

Tajo's checkpoint functionality relies on HDFS which is fault tolerant with data replication. Tajo only considers fault tolerance with reference to the query execution strategy. Since Tajo aims at Datawarehouse / OLAP queries, It reassigns failed tasks to other workers.

Although not a checkpoint, Tajo provides catalog backup and restore capabilities in the form of SQL dumps and database-level backups.

Compression

Bit Packing / Mostly Encoding

Tajo provides compression according to the data format. Compression only affects the stored data format and it is enabled when a table is created.

text / json / rcfile / sequencefile data format: Classes supported by Hadoop are used for these formats. Hadoop's known compression classes include GZip2Codec, DefaultCodec, GzipCodec, and PassthroughCodec.

pargquet data format: snappy, gzip, and lzo are supported for the parquet data format.

orc data format: snappy and zlib are supported for the orc data format.

Data Model

Relational

Tajo's data model follows the relational data table. Data is organized into tables, where each table is uniquely named with rows and columns which represent a data attribute. The Tajo data model, which is compatible with SQL, allows data to be manipulated and queried using SQL. Tajo supports multiple data formats, such as TEXT, JSON, RCFile, ORC, Sequence, and Parquet files. In addition, database connection methods such as JDBC are supported for linking with external data sources such as HiveMetaStore.

Indexes

AVL-Tree

Tajo supports only one type of index, TWO_LEVEL_BIN_TREE, shortly BST. The BST index is a binary search tree, consisting of two levels of nodes; a leaf node indexes the kyes with the offsets to data stored on HDFS, and a root node indexes the keys with the offsets to the leaf nodes.

The query engine first reads the root node and finds the search key in an index scan. If it successfully finds the leaf node corresponding to the search key, it finds the search key on that leaf node and reads the tuple directly from HDFS. Users can create an index using SQL.

Joins

Hash Join Broadcast Join

Tajo supports various join strategies used in shared-nothing databases (or Apache Hive). There are two types of Join: Broadcast Join and Reparition Join (hash and range). Tajo requires two phases and can mix various join algorithms. In First Phase, Tajo scans the data set and filters by selection push-down. The scanned result is hashed or range repartitioned. In the Second Phase, a hash join or a merge join in case of a range partition is executed.

If the larger table is sorted on a joining key, Tajo implements a decentralized join strategy. Smaller tables are repartitioned via range repartition first. Then, Tajo assigns the range partitions to nodes whose large table corresponds to the join the key range. As the last step, each node performs the merge join.

Parallel Execution

Intra-Operator (Horizontal)

Tajo parallelizes requests from clients in the form of a distributed system. Tajo supports distributed execution with a master-worker structure. TajoMaster serves multiple clients and assigns queries to the QueryMaster. When a query is assigned to the query master, it is reconstructed in the form of multiple TaskRunners, delivered to the nodes of the distributed system, and executed.

Query Compilation

JIT Compilation

According to Hadoop Summit 2014, A JIT-based vectorization engine is introduced, and the JIT is used to generate bytecode for vectorization primitives at runtime.

Query Execution

Vectorized Model

According to Hadoop Summit 2014, the previous version of Tajo used a tuple-at-a-time model with a simple interface and all arbitrary operator combinations, but it suffered performance degradation due to creating too many function calls and branches and low data/instruction cache hits, so it introduced a JIT-based vectorization engine.

Query Interface

SQL Command-line / Shell

Tajo supports SQL standards.

Storage Architecture

Disk-oriented

Storage Format

Apache Parquet Apache ORC SequenceFile

Tajo provides a split tool to split an input data set into multiple fragments. In addition, Scanner and Appender interfaces are provided to users to access specialized data structures.

Tajo provides various row/columnar store file formats, such as CSVFIle, RowFile, RCFile, and Trevni. Tajo supports saving according to the file format by providing a wrapper for each file format.

Storage Model

Custom

System Architecture

Shared-Disk

The architecture of Tajo follows the master-worker model and employs Hadoop Yarn as a resource manager for large clusters. TajoMaster dedicated server for providing client service and coordinating QueryMasters. For each query, Tajo deploys one QueryMaster and several TaskRunners together. TaskRunner includes a local query engine that executes a directed acyclic graph (DAG) of physical operators.

Revision #16 | Updated 04/01/2023 9:24 p.m.