Apache Tajo is an open-source, distributed relational data warehouse system for big data that provides fault-tolerant analytical processing on large-scale datasets. Its development was inspired by BigQuery. "Tajo" is the Korean word for ostrich; an ostrich has long legs, and the Korean word for "leg" is a homonym of "bridge". The name thus represents the role that the database plays in connecting data sources with analysis tools. Apache Tajo provides distributed column-oriented storage and a column-oriented SQL query engine for processing large amounts of data. It is compatible with Apache Hadoop and HDFS and supports standard SQL, including complex queries, joins, and aggregations.
Apache Tajo is designed to be scalable and can process massive data sets on clusters of tens of thousands of nodes. It supports various file formats, including CSV, TSV, ORC, and Parquet. A Tajo cluster follows a master-worker layout with master nodes and worker nodes. The master nodes manage the cluster and coordinate query execution, while the worker nodes perform the actual data processing.
Tajo also supports user-defined functions (UDFs), which allow users to extend Tajo with their own custom logic. Additionally, Tajo includes a web-based user interface and a command-line interface for managing and querying data. For query optimization, Tajo provides a cost-based optimization model and extensible rewrite rules.
Apache Tajo was started in 2012 by Hyunsik Choi and Jihoon Son as a project of Korea University's DB Lab. Developers from Gruter, Korea University, LinkedIn, NASA, Hortonworks, and Intel participated, and it was accepted as an incubator project of the Apache Software Foundation in March 2013. A year later, in March 2014, Tajo became an Apache Top-Level Project (TLP). However, the latest stable version (Tajo 0.12.0) was released in December 2019, and the project was marked as abandoned and deprecated when it was moved to the Apache "attic" in September 2020.
Tajo's checkpoint functionality relies on HDFS, which is fault-tolerant through data replication. Tajo itself addresses fault tolerance only in its query execution strategy: since it targets OLAP queries, it reassigns failed tasks to other workers.
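The reassignment of failed tasks described above can be sketched as a small retry loop. This is only an illustration of the idea, not Tajo's scheduler: the function `run_with_reassignment`, the `execute` callback, and the `max_attempts` limit are all hypothetical names introduced here.

```python
from collections import deque


def run_with_reassignment(tasks, workers, execute, max_attempts=3):
    """Run each task, moving it to another worker whenever an attempt fails.

    `execute(worker, task)` is a hypothetical callback that returns a result
    or raises an exception when the worker fails.
    """
    results = {}
    # Each queue entry is (task, attempt number, workers already tried).
    pending = deque((task, 1, frozenset()) for task in tasks)
    while pending:
        task, attempt, tried = pending.popleft()
        # Prefer a worker that has not failed this task yet.
        candidates = [w for w in workers if w not in tried] or list(workers)
        worker = candidates[0]
        try:
            results[task] = execute(worker, task)
        except Exception:
            if attempt >= max_attempts:
                raise RuntimeError(f"task {task!r} failed {attempt} times")
            pending.append((task, attempt + 1, tried | {worker}))
    return results
```

In this sketch a failed task simply goes back on the queue and is retried on a worker that has not failed it before; completed tasks are never rerun, which matches the idea of recovering at task granularity rather than restarting the whole query.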
Although not a checkpoint, Tajo provides catalog backup and restore capabilities in the form of SQL dumps and database-level backups.
Tajo provides compression options that depend on the data format. Compression affects only how the data is stored, and it is enabled when a table is created.
text / json / rcfile / sequencefile data formats: codec classes include BZip2Codec, DefaultCodec, GzipCodec, and PassthroughCodec.
parquet data format: snappy, gzip, and lzo are supported.
orc data format: snappy and zlib are supported.
Tajo organizes data into tables. Each table has a unique name and consists of rows and columns, where each column represents a data attribute. The Tajo data model is compatible with SQL, so data can be manipulated and queried using SQL.
Tajo supports only one type of index, TWO_LEVEL_BIN_TREE (BST). The BST index is a binary search tree consisting of two levels of nodes: a leaf node indexes the keys with the offsets to data stored on HDFS, and a root node indexes the keys with the offsets to the leaf nodes.
In an index scan, the query engine first reads the root node to locate the leaf node that corresponds to the search key. If it finds that leaf node, it looks up the search key in the leaf and reads the tuple directly from HDFS. Users can create an index using SQL.
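The two-step lookup can be sketched with in-memory lists. This is a minimal illustration under simplifying assumptions, not Tajo's on-disk layout: the function names and the `leaf_size` parameter are hypothetical, and a leaf's position in a list stands in for the leaf offset that a real index would store.

```python
from bisect import bisect_right


def build_two_level_index(sorted_entries, leaf_size=2):
    """Build (root, leaves) from (key, offset) pairs sorted by key.

    Each leaf is a list of (key, data_offset) pairs. The root holds the
    first key of every leaf; a leaf's position in `leaves` stands in for
    the leaf offset a real index would store.
    """
    leaves = [sorted_entries[i:i + leaf_size]
              for i in range(0, len(sorted_entries), leaf_size)]
    root = [leaf[0][0] for leaf in leaves]
    return root, leaves


def lookup(root, leaves, key):
    """Return the data offset for `key`, or None when it is not indexed."""
    # Step 1: search the root for the last leaf whose first key <= key.
    leaf_no = bisect_right(root, key) - 1
    if leaf_no < 0:
        return None
    # Step 2: search that leaf for the exact key.
    leaf = leaves[leaf_no]
    leaf_keys = [k for k, _ in leaf]
    pos = bisect_right(leaf_keys, key) - 1
    if pos >= 0 and leaf_keys[pos] == key:
        return leaf[pos][1]
    return None


entries = [(10, 0), (20, 128), (30, 256), (40, 384), (50, 512)]
root, leaves = build_two_level_index(entries)
print(lookup(root, leaves, 40))  # 384
print(lookup(root, leaves, 25))  # None
```

The returned offset is what the engine would then use to read the tuple directly from storage; only one root search and one leaf search are needed per key.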
Tajo supports various join strategies used in shared-nothing databases. There are two types of join: broadcast join and repartition join (hash and range). A repartition join requires two phases, and Tajo chooses a hash join or a merge join depending on whether the table is sorted by the join key. In the first phase, Tajo scans the data set and filters it by selection push-down. The scanned result is then hash-repartitioned or range-repartitioned. In the second phase, a hash join is executed, or a merge join in the case of a range partition.
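The two phases of a hash repartition join can be sketched in a single process. This is an illustrative sketch only: the function names, the tuple-based row layout, and the `num_partitions` parameter are assumptions made here, and each partition pair that a cluster would join on a separate worker is simply joined in a loop.

```python
from collections import defaultdict


def hash_repartition(rows, key_index, num_partitions, predicate=None):
    """Phase 1: scan rows, apply the pushed-down filter, and hash-partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        if predicate is None or predicate(row):
            partitions[hash(row[key_index]) % num_partitions].append(row)
    return partitions


def hash_join(left_rows, right_rows, left_key, right_key):
    """Phase 2: join one pair of co-located partitions with a hash table."""
    table = defaultdict(list)
    for row in left_rows:
        table[row[left_key]].append(row)
    return [l + r for r in right_rows for l in table.get(r[right_key], [])]


def repartition_join(left, right, left_key, right_key, num_partitions=4,
                     left_predicate=None, right_predicate=None):
    left_parts = hash_repartition(left, left_key, num_partitions,
                                  left_predicate)
    right_parts = hash_repartition(right, right_key, num_partitions,
                                   right_predicate)
    joined = []
    # In a cluster each partition pair would be joined on a different worker.
    for l_part, r_part in zip(left_parts, right_parts):
        joined.extend(hash_join(l_part, r_part, left_key, right_key))
    return joined


users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(101, 1, 30), (102, 3, 5), (103, 1, 12), (104, 4, 8)]
# Join users.id = orders.user_id, keeping only orders with amount >= 10.
rows = repartition_join(users, orders, 0, 1,
                        right_predicate=lambda r: r[2] >= 10)
print(sorted(rows))  # [(1, 'ann', 101, 1, 30), (1, 'ann', 103, 1, 12)]
```

Because both inputs are partitioned with the same hash function on the join key, matching rows always land in the same partition number, which is what lets each partition pair be joined independently.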
If the larger table is sorted on the join key, Tajo uses a decentralized join strategy. The smaller tables are first repartitioned by range. Tajo then assigns each range partition to the node that holds the part of the large table corresponding to that join key range. As the last step, each node performs a merge join.
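A rough sketch of this strategy follows, assuming the large table is already sorted and split across nodes at known key boundaries. The helper names and the `upper_bounds` representation of the key ranges are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def range_repartition(rows, key_index, upper_bounds):
    """Send each row to the partition whose key range contains its key.

    `upper_bounds[i]` is the exclusive upper bound of partition i; the last
    partition takes every remaining key. Rows stay sorted inside a partition.
    """
    partitions = [[] for _ in range(len(upper_bounds) + 1)]
    for row in sorted(rows, key=lambda r: r[key_index]):
        partitions[bisect_right(upper_bounds, row[key_index])].append(row)
    return partitions


def merge_join(left, right, left_key, right_key):
    """Join two inputs that are both sorted on their join keys."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_key], right[j][right_key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit every right row that shares this key, then advance left.
            k = j
            while k < len(right) and right[k][right_key] == lk:
                out.append(left[i] + right[k])
                k += 1
            i += 1
    return out


# Large table: sorted on the key and split across three nodes at 10 and 20.
large_by_node = [[(1, "a"), (7, "b")], [(10, "c"), (15, "d")], [(22, "e")]]
small = [(15, "x"), (1, "y"), (22, "z"), (9, "w")]

small_parts = range_repartition(small, 0, [10, 20])
result = [row
          for big, sm in zip(large_by_node, small_parts)
          for row in merge_join(big, sm, 0, 0)]
print(result)  # [(1, 'a', 1, 'y'), (15, 'd', 15, 'x'), (22, 'e', 22, 'z')]
```

Only the small table moves in this sketch; the large table stays where it is, which is the point of choosing this strategy when the large side is already sorted.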
Tajo parallelizes client requests by executing them as a distributed system with a master-worker structure. TajoMaster serves multiple clients and assigns each query to a QueryMaster. When a query is assigned to a QueryMaster, it is broken down into work for multiple TaskRunners, which is delivered to the nodes of the distributed system and executed there.
Tajo introduced a JIT-based vectorization engine at Hadoop Summit 2014. The JIT compiler generates JVM bytecode for vectorization primitives at runtime.
The original version of Tajo used a tuple-at-a-time model, which offers a simple interface and allows arbitrary operator combinations. However, it suffered from performance degradation because it generated too many function calls and branches and had low data and instruction cache hit rates. The JIT-based vectorization engine was introduced to address these problems.
As part of vectorization, Tajo performs columnar processing on primitive arrays and uses JIT compilation to create vectorized primitives. An unsafe-based in-memory structure avoids additional object creation, and an unsafe-based Cuckoo hash table enables lookups without garbage collection. As a result, cache hit rates improved because the primitives are sized to fit the cache, and CPU cost decreased because fewer branches reach the CPU pipeline.
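The difference between the two processing models can be illustrated with a small sketch. It shows only the shape of the computation, with one loop per tuple versus one primitive per column; it does not generate JVM bytecode or reproduce Tajo's primitives, and it is not meant to be faster in Python. The primitive names `sel_gt` and `map_mul` are hypothetical.

```python
from array import array


def filter_sum_tuple_at_a_time(rows, threshold):
    """Row model: one iteration and one branch per tuple."""
    total = 0
    for price, qty in rows:
        if price > threshold:
            total += price * qty
    return total


# Vectorized primitives: each one loops over a whole primitive array.
def sel_gt(col, threshold):
    """Return a selection vector of positions where col > threshold."""
    return array("i", (i for i, v in enumerate(col) if v > threshold))


def map_mul(a, b, sel):
    """Multiply two columns at the selected positions only."""
    return array("q", (a[i] * b[i] for i in sel))


def filter_sum_vectorized(price, qty, threshold):
    sel = sel_gt(price, threshold)
    return sum(map_mul(price, qty, sel))


price = array("q", [5, 12, 20, 3])
qty = array("q", [2, 1, 3, 10])
print(filter_sum_tuple_at_a_time(zip(price, qty), 10))  # 72
print(filter_sum_vectorized(price, qty, 10))            # 72
```

In an engine that compiles such primitives to machine code, each primitive becomes a tight loop over a primitive array with few branches, which is where the cache and pipeline benefits described above come from.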
Tajo supports SQL and provides two query interfaces: an interactive query interface and the Tajo web console.
Interactive Query Interface: Tajo provides a command-line interface (CLI) for executing SQL queries directly. This interface shows the results of SQL queries in the terminal.
Tajo Web Console: It is an interactive interface that allows users to submit and execute Tajo queries through a web browser. This interface allows users to view query execution results graphically and monitor the performance and progress of queries.
Tajo's TaskRunner manages data in HDFS and the local file system through a storage manager. Because queries are executed in a distributed manner, Tajo repartitions data in a way similar to the shuffle phase of MapReduce; each TaskRunner accesses data through HDFS and caches it in the local file system.
Tajo provides a split tool to divide an input data set into multiple fragments. In addition, Scanner and Appender interfaces are provided so that users can access specialized data structures. Users who want their own scanners or appenders can implement them through user-defined functions.
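The idea of splitting an input into fragments can be sketched as byte ranges over a file. The `Fragment` class and `split` function below are hypothetical and are not Tajo's API; a real splitter would also have to respect record and block boundaries, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    path: str
    offset: int
    length: int


def split(path, file_size, fragment_size):
    """Cut a file of `file_size` bytes into consecutive byte-range fragments."""
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    return [Fragment(path, start, min(fragment_size, file_size - start))
            for start in range(0, file_size, fragment_size)]


for fragment in split("/data/t.csv", 250, 100):
    print(fragment.offset, fragment.length)
# 0 100
# 100 100
# 200 50
```

Each fragment can then be handed to a separate scanner, so that different workers read disjoint parts of the same input in parallel.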
Tajo supports multiple data formats, such as TEXT, JSON, ORC, SequenceFile, and Parquet files. In addition, database connection methods such as JDBC are supported for linking with external data sources such as HiveMetaStore. Tajo provides various row and columnar store file formats, such as CSVFile, RowFile, RCFile, and Trevni, and supports writing data in each file format by providing a wrapper for it.
The architecture of Tajo follows the master-worker model and employs Hadoop YARN as a resource manager for large clusters. TajoMaster is a dedicated server that provides the client service and coordinates QueryMasters. For each query, Tajo deploys one QueryMaster and several TaskRunners together. Each TaskRunner includes a local query engine that executes a directed acyclic graph (DAG) of physical operators and a storage manager that accesses HDFS and local file systems.
Hyunsik Choi and Jihoon Son, who were members of Korea University's DB Laboratory