Tajo

Viewing Revision #15 from 2023-04-02 01:07

Apache Tajo was inspired by Google's paper "Dremel: Interactive Analysis of Web-Scale Datasets". Dremel combines distributed column-oriented storage with a column-oriented SQL query engine for processing large amounts of data, and Tajo likewise provides a column-oriented data store and SQL query engine.[05][06][01]

Website: https://tajo.apache.org/[01]
Source Code: https://git-wip-us.apache.org/repos/asf?p=tajo.git[02]
Tech Docs: https://tajo.apache.org/docs/current/[03]
Developer: Hyunsik Choi and Jihoon Son
Country of Origin: KR
Start Year: 2012 [08]
End Year: 2020
Acquired By: Apache Software Foundation
Project Type: Open Source
Written in: Java
Supported Languages: Bash, Java, Perl, Python, Ruby
Compatible With: Hive
Operating System: All OS with Java VM
License: Apache v2
Twitter: @ApacheTajo[04]

Apache Tajo is an open-source relational, distributed data warehouse system for big data that provides fault-tolerant analytical processing on large-scale datasets. It is compatible with Apache Hadoop and HDFS and supports standard SQL, including complex queries, joins, and aggregations.

Apache Tajo is designed to be scalable and can process massive data sets with tens of thousands of nodes. It supports various file formats, including CSV, TSV, ORC, and Parquet. A Tajo cluster follows the master-worker model: the master nodes manage the cluster and coordinate query execution, while the worker nodes perform the actual data processing.

Tajo also supports user-defined functions (UDFs), which let users extend Tajo with their own logic. Additionally, Tajo includes a web-based user interface and a command-line interface for managing and querying data. For optimization, Tajo provides a cost-based optimization model and extensible rewrite rules. A commercial solution with similar functionality is Cloudera's Impala.

Database Entry

OLAP

History[07][01]

2012: Started by Hyunsik Choi and Jihoon Son as a project of Korea University's DB Lab.

2013-03: Tajo entered the Apache Incubator, with developers from Gruter, Korea University, LinkedIn, NASA, Hortonworks, and Intel participating.

2014-03: Became Apache Top-Level Project (TLP)

2019-12: Released latest stable version (Tajo 0.12.0)

2020-09: The project was retired and moved to the Apache Attic.

Checkpoints[08][09]

Not Supported

Tajo has no checkpoint mechanism of its own. For stored data it relies on HDFS, which is fault tolerant through data replication. Tajo itself considers fault tolerance only in its query execution strategy: because it targets data warehouse / OLAP queries, it reassigns failed tasks to other workers.
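The reassignment strategy described above can be illustrated with a minimal Python sketch. It is not Tajo's scheduler: the function name, the round-robin worker choice, the retry limit, and the `execute` callback are all assumptions made for illustration.

```python
from collections import deque


def run_with_reassignment(tasks, workers, execute, max_attempts=3):
    """Run every task, retrying a failed task on a different worker.

    `execute(worker, task)` is a hypothetical callback that returns a result
    or raises an exception when the worker fails.
    """
    results = {}
    # Each queue entry is (task, attempts so far, worker that last failed it).
    queue = deque((task, 0, None) for task in tasks)
    next_worker = 0
    while queue:
        task, attempts, failed_on = queue.popleft()
        # Choose workers round-robin, skipping the one that just failed this task.
        worker = workers[next_worker % len(workers)]
        next_worker += 1
        if worker == failed_on and len(workers) > 1:
            worker = workers[next_worker % len(workers)]
            next_worker += 1
        try:
            results[task] = execute(worker, task)
        except Exception:
            if attempts + 1 >= max_attempts:
                raise RuntimeError(f"task {task!r} failed {max_attempts} times")
            # Put the task back so another worker can pick it up.
            queue.append((task, attempts + 1, worker))
    return results
```

No intermediate state is saved in this sketch: a failed task simply starts over on another worker, which is the trade-off a system without checkpoints accepts.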

Although not a checkpoint, Tajo provides catalog backup and restore capabilities in the form of SQL dumps and database-level backups.

Compression[10][11]

Bit Packing / Mostly Encoding

Tajo provides compression according to the data format. Compression only affects the stored data format and it is enabled when a table is created.

text / json / rcfile / sequencefile data format: Classes supported by Hadoop are used for these formats. Hadoop's known compression classes include BZip2Codec, DefaultCodec, GzipCodec, and PassthroughCodec.

parquet data format: snappy, gzip, and lzo are supported for the parquet data format.

orc data format: snappy and zlib are supported for the orc data format.
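The format-to-codec rules above can be summarized as a small lookup. The following Python sketch is only an illustration of those rules; the names `CODECS_BY_FORMAT` and `is_supported` are hypothetical and not part of Tajo, and the Hadoop codec set contains only the classes named in this entry.

```python
# Hadoop codec classes named in this entry; Hadoop ships further codecs not listed here.
HADOOP_CODECS = {"BZip2Codec", "DefaultCodec", "GzipCodec", "PassthroughCodec"}

# Codecs per data format, as described in this section.
CODECS_BY_FORMAT = {
    "text": HADOOP_CODECS,
    "json": HADOOP_CODECS,
    "rcfile": HADOOP_CODECS,
    "sequencefile": HADOOP_CODECS,
    "parquet": {"snappy", "gzip", "lzo"},
    "orc": {"snappy", "zlib"},
}


def is_supported(data_format, codec):
    """Return True if this entry lists `codec` as usable with `data_format`."""
    allowed = CODECS_BY_FORMAT.get(data_format.lower())
    if allowed is None:
        return False
    if allowed is HADOOP_CODECS:
        # Accept a fully qualified class name such as
        # org.apache.hadoop.io.compress.GzipCodec by comparing its last segment.
        return codec.rsplit(".", 1)[-1] in allowed
    return codec.lower() in allowed
```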

Data Model[12][13][14][15]

Relational

Tajo's data model is relational. Data is organized into uniquely named tables made up of rows and columns, where each column represents a data attribute. Because the data model is compatible with SQL, data can be manipulated and queried using SQL. Tajo supports multiple data formats, such as TEXT, JSON, RCFile, ORC, SequenceFile, and Parquet. In addition, connection methods such as JDBC are supported for linking with external data sources such as the Hive Metastore.

Indexes[16]

AVL-Tree

Tajo supports only one type of index, TWO_LEVEL_BIN_TREE, or BST for short. The BST index is a binary search tree consisting of two levels of nodes: a leaf node indexes the keys with the offsets to data stored on HDFS, and a root node indexes the keys with the offsets to the leaf nodes.

The query engine first reads the root node and finds the search key in an index scan. If it successfully finds the leaf node corresponding to the search key, it finds the search key on that leaf node and reads the tuple directly from HDFS. Users can create an index using SQL.
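The two-level lookup can be sketched in a few lines of Python. This is a simplified in-memory model, not Tajo's on-disk index format: the function names, the `leaf_size` parameter, and the use of list positions in place of file offsets are assumptions for illustration.

```python
from bisect import bisect_right


def build_two_level_index(entries, leaf_size=4):
    """Build (root, leaves) from a list of (key, data_offset) pairs sorted by key."""
    leaves = [entries[i:i + leaf_size] for i in range(0, len(entries), leaf_size)]
    # The root stores the first key of each leaf together with that leaf's position.
    root = [(leaf[0][0], pos) for pos, leaf in enumerate(leaves)]
    return root, leaves


def lookup(root, leaves, key):
    """Return the data offset stored for `key`, or None if the key is not indexed."""
    # Step 1: search the root for the leaf whose key range may contain the key.
    root_keys = [k for k, _ in root]
    i = bisect_right(root_keys, key) - 1
    if i < 0:
        return None
    # Step 2: search that leaf for the exact key.
    leaf = leaves[root[i][1]]
    leaf_keys = [k for k, _ in leaf]
    j = bisect_right(leaf_keys, key) - 1
    if j >= 0 and leaf_keys[j] == key:
        return leaf[j][1]
    return None
```

With the offset in hand, a real engine would seek to that position in the data file and read the tuple; here the offset is simply returned.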

Joins[05][17]

Hash Join, Broadcast Join

Tajo supports various join strategies used in shared-nothing databases (or Apache Hive). There are two types of join: broadcast join and repartition join (hash and range). A repartition join takes two phases, and Tajo can mix various join algorithms. In the first phase, Tajo scans the data set and filters it by selection push-down. The scanned result is repartitioned by hash or by range. In the second phase, a hash join is executed, or a merge join in the case of a range partition.
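The two phases of a hash repartition join can be sketched as follows. This is a single-process Python illustration under stated assumptions, not Tajo code: rows are modeled as dictionaries, partitions as lists, and the function names are hypothetical.

```python
from collections import defaultdict


def repartition(rows, key, num_partitions, predicate=lambda row: True):
    """Phase 1: scan, filter (selection push-down), and hash-partition on the join key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        if predicate(row):
            parts[hash(row[key]) % num_partitions].append(row)
    return parts


def hash_join(left, right, key):
    """Phase 2: build a hash table on one partition and probe it with the other."""
    table = defaultdict(list)
    for row in left:
        table[row[key]].append(row)
    return [{**l, **r} for r in right for l in table.get(r[key], [])]


def repartition_join(left, right, key, num_partitions=4,
                     left_pred=lambda row: True, right_pred=lambda row: True):
    """Join two tables by repartitioning both on the key, then joining partition pairs."""
    left_parts = repartition(left, key, num_partitions, left_pred)
    right_parts = repartition(right, key, num_partitions, right_pred)
    joined = []
    # Rows with equal keys land in the same partition number on both sides,
    # so each pair of partitions can be joined independently (in parallel on a cluster).
    for l_part, r_part in zip(left_parts, right_parts):
        joined.extend(hash_join(l_part, r_part, key))
    return joined
```

Applying the filter before repartitioning keeps rows that cannot match from being shuffled at all, which is the point of the selection push-down.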

If the larger table is sorted on the join key, Tajo uses a decentralized join strategy. The smaller tables are first repartitioned by range. Tajo then assigns each range partition to the node whose portion of the large table covers the same join key range. As the last step, each node performs a merge join.
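A rough Python sketch of this range-based strategy follows. It assumes, for illustration only, that the large table is already split into sorted per-node partitions by the same boundaries used for the small table; all names are hypothetical and nothing here is taken from Tajo's implementation.

```python
from bisect import bisect_right


def range_partition(rows, key, boundaries):
    """Split rows into len(boundaries) + 1 partitions, each sorted on the key.

    Partition p holds keys k with boundaries[p - 1] <= k < boundaries[p].
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, row[key])].append(row)
    return [sorted(p, key=lambda r: r[key]) for p in parts]


def merge_join(left, right, key):
    """Join two inputs that are both sorted on the key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with
            # every left row that carries the same key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            while i < len(left) and left[i][key] == lk:
                out.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return out


def range_merge_join(large_parts, small, key, boundaries):
    """Send each range of the small table to the node holding the matching large range."""
    small_parts = range_partition(small, key, boundaries)
    joined = []
    for large_part, small_part in zip(large_parts, small_parts):
        joined.extend(merge_join(large_part, small_part, key))
    return joined
```

Only the smaller table moves across the network in this scheme; the large, already sorted table stays where it is.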

Parallel Execution[05]

Intra-Operator (Horizontal)

Tajo executes client queries in parallel across a distributed cluster with a master-worker structure. TajoMaster serves multiple clients and assigns each query to a QueryMaster. Once a query is assigned to a QueryMaster, it is split across multiple TaskRunners, which are dispatched to the nodes of the cluster and executed there.

Query Compilation[18]

JIT Compilation

Tajo's 0.8.0 release introduced a JIT-based vectorization engine, in which the JIT generates bytecode for vectorization primitives at runtime.

Query Execution[18]

Vectorized Model

The vectorized engine was introduced in Tajo's 0.8.0 release.

Tajo executes queries in a distributed fashion. It implements a Distributed Query Execution Plan (DQEP), which is a directed acyclic graph (DAG) of execution blocks.
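To make the DAG idea concrete, the sketch below orders a small plan so that each execution block runs only after the blocks it depends on. It uses Python's standard `graphlib` module (Python 3.9 or later); the block names and the `execution_waves` helper are hypothetical and do not reflect how Tajo represents or schedules a DQEP.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group execution blocks into waves that could run in parallel.

    `dependencies` maps each block to the set of blocks whose output it consumes.
    Raises graphlib.CycleError if the plan is not acyclic.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Every block returned here has all of its inputs already finished.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# A hypothetical plan: two scans feed a join, which feeds an aggregation.
plan = {
    "scan_a": set(),
    "scan_b": set(),
    "join": {"scan_a", "scan_b"},
    "aggregate": {"join"},
}
```

For this plan, `execution_waves(plan)` places both scans in the first wave, the join in the second, and the aggregation in the third.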

Query Interface

SQL, Command-line / Shell

Tajo supports SQL standards.

Storage Architecture

Disk-oriented

Storage Format

Parquet, ORC, SequenceFile

Tajo provides a split tool to split an input data set into multiple fragments. In addition, Scanner and Appender interfaces are provided to users to access specialized data structures.

Tajo provides various row and columnar store file formats, such as CSVFile, RowFile, RCFile, and Trevni. A wrapper for each file format lets Tajo store data in that format.

Storage Model

Custom

System Architecture

Shared-Disk

The architecture of Tajo follows the master-worker model and employs Hadoop YARN as a resource manager for large clusters. TajoMaster is a dedicated server that provides client services and coordinates QueryMasters. For each query, Tajo deploys one QueryMaster together with several TaskRunners. Each TaskRunner includes a local query engine that executes a directed acyclic graph (DAG) of physical operators.

Citations

18 sources

[01] Tajo - A Big Data Warehouse System on Hadoop. apache.org. Modified: 2025-03-25. Accessed: 2026-06-05.
[02] https://git-wip-us.apache.org/repos/asf?p=tajo.git. apache.org. Accessed: 2026-06-05.
[03] Apache Tajo™ (0.11.3 Release) - User documentation - Apache Tajo 0.11.3 documentation. apache.org. Modified: 2016-05-18. Accessed: 2026-06-05.
[04] https://twitter.com/ApacheTajo. twitter.com.
[05] Tajo: A Distributed Data Warehouse System for Hadoop (PDF). slideshare.net. Accessed: 2026-06-02.
[06] 타조 (소프트웨어) - 위키백과 [Tajo (software) - Korean Wikipedia]. wikipedia.org. Modified: 2025-04-17. Accessed: 2026-06-08.
[07] Tajo | The Apache Attic. apache.org. Modified: 2026-04-21. Accessed: 2026-06-01.
[08] https://ieeexplore.ieee.org/document/6544934. ieee.org. Accessed: 2026-06-02.
[09] Backup and Restore Catalog - Apache Tajo 0.12.0-SNAPSHOT documentation. apache.org. Modified: 2016-05-20. Accessed: 2026-06-02.
[10] CompressionCodec (Apache Hadoop Main 3.5.0 API). apache.org. Modified: 2026-04-03. Accessed: 2026-06-02.
[11] Compression - Apache Tajo 0.11.3 documentation. apache.org. Modified: 2016-05-18. Accessed: 2026-06-02.
[12] Data Formats - Apache Tajo 0.12.0-SNAPSHOT documentation. apache.org. Modified: 2016-05-20. Accessed: 2026-06-02.
[13] Tablespaces - Apache Tajo 0.12.0-SNAPSHOT documentation. apache.org. Modified: 2016-05-20. Accessed: 2026-06-02.
[14] https://tajo.apache.org/docs/devel/jdbc_driver.html?highlight=jdbc. apache.org. Dead link. Accessed: 2026-06-02.
[15] Data Model - Apache Tajo 0.11.3 documentation. apache.org. Modified: 2016-05-18. Accessed: 2026-06-02.
[16] Index Types - Apache Tajo 0.11.3 documentation. apache.org. Modified: 2016-05-18. Accessed: 2026-06-02.
[17] Joins - Apache Tajo 0.11.3 documentation. apache.org. Modified: 2016-05-18. Accessed: 2026-06-02.
[18] Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine (PPTX). slideshare.net. Accessed: 2026-06-02.

Revision #15 Last Updated: 2023-04-01 21:07