Pinot is an open-source, distributed, relational OLAP database management system developed by LinkedIn. It is designed to support large-scale real-time analytics on a variety of data sets. For use cases that are sensitive to data freshness, Pinot can ingest streaming data directly from Kafka. For applications that can tolerate a data lag of a few hours to a day, Pinot can load batch data from Hadoop / HDFS. Pinot can also dynamically merge data streams that come from both offline and online sources.
Pinot divides tables into segments, which are immutable sets of tuples. Tuples inside each segment are organized in a columnar manner. Segments are the basic units in Pinot. On server nodes, data from Kafka and Hadoop / HDFS is processed and cached as segments. Each segment stores metadata, indexes, and the necessary zone maps for its tuples. Storage optimizations are applied inside each segment, and indexes are built for each segment. Query plans and optimizations are also generated and performed on a per-segment basis.
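The segment layout described above can be sketched in a few lines. This is a minimal illustration, not Pinot's actual on-disk format: the `Segment` class, the `build_segment` helper, and the sample column names are all hypothetical, and the zone map is assumed to be a simple per-column (min, max) pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical immutable segment: columnar data plus per-column zone maps."""
    columns: dict   # column name -> tuple of values (one tuple per column)
    zone_map: dict  # column name -> (min value, max value), an assumed layout


def build_segment(rows, column_names):
    # Pivot row tuples into one tuple of values per column (columnar layout).
    columns = {
        name: tuple(row[i] for row in rows)
        for i, name in enumerate(column_names)
    }
    # Record the min and max of every column as a simple zone map.
    zone_map = {name: (min(vals), max(vals)) for name, vals in columns.items()}
    return Segment(columns=columns, zone_map=zone_map)


rows = [("us", 3, 100), ("de", 5, 101), ("us", 1, 102)]
segment = build_segment(rows, ["country", "clicks", "ts"])
```

Because the dataclass is frozen and the columns are stored as tuples, the sketch mirrors the idea that a segment is written once and then only read.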
An internal building block of Pinot is Apache Helix, which is used to manage the cluster, including replicas and redundant nodes. Pinot also relies on two external components, Zookeeper and object stores, to provide persistent storage for global metadata and for the database's data.
Pinot was first developed at LinkedIn in 2014 as an internal analytics infrastructure. It originated from the need to scale out OLAP systems to support low-latency real-time queries on huge volumes of data. It was open-sourced in 2015 and entered the Apache Incubator in 2018. Pinot is named after Pinot noir, a grape variety that can produce the most complex wine but is the toughest to grow and process. The name is a portrayal of data: powerful but hard to analyze.
Pinot is a relational database management system. The data type of each attribute can be an integer of various lengths, a floating-point number, a string, a boolean, or an array. The column type of each attribute can be dimension, metric, or timestamp.
Pinot supports both the Vectorized Model and the Tuple-at-a-Time Model. Which one is used depends on the query type and the organization of the column data. Bulk optimizations can be made if a target column has been physically reordered. Pinot also leverages the zone map of each segment to accelerate queries.
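One way zone maps can accelerate a query is by letting the engine skip segments that cannot contain matching rows. The sketch below assumes a zone map is a per-column (min, max) pair; the `may_match` function, the segment names, and the `ts` column are hypothetical and only illustrate the pruning idea.

```python
def may_match(zone_map, column, low, high):
    """Return False only when the segment cannot hold values in [low, high]."""
    col_min, col_max = zone_map[column]
    # The ranges are disjoint when the filter ends before the column starts
    # or begins after the column ends.
    return not (high < col_min or low > col_max)


# Hypothetical zone maps for three segments, keyed by segment name.
segment_zone_maps = {
    "seg_0": {"ts": (100, 199)},
    "seg_1": {"ts": (200, 299)},
    "seg_2": {"ts": (300, 399)},
}

# Filter: ts BETWEEN 250 AND 320. Only overlapping segments need scanning.
candidates = [
    name for name, zm in segment_zone_maps.items()
    if may_match(zm, "ts", 250, 320)
]
```

The check is conservative: a segment that passes may still contain no matching rows, but a segment that fails can be skipped safely.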
Queries are split into subqueries and executed in parallel on corresponding segments.
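As a rough sketch of this per-segment parallelism, the example below runs the same aggregation on each segment in a thread pool and then merges the partial results. The segment contents, the `sum_clicks` and `run_query` helpers, and the use of threads are assumptions made for illustration, not a description of Pinot's scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical segments, each holding (country, clicks) rows.
segments = {
    "seg_0": [("us", 3), ("de", 5)],
    "seg_1": [("us", 1), ("fr", 2)],
    "seg_2": [("de", 4)],
}


def sum_clicks(rows, country):
    # Subquery: aggregate one segment independently of the others.
    return sum(clicks for c, clicks in rows if c == country)


def run_query(segments, country):
    # Run one subquery per segment in parallel, then merge the partial sums.
    with ThreadPoolExecutor() as pool:
        partials = list(
            pool.map(lambda rows: sum_clicks(rows, country), segments.values())
        )
    return sum(partials)


total_us = run_query(segments, "us")
```

Summation merges cleanly because partial sums can be added in any order; other aggregates (an average, for instance) would need to carry extra state such as a count.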
Pinot leverages various types of encoding to reduce storage overhead. The typical size of a segment ranges from a few hundred megabytes to a few gigabytes. Each data encoding technique has its own specialized physical operators to optimize query execution.
Pinot supports pluggable Sorted Index, BitMap Index, and Inverted Index. BitMap Index is used to optimize queries on categorical data. Inverted Index is used to support lookup by keyword. These index types were chosen to exploit the characteristics of social data, which is usually categorical and textual.
An Inverted Index can be built on top of BitMap, and a BitMap Index can be optimized with various compression techniques. Data columns can also be physically reordered to optimize specific queries in Pinot, since filters on such a column end up targeting a contiguous range of the column data.
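The two ideas in this paragraph can be sketched as follows. The first part builds an uncompressed bitmap-based inverted index, using Python integers as bitmaps; the second shows how a filter on a physically sorted column reduces to one contiguous row range found by binary search. All names and data are hypothetical, and the compression techniques the text mentions are left out.

```python
from bisect import bisect_left, bisect_right


def build_bitmap_index(column):
    """Map each distinct value to an int whose set bits are matching row ids."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def bitmap_to_rows(bitmap):
    # Expand a bitmap back into the list of row ids whose bit is set.
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


country = ["us", "de", "us", "fr", "de"]
index = build_bitmap_index(country)
# country = 'us' OR country = 'fr' becomes a bitwise OR of two bitmaps.
us_or_fr = bitmap_to_rows(index["us"] | index["fr"])

# On a sorted column, ts BETWEEN 105 AND 110 is one contiguous slice.
sorted_ts = [100, 100, 105, 110, 110, 120]
start = bisect_left(sorted_ts, 105)
end = bisect_right(sorted_ts, 110)
matching_ts = sorted_ts[start:end]
```

Combining predicates with bitwise operations is what makes bitmaps attractive for categorical filters, while the sorted column needs no per-value structure at all, only the two range boundaries.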
Pinot consists of four components: servers, controllers, brokers, and minions. Together they provide data storage, data management, and query processing. Each is briefly introduced below:
Servers are responsible for temporary data storage and query execution. Pinot stores segments across server nodes in a distributed manner. Each segment is loaded from an external data source under the control of controller nodes. A segment has multiple replicas to improve throughput.
Controllers are responsible for maintaining global metadata and system state. They are implemented with Apache Helix.
Brokers are responsible for routing queries and gathering results. They control the flow of queries, such as where each query should be sent and how the final result is generated from the intermediate results of different nodes.
Minions are responsible for running maintenance tasks, which are usually time-consuming and should not interfere with running queries.
In a typical process to load a segment, controller nodes first tell server nodes to fetch segments. Server nodes then fetch global metadata and load segments from the corresponding external object store nodes. Finally, controller nodes and broker nodes update global metadata and cluster states.
In a typical process to execute a query from a client, broker nodes first pick a routing table for the query and contact the corresponding server nodes. Server nodes then execute the query on the segments they hold. Results are gathered, merged, and returned to the broker nodes. Finally, broker nodes check the result for errors or timeouts and reply to the client.
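The query flow above can be sketched as a small scatter-gather loop. Everything here is a simplified assumption: the routing tables, the server and segment names, and the round-robin choice of routing table by `query_id` are hypothetical, and the servers are plain function calls rather than remote nodes.

```python
# Hypothetical segment contents: each segment holds a list of click counts.
SEGMENT_DATA = {"seg_0": [3, 5], "seg_1": [1, 2], "seg_2": [4]}

# Each routing table assigns every segment to exactly one server.
ROUTING_TABLES = [
    {"server_a": ["seg_0", "seg_1"], "server_b": ["seg_2"]},
    {"server_a": ["seg_0"], "server_b": ["seg_1", "seg_2"]},
]


def server_execute(segment_names):
    # A server runs the query (here, a SUM) on the segments it was assigned.
    return sum(sum(SEGMENT_DATA[name]) for name in segment_names)


def broker_execute(query_id):
    # Pick a routing table; round-robin by query id is an assumption.
    routing = ROUTING_TABLES[query_id % len(ROUTING_TABLES)]
    partials, errors = [], []
    for server, segment_names in routing.items():
        try:
            partials.append(server_execute(segment_names))
        except KeyError as exc:
            # Record the failure so the reply can flag a partial result.
            errors.append(f"{server}: missing segment {exc}")
    # Merge the partial results and report any errors alongside them.
    return {"result": sum(partials), "errors": errors}
```

Since every routing table covers each segment exactly once, `broker_execute(0)` and `broker_execute(1)` produce the same total even though the work is divided differently between the two servers.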
Pinot uses Pinot Query Language (PQL), a subset of SQL, as its query interface. PQL supports selection, ordering and pagination on selection, filtering, aggregation, and grouping on aggregation. It does not support joins, nested queries, record-level creation, updates, or deletion, or any data definition language (DDL).
Grouping on aggregation has a default truncation to the top 10 result tuples. PQL uses TOP n to set this truncation.
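The effect of this truncation can be illustrated with a small simulation. The query string shows how a TOP clause might look on a grouped aggregation; the table name `pageViews`, the `country` column, and the `group_count_top` helper are hypothetical, and the Python code only mimics the truncation behavior rather than calling Pinot.

```python
from collections import Counter

# Illustrative PQL text only; the table and column names are made up.
PQL_EXAMPLE = "SELECT COUNT(*) FROM pageViews GROUP BY country TOP 3"


def group_count_top(rows, key, top=10):
    """Count rows per group and keep only the `top` largest groups."""
    counts = Counter(row[key] for row in rows)
    return counts.most_common(top)


# Twelve groups c0..c11, where group c<i> has i + 1 rows.
rows = [{"country": f"c{i}"} for i in range(12) for _ in range(i + 1)]

default_result = group_count_top(rows, "country")       # default of 10 groups
top3_result = group_count_top(rows, "country", top=3)   # explicit TOP 3
```

With twelve groups in the data, the default call drops the two smallest groups, which is the kind of silent truncation a client needs to be aware of when it expects every group back.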
Pinot uses replicas to provide fault tolerance and high availability. It also uses redundant controller instances to improve availability.
Checkpoints, however, are not supported because segments are immutable, which means there are no writes to segments during the execution of queries. A segment can nonetheless be entirely replaced with a newer version.
Pinot uses the PAX storage model, which divides tuples into segments and stores data inside each segment in a columnar manner. A segment is immutable and typically contains tens of millions of rows. It also stores metadata, indexes, and zone maps for its tuples.
Pinot server nodes store segments in directories of a UNIX file system. Each such directory contains a metadata file and an index file. The metadata file stores information about the tuple columns in the segment. The index file stores indexes for all the columns. The global metadata about segments, including the mapping of each segment to the node where it is located, is maintained in controller nodes.
Each segment is in effect a cache of data, with an expiration time to ensure a certain level of data freshness. The original sources of segments are external systems like Kafka and Hadoop / HDFS.