Redshift

Viewing Revision #10 from 2018-12-12 05:49 View Current

Amazon Redshift is a data warehouse product developed by Amazon as part of its cloud platform, Amazon Web Services. Redshift is a relational database management system designed specifically for OLAP. It is built on top of PostgreSQL and ParAccel's massively parallel processing (MPP) technology, and it leverages a distributed architecture, columnar storage, and column compression to execute exploratory queries. Because it is based on PostgreSQL, Redshift allows clients to connect and execute DDL and DML SQL statements using JDBC or ODBC.[04]

Website: https://aws.amazon.com/redshift/[01]
Tech Docs: https://docs.aws.amazon.com/redshift/latest/dg/welcome.html[02]
Developer: Amazon.com, Inc.
Country of Origin: US
Start Year: 2012
Project Type: Commercial
Written in: C
Derived From: ParAccel, PostgreSQL
Operating System: Hosted
License: Proprietary
Wikipedia: https://en.wikipedia.org/wiki/Amazon_Redshift[03]

History[05][06][07][08][09][10]

In July 2011, Amazon invested in ParAccel, a software company that developed a shared-nothing relational database system for analytics and business intelligence. In exchange for its investment, Amazon acquired license rights to ParAccel's database system, which would form the foundation of Amazon's own data warehouse solution: Amazon Redshift. After subsequent development by Amazon and integration with AWS, Amazon Redshift was officially announced at the AWS re:Invent 2012 conference and, after a limited preview, was released to the general public in February 2013.

Compression[11][12]

Dictionary Encoding, Delta Encoding, Run-Length Encoding, Naïve (Page-Level), Bit Packing / Mostly Encoding

Redshift allows columns to be compressed, which reduces data size and stores more data within each disk block. This reduces disk I/O and improves query performance. Column compression is applied automatically when loading data into Redshift using the COPY command, but encodings can also be selected manually.

Redshift supports the following compression encodings.

Raw

Raw encoding stores the data as-is, with none of the values compressed. By default, no compression is applied to columns defined as the sort key or to columns of the BOOLEAN, REAL, or DOUBLE datatypes.

Byte-Dictionary

For each 1MB block on disk, a dictionary is created which maps the first 256 unique column values to a single byte. In the stored data, those values are replaced with the corresponding single byte. If there are more than 256 unique column values in a block, any unique values beyond the first 256 are stored raw. This encoding is primarily suited for columns containing a limited number of distinct character values and does not support BOOLEAN datatypes.
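
The byte-dictionary scheme described above can be illustrated with a minimal Python sketch. This is not Redshift's implementation: the function names and the tagged-tuple representation of encoded values are assumptions made for illustration, and a Python list stands in for a 1MB block.

```python
MAX_DICT_ENTRIES = 256  # one byte can address 256 distinct values


def byte_dict_encode(block):
    """Encode one block of column values (hypothetical helper).

    Returns (dictionary, encoded). Each encoded item is either
    ("idx", i), standing for a one-byte dictionary index, or
    ("raw", value) for values that did not fit in the dictionary.
    """
    dictionary = []  # index -> value
    index_of = {}    # value -> index
    encoded = []
    for value in block:
        # Only the first 256 unique values get a dictionary slot.
        if value not in index_of and len(dictionary) < MAX_DICT_ENTRIES:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        if value in index_of:
            encoded.append(("idx", index_of[value]))
        else:
            encoded.append(("raw", value))
    return dictionary, encoded


def byte_dict_decode(dictionary, encoded):
    """Reverse byte_dict_encode."""
    return [dictionary[p] if kind == "idx" else p for kind, p in encoded]
```

With a low-cardinality column such as country codes, every value becomes a one-byte index. Once a block holds more than 256 distinct values, the later ones fall back to raw storage.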

Delta

For each 1MB block on disk, each value is stored as the difference from the previous value in the series. Redshift supports two delta variations: DELTA (supports SMALLINT, INT, BIGINT, DATE, TIMESTAMP, DECIMAL), which stores the difference as a 1-byte value, and DELTA32K (supports INT, BIGINT, DATE, TIMESTAMP, DECIMAL), which stores the difference as a 2-byte value. Any difference too large to be represented in the delta width is stored raw along with a 1-byte flag.
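
A rough sketch of delta encoding with a raw fallback follows. The function names are hypothetical, and the default bound of 127 is an assumption chosen so that a difference fits in a signed byte. The on-disk flag byte is modeled only by the "raw" tag.

```python
def delta_encode(values, max_delta=127):
    """Encode a series as differences from the previous value.

    max_delta=127 models a 1-byte delta; a larger bound (for example
    32767) would model a 2-byte delta. Both bounds are assumptions of
    this sketch.
    """
    encoded = []
    previous = None
    for v in values:
        if previous is not None and -max_delta <= v - previous <= max_delta:
            encoded.append(("delta", v - previous))
        else:
            # The first value, or a jump too large for the delta width.
            encoded.append(("raw", v))
        previous = v
    return encoded


def delta_decode(encoded):
    """Rebuild the original series from raw values and deltas."""
    values = []
    for kind, payload in encoded:
        if kind == "raw":
            values.append(payload)
        else:
            values.append(values[-1] + payload)
    return values
```

For the series 1, 5, 50, 200, 210, only the first value and the jump from 50 to 200 are stored raw. The remaining values are stored as small differences.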

Mostly

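This encoding reduces storage by packing most values in a column into a width smaller than the column's declared type: 8 bits for MOSTLY8, 16 bits for MOSTLY16, and 32 bits for MOSTLY32. If a value cannot be compressed to the smaller width, the original raw value is stored. MOSTLY8 supports SMALLINT, INT, BIGINT, and DECIMAL. MOSTLY16 supports INT, BIGINT, and DECIMAL. MOSTLY32 supports BIGINT and DECIMAL.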
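
The idea behind the mostly encodings can be sketched as follows. The helper names are hypothetical, the signed range check is an assumption, and the size estimate ignores any per-value bookkeeping that a real format would need.

```python
def mostly_encode(values, packed_bytes):
    """Tag each value as "packed" if it fits in packed_bytes, else "raw".

    packed_bytes=1 loosely models MOSTLY8, 2 models MOSTLY16 and
    4 models MOSTLY32 (assumed signed ranges).
    """
    limit = 1 << (8 * packed_bytes - 1)
    return [("packed" if -limit <= v < limit else "raw", v) for v in values]


def estimated_size(encoded, packed_bytes, raw_bytes):
    """Approximate storage in bytes, ignoring flags and headers."""
    return sum(packed_bytes if kind == "packed" else raw_bytes
               for kind, _ in encoded)
```

For an INT column (4 raw bytes) where most values fit in one byte, the estimate drops well below 4 bytes per value. A column whose values mostly exceed the packed range gains nothing.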

Runlength

For each 1MB block on disk, consecutive repeated values are replaced with a token that indicates the value and the number of times it repeats. A separate dictionary of unique values is also created for each 1MB block. Runlength is supported for all datatypes.
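
A minimal run-length encoding sketch in Python shows the token idea. The function names are illustrative, and the per-block dictionary mentioned above is omitted for brevity.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal consecutive values into (value, count)."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, count) tokens back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]
```

The encoding pays off when equal values sit next to each other, which is why it tends to suit columns whose values repeat in long consecutive runs.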

Naive Block-level

Under this encoding, each block is compressed with a standard compression algorithm. The available choices include LZO and ZSTD. LZO is the default compression for all columns other than sort key columns and columns of type BOOLEAN, REAL, or DOUBLE.

Concurrency Control[13][14][15]

Multi-version Concurrency Control (MVCC)

Although not explicitly stated, Redshift utilizes multi-version concurrency control. In particular, transactions capture a snapshot of the latest committed version of the data at the time a SELECT query, DML statement, ALTER TABLE statement, CREATE TABLE statement, DROP TABLE statement, or TRUNCATE TABLE statement is executed. Redshift prevents write-write conflicts by forcing a transaction to obtain a table-level write lock and only allowing it to release all of its write locks when the transaction either commits or aborts. Furthermore, a VACUUM operation is required to remove all records marked for deletion and to perform any re-sorting that may be required.

However, Redshift does not offer deadlock prevention or deadlock detection. As such, Redshift warns users to schedule transaction operations in a way that would prevent any deadlocks from arising in the first place, such as by updating tables in the same order or taking locks in the same order.
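
The advice to take locks in the same order can be illustrated with a small client-side analogy in Python. This does not use any Redshift API. The `table_locks` dictionary and `run_transaction` helper are hypothetical stand-ins for table-level write locks, used only to show why a consistent ordering avoids deadlock.

```python
import threading
from contextlib import ExitStack

# Hypothetical stand-ins for table-level write locks.
table_locks = {
    "customers": threading.Lock(),
    "orders": threading.Lock(),
}


def run_transaction(tables, work):
    """Acquire locks for all tables in sorted name order, then run work().

    Because every transaction acquires locks in the same global order,
    no two transactions can each hold a lock the other is waiting for.
    Locks are released together when the block exits, mirroring release
    at commit or abort.
    """
    with ExitStack() as stack:
        for name in sorted(tables):
            stack.enter_context(table_locks[name])
        return work()
```

Two transactions that name the tables in opposite orders still lock `customers` before `orders`, so neither can block the other indefinitely.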

Data Model[16][17][18]

Relational

Redshift is a relational database rather than an object-relational one, even though it is built upon PostgreSQL. It does not support many of the features considered part of the "object-relational" definition, such as inheritance and the definition of custom structured types.

Foreign Keys[19]

Supported

Redshift supports declaring foreign keys but does not enforce the foreign key constraint. Redshift uses foreign keys as hints during the query planning and optimization stage. However, it does not spend computational resources to ensure that the constraint holds, relying instead on the application that produced the data to ensure that the foreign key constraint is satisfied.

Indexes[16][20]

Not Supported

Redshift does not support indexes. As such, Redshift does not suffer any computational overhead from the creation, maintenance, or concurrent use of index data structures.

Isolation Levels[21][22]

Serializable

Redshift only supports serializable isolation, which provides each transaction with the illusion that it is the only transaction operating on a table at a given time (utilizing table-level locks) and ensures that the end result is equivalent to some serial execution of the transactions. If a transaction executes an operation that would violate serializability, the violating transaction is aborted and rolled back.

Joins[23]

Nested Loop Join, Hash Join, Sort-Merge Join, Broadcast Join, Shuffle Join

The query planner and optimizer pick the best join algorithm and distributed join strategy available. The three join algorithms used by Redshift are nested loop join; hash join, which is used for inner and left/right outer joins; and merge join, which is used for inner and outer joins. Redshift only uses merge join if the join column is both the distribution key and the sort key and if less than 20% of the data in the two tables is unsorted.

In the event that the query planner needs to move data around, Redshift will either perform a redistribution (shuffle) or broadcast one side of the join to all other nodes. If redistribution is needed, Redshift may move table data between slices of a single node or between nodes, utilizing the distribution key if the distribution key is part of the join.
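
The two data-movement strategies can be compared with a simplified Python model in which each "node" is a list of rows. All names here are illustrative, and CRC32 stands in for whatever hash function the system actually uses. Either strategy should produce the same join result. They differ only in which rows move.

```python
import zlib
from collections import defaultdict


def node_for(value, num_nodes):
    """Deterministically map a join-key value to a node (assumed hash)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_nodes


def shuffle(rows, key, num_nodes):
    """Redistribute rows so equal keys land on the same node."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(row[key], num_nodes)].append(row)
    return nodes


def broadcast(rows, num_nodes):
    """Send a full copy of the rows to every node."""
    return [list(rows) for _ in range(num_nodes)]


def local_join(left, right, key):
    """Hash join performed locally on one node."""
    by_key = defaultdict(list)
    for r in right:
        by_key[r[key]].append(r)
    return [{**l, **r} for l in left for r in by_key.get(l[key], [])]


def distributed_join(left_nodes, right_rows, key, strategy):
    """Join a distributed left table with right_rows using a strategy."""
    num_nodes = len(left_nodes)
    if strategy == "broadcast":
        # Left rows stay in place; the right side is copied everywhere.
        right_nodes = broadcast(right_rows, num_nodes)
    else:
        # Both sides are repartitioned on the join key.
        all_left = [row for node in left_nodes for row in node]
        left_nodes = shuffle(all_left, key, num_nodes)
        right_nodes = shuffle(right_rows, key, num_nodes)
    result = []
    for left, right in zip(left_nodes, right_nodes):
        result.extend(local_join(left, right, key))
    return result
```

Broadcasting is attractive when one side is small, since the larger table never moves. Shuffling moves both sides but sends each row to only one node.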

Logging[24][25][26]

Logical Logging

Redshift provides logging both for audit purposes and for all operations executed by transactions on the system. In particular, Redshift logs the raw SQL statements that are executed by users and transactions in the system. Furthermore, with the auditing functionality built into Redshift, administrators can also track all the SQL statements executed by a specific user.

Query Compilation[27][28]

Code Generation

The query execution plan is generated at the leader node of a particular Redshift cluster. The leader (coordinator) node is responsible for evaluating all the possible execution plans and cost effectiveness of each plan. The leader node rewrites the query, generates compiled C++ code, and sends the compiled binaries to the compute nodes for execution.

Query Execution[29][30]

Materialized Model

Redshift utilizes the materialized query processing model, where each processing step emits the entire result at a time. The leader node is responsible for coordinating query execution on the compute nodes and stitching together the results of all the compute nodes into a final result that is returned to the user.

To improve performance, Redshift utilizes late materialization, where stitching together a row from its column values, using each value's block position, is delayed until later steps in the process. Furthermore, Redshift utilizes zone map optimization for its sequential scan, storing a min and max value in the header of each disk block to allow the executor to determine which blocks can be skipped.
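
The zone map optimization can be sketched in a few lines of Python. Blocks are modeled as plain lists and the function names are hypothetical. The point is only that a block whose min/max range excludes the search value never has to be read.

```python
def build_zone_map(blocks):
    """Record the (min, max) of each non-empty block, as a header would."""
    return [(min(block), max(block)) for block in blocks]


def scan_equal(blocks, zone_map, target):
    """Sequential scan for target that skips blocks ruled out by the zone map.

    Returns (matches, blocks_read) so the saving is visible.
    """
    matches = []
    blocks_read = 0
    for block, (lo, hi) in zip(blocks, zone_map):
        if target < lo or target > hi:
            continue  # target cannot be in this block; skip the read
        blocks_read += 1
        matches.extend(v for v in block if v == target)
    return matches, blocks_read
```

Zone maps prune best when the data is sorted on the filtered column, because the block ranges then overlap very little.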

Query Interface[31][32][33]

SQL

Redshift supports a majority of the standard DDL statements to create and define tables and DML statements to manipulate the data stored within the database. Furthermore, Redshift supports scalar User-Defined Functions that can be constructed either via a SQL SELECT clause or a Python program.

In addition, certain functions can only be executed on the leader node, primarily those that query the database schema. Redshift also implements various extensions to SQL, such as aggregate functions, string functions, and JSON functions, although care must be taken because many PostgreSQL features are unsupported in Redshift.

Storage Architecture[27][34]

Disk-oriented

Redshift uses disk storage. In the distributed system, all the data is stored at the compute node layer. Based on the particular distribution style elected for a particular table, the leader node will either duplicate the data across all the compute nodes or partition the data across all the compute nodes. Furthermore, each compute node will partition its data across its CPU slices in order to achieve maximum parallel computation. To handle node or slice failures, data is duplicated at other nodes in the system.

Storage Model[27][34]

Decomposition Storage Model (Columnar)

Redshift utilizes columnar storage as opposed to row storage. Instead of storing the entire row record together, Redshift stores the values of each table column together. This allows Redshift to pack data together and apply compression in order to minimize disk I/O during query execution. A row can be stitched together by utilizing the offset of a specific value.
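
The following Python sketch illustrates the columnar layout and offset-based row stitching described above. The three-column table and all function names are invented for the example. Real storage works on compressed disk blocks rather than Python lists.

```python
COLUMN_NAMES = ("id", "city", "amount")  # hypothetical table schema


def to_columns(rows):
    """Convert row tuples into one list of values per column."""
    return {name: [row[i] for row in rows]
            for i, name in enumerate(COLUMN_NAMES)}


def stitch_row(columns, offset):
    """Rebuild a full row by reading the same offset in every column."""
    return tuple(columns[name][offset] for name in COLUMN_NAMES)


def select_where(columns, filter_column, predicate):
    """Scan only the filter column, then stitch rows for matching offsets."""
    offsets = [i for i, value in enumerate(columns[filter_column])
               if predicate(value)]
    return [stitch_row(columns, i) for i in offsets]
```

A query that filters on `amount` touches only that column during the scan. The other columns are read just for the offsets that qualify.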

Storage Organization[35][36][37]

Heaps, Sorted Files

On tables that have no sort key specified and therefore remain unsorted, Redshift preserves the order in which the records were originally inserted. In this unsorted scenario, Redshift stores data in 1MB blocks on disk, and new records are simply appended to the end.

On tables with a sort key specified, Redshift stores the "sorted" portion of the data in sorted blocks on disk. Adding new data to the table, except in the special case listed below, results in the data being appended to the "unsorted" portion, which is only merged into the "sorted" portion upon a VACUUM. In addition, the COPY command automatically sorts the incoming data.

If all of the following conditions are met, adding data to a table preserves the sortedness of the data and does not require an additional VACUUM operation to sort the table:

* The sort column is NOT NULL and there is only 1 sort column
* Using COPY into an empty table or the table is 100% sorted
* The data can be appended in sort order to the end of any existing data
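
The sorted and unsorted regions described in this section can be modeled with a toy Python class. `SortedTable` and its methods are hypothetical and hold a single sort column in memory. They only illustrate how appended rows accumulate until a vacuum-style merge.

```python
import heapq


class SortedTable:
    """Toy model of a table with a sorted region and an unsorted tail."""

    def __init__(self):
        self.sorted_region = []
        self.unsorted_region = []

    def insert(self, value):
        # New rows are appended without disturbing the sorted region.
        self.unsorted_region.append(value)

    def unsorted_fraction(self):
        """Share of rows that currently sit in the unsorted region."""
        total = len(self.sorted_region) + len(self.unsorted_region)
        return len(self.unsorted_region) / total if total else 0.0

    def vacuum(self):
        # Sort the tail, then merge it into the sorted region.
        tail = sorted(self.unsorted_region)
        self.sorted_region = list(heapq.merge(self.sorted_region, tail))
        self.unsorted_region = []
```

After inserting 5, 1 and 3, the table is entirely unsorted. A call to `vacuum()` leaves the sorted region as 1, 3, 5, and later inserts start a new unsorted tail.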

Stored Procedures[16]

Not Supported

Stored procedures, from PostgreSQL, are not supported.

System Architecture[38][39][27][40]

Shared-Nothing

Redshift clusters take a two-tiered architecture approach. The leader node serves as the coordinator: it accepts client query requests, generates the query plan, dispatches query fragments to compute nodes, and coalesces the results from the compute nodes.

The compute nodes utilize a shared-nothing architecture, with each node having dedicated CPU, memory, and disk storage. The leader node is responsible for determining which nodes store what data and will only dispatch a query fragment to nodes that have data needed for the query. For performance reasons, each compute node assigns a portion of memory and disk to each CPU slice for parallel processing.

To ensure data availability during disk or node failure, Redshift utilizes synchronous replication to save redundant copies of the data on other nodes in the system. To provide further durability, Redshift provides complete automated backups to S3 which can then later be used to restore the entire database or a particular table to a cluster.

Views[41][42]

Virtual Views

Redshift supports virtual views. The contents of the view are not directly materialized and the query defining the view is rerun every time the view is being used in another query. DELETE or UPDATE statements cannot be used against the view.

Compatible Systems

PrestoDB

Citations

42 sources

[01] Cloud Data Warehouse – Amazon Redshift – AWS amazon.com Modified: 2026-07-15 Accessed: 2026-07-18
[02] Introduction to Amazon Redshift - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-05
[03] Amazon Redshift - Wikipedia wikipedia.org Modified: 2025-01-25 Accessed: 2026-06-04
[04] Amazon Redshift architecture - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[05] Amazon Redshift Leaves On-Premises Opening, Says ParAccel | InformationWeek informationweek.com Modified: 2026-06-07 Accessed: 2026-06-07
[06] Amazon Redshift: ParAccel in, costly appliances out | ZDNET zdnet.com Accessed: 2026-06-07
[07] Research | GigaOm gigaom.com Accessed: 2026-06-07
[08] Amazon Redshift – Now Broadly Available | AWS News Blog amazon.com Modified: 2026-03-09 Accessed: 2026-06-07
[09] Amazon launches RedShift for massive petabyte-scale data analysis in the cloud | VentureBeat venturebeat.com Dead link (check archive) Accessed: 2026-06-02
[10] ParAccel - Wikipedia wikipedia.org Modified: 2026-05-31 Accessed: 2026-06-04
[11] Compression encodings - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[12] Amazon Redshift Engineering’s Advanced Table Design Playbook: Compression Encodings | AWS Big Data Blog amazon.com Modified: 2026-03-09 Accessed: 2026-06-07
[13] Managing concurrent write operations - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[14] Write and read/write operations - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[15] VACUUM - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[16] Unsupported PostgreSQL features - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[17] Talend Blog: The latest on Talend products and company news | Talend talend.com Accessed: 2026-06-07
[18] Unsupported PostgreSQL data types - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[19] Table constraints - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[20] Top 10 performance tuning techniques for Amazon Redshift | AWS Big Data Blog amazon.com Modified: 2026-03-09 Accessed: 2026-06-07
[21] Isolation levels in Amazon Redshift - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[22] BEGIN - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[23] Creating and interpreting a query plan - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[24] Database audit logging - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[25] STL_QUERY - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[26] SVL_STATEMENTTEXT - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[27] https://dl.acm.org/citation.cfm?id=2723372.2742795 acm.org Accessed: 2026-05-23
[28] Query planning and execution workflow - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[29] Amazon Redshift Introduces Late Materialization for Faster Query Processing - AWS amazon.com Modified: 2026-05-27 Accessed: 2026-06-07
[30] Mapping the query plan to the query summary - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[31] Amazon Redshift and PostgreSQL - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[32] Amazon Redshift SQL - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[33] User-defined functions in Amazon Redshift - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[34] Columnar storage - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[35] Choose the best sort key - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[36] Forums | AWS re:Post repost.aws Accessed: 2026-06-07
[37] Load data in sort key order - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[38] Amazon Redshift now supports table level restore - AWS amazon.com Modified: 2026-05-27 Accessed: 2026-06-07
[39] Data warehouse system architecture - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[40] Amazon Redshift Engineering’s Advanced Table Design Playbook: Table Data Durability | AWS Big Data Blog amazon.com Modified: 2026-03-09 Accessed: 2026-06-07
[41] CREATE VIEW - Amazon Redshift amazon.com Modified: 2026-06-05 Accessed: 2026-06-07
[42] https://www.blendo.co/amazon-redshift-guide-data-analyst/data-modeling-table-design/table-views/ blendo.co Dead link (check archive) Accessed: 2026-05-23

Revision #10 Last Updated: 2018-12-12 00:49