Delta Lake

Revision #8 from 2026-06-16 10:15

Delta Lake is an open-source storage layer for big data workloads. It provides ACID transactions for batch and streaming pipelines that read and write data concurrently. Developed by Databricks, it is highly compatible with the Apache Spark APIs and can be layered on top of AWS S3, Azure Data Lake Storage, or HDFS.[01][05]

Website: https://delta.io[01]
Source Code: https://github.com/delta-io/delta[02] Accessed: Jun 24, 2026 Last Commit: Jun 24, 2026
Tech Docs: https://docs.delta.io/index.html[03]
Twitter: @DeltaLakeOSS
Developer: Databricks, Inc.
Country of Origin: US
Start Year: 2019 [21]
Coding Agents: Claude [22], Cursor [23]
Project Types: Commercial, Open Source
Written in: Scala
Supported Languages: Java, Python, Scala, SQL
Compatible With: Spark SQL
License: Apache v2
Wikipedia: https://en.wikipedia.org/wiki/Delta_Lake_(Analytics)[04]

Database Entry

AI-Assisted

History[01]

Delta Lake was developed by Databricks and released in 2019 with the aim of building a simple data pipeline that unifies batch and streaming workloads. Databricks presents its "Delta architecture" as going beyond the Lambda architecture to address the problems and bottlenecks of data flow systems.

Checkpoints[06]

Non-Blocking, Consistent

Every change to a Delta Lake table is stored as an ordered, atomic commit in the transaction log, and each commit produces a JSON file. Every 10 commits, Delta Lake automatically writes a checkpoint that combines the preceding JSON files into a single Parquet file.

Delta Lake assigns each JSON file in the transaction log an increasing sequence number. It takes a checkpoint asynchronously by obtaining the current sequence number atomically and scanning only the commit files up to that number. New writes go into a new commit file with a higher sequence number, so checkpointing does not block writers.
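
The checkpointing scheme described above can be sketched with a toy in-memory log. This is a simplified model under stated assumptions, not Delta Lake's implementation: the class `ToyLog`, the `{"op": ..., "path": ...}` action shape, and the synchronous checkpoint call are all hypothetical, and the checkpoint is kept as a Python set rather than a Parquet file. It only illustrates why a reader needs to replay just the commits that follow the latest checkpoint.

```python
import json

CHECKPOINT_INTERVAL = 10  # the interval described in the text above


class ToyLog:
    """Hypothetical in-memory stand-in for a table's transaction log."""

    def __init__(self):
        self.commits = {}      # version -> JSON string, one "file" per commit
        self.checkpoints = {}  # version -> frozen set of live file paths

    def commit(self, actions):
        version = len(self.commits)  # next sequence number
        self.commits[version] = json.dumps(actions)
        if version > 0 and version % CHECKPOINT_INTERVAL == 0:
            # The real system does this asynchronously; here it is inline.
            self.checkpoint(version)
        return version

    def checkpoint(self, upto):
        # Fold only commits with sequence number <= upto.
        state, _ = self.snapshot(upto)
        self.checkpoints[upto] = frozenset(state)

    def snapshot(self, version=None):
        """Return (live file paths, number of JSON commits replayed)."""
        if version is None:
            version = len(self.commits) - 1
        base = max((v for v in self.checkpoints if v <= version), default=None)
        state = set(self.checkpoints[base]) if base is not None else set()
        start = 0 if base is None else base + 1
        for v in range(start, version + 1):
            for action in json.loads(self.commits[v]):
                if action["op"] == "add":
                    state.add(action["path"])
                elif action["op"] == "remove":
                    state.discard(action["path"])
        return state, version + 1 - start


log = ToyLog()
for i in range(25):  # versions 0..24
    log.commit([{"op": "add", "path": f"part-{i}.parquet"}])

files, replayed = log.snapshot()
print(sorted(log.checkpoints), len(files), replayed)
```

In this sketch, checkpoints exist at versions 10 and 20, so reading the latest state replays only the four commits after version 20 instead of all 25.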

Compression[07][08][09]

Dictionary Encoding, Run-Length Encoding, Bit Packing / Mostly Encoding

Delta Lake stores data in Apache Parquet format, so it can use the efficient compression and encoding schemes native to Parquet. Users can also specify whether cached data is stored in a compressed format.
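
To give a feel for two of the encodings listed above, here is a minimal sketch of dictionary encoding followed by run-length encoding on a single column. It is illustrative only: the function names are hypothetical, and Parquet's actual on-disk layout (hybrid RLE/bit-packing, page structure) is considerably more involved.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code into a dictionary."""
    dictionary, index, codes = [], {}, []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return [tuple(run) for run in runs]


def decode(dictionary, runs):
    """Invert both steps to recover the original column."""
    return [dictionary[code] for code, length in runs for _ in range(length)]


column = ["US", "US", "US", "DE", "DE", "US"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)
print(dictionary, runs)
```

Low-cardinality columns with long runs of repeated values benefit most, which is one reason columnar layouts compress well.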

Concurrency Control[10][11][06]

Multi-version Concurrency Control (MVCC), Optimistic Concurrency Control (OCC)

Delta Lake supports table-level ACID transactions. Specifically, it provides serializable writes via Optimistic Concurrency Control and snapshot isolation for reads via MVCC: it maintains multiple versions of the table metadata and does not remove old data files from disk until users run VACUUM. Delta Lake does not support multi-table transactions.
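
The interplay of optimistic writes and snapshot reads can be sketched as follows. This is a toy model, not Delta Lake's code: `ToyTable`, `try_commit`, and `commit_with_retry` are hypothetical names, the version check stands in for an atomic "create the next commit file only if it does not exist" operation, and the retry loop simply tries again, whereas a real implementation would first check whether the intervening commits logically conflict with the pending write.

```python
class CommitConflict(Exception):
    """Raised when another writer already claimed the target version."""


class ToyTable:
    def __init__(self):
        self.log = []  # log[v] holds the file paths added by version v

    def latest_version(self):
        return len(self.log) - 1  # -1 means the table has no commits yet

    def try_commit(self, read_version, added_files):
        # Stand-in for atomically creating commit file read_version + 1.
        if self.latest_version() != read_version:
            raise CommitConflict(f"version {read_version + 1} already exists")
        self.log.append(list(added_files))
        return self.latest_version()

    def snapshot(self, version):
        # A reader pinned to `version` sees only commits 0..version.
        return [path for commit in self.log[: version + 1] for path in commit]


def commit_with_retry(table, make_files, max_attempts=3):
    for _ in range(max_attempts):
        read_version = table.latest_version()
        files = make_files(read_version)
        try:
            return table.try_commit(read_version, files)
        except CommitConflict:
            continue  # re-read the latest version and try again
    raise CommitConflict("gave up after repeated conflicts")


table = ToyTable()
table.try_commit(-1, ["a.parquet"])      # becomes version 0
reader_version = table.latest_version()  # a reader pins version 0

attempts = []

def writer_b(read_version):
    if not attempts:
        # Simulate a rival writer committing while writer B is preparing.
        table.try_commit(read_version, ["rival.parquet"])
    attempts.append(read_version)
    return ["b.parquet"]

committed = commit_with_retry(table, writer_b)
print(committed, attempts, table.snapshot(reader_version))
```

Writer B loses the race for version 1, retries, and lands at version 2, while the reader pinned to version 0 continues to see only `a.parquet`.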

Data Model[12][13]

Column Family / Wide-Column

Delta Lake is a storage layer that adopts column-oriented storage by embedding Parquet, regardless of the data model chosen by the layer above. Because a DataFrame in Apache Spark carries a schema, Delta Lake supports both schema enforcement and schema evolution.
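
The difference between schema enforcement and schema evolution can be illustrated with a small check performed at write time. This is a sketch under assumptions: schemas are modeled as plain `{column: type_name}` dictionaries, `check_write` is a hypothetical helper, and the `merge_schema` flag is only loosely modeled on Delta Lake's `mergeSchema` write option. Real type-widening and nested-field rules are not covered.

```python
def check_write(table_schema, batch_schema, merge_schema=False):
    """Return the table schema after the write, or raise if it is rejected."""
    mismatched = [
        column
        for column, type_name in batch_schema.items()
        if column in table_schema and table_schema[column] != type_name
    ]
    if mismatched:
        raise ValueError(f"type mismatch for columns: {mismatched}")

    new_columns = {
        column: type_name
        for column, type_name in batch_schema.items()
        if column not in table_schema
    }
    if new_columns and not merge_schema:
        # Enforcement: unknown columns are rejected by default.
        raise ValueError(f"unexpected columns: {sorted(new_columns)}")

    # Evolution: new columns are appended to the table schema.
    return {**table_schema, **new_columns}


table_schema = {"id": "long", "name": "string"}
batch_schema = {"id": "long", "name": "string", "country": "string"}

evolved = check_write(table_schema, batch_schema, merge_schema=True)
print(evolved)
```

Without `merge_schema=True`, the same write is rejected because `country` is not part of the table schema.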

Foreign Keys[11]

Not Supported

Delta Lake does not support foreign keys.

Indexes[14][15]

Hash Table

Delta Lake keeps table-level index information as key/value metadata in JSON. To speed up reads, Delta Lake can partition a table by low-cardinality columns into separate files and locate data through the column and partition metadata. For high-cardinality columns, users can specify Z-ordering (multi-dimensional clustering) instead.
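
Both ideas can be sketched in a few lines. The first function prunes files using partition values stored in metadata; the second computes a Z-order (Morton) value by interleaving the bits of two integer columns, which is the general technique behind multi-dimensional clustering. The file records, function names, and the two-column, 8-bit restriction are assumptions for illustration and do not reflect how Delta Lake implements either feature internally.

```python
def prune_by_partition(files, column, value):
    """Keep only files whose partition metadata matches the predicate."""
    return [f["path"] for f in files if f["partition"].get(column) == value]


def z_value(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into one Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x occupies even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y occupies odd bit positions
    return z


files = [
    {"path": "date=2026-06-01/part-0.parquet", "partition": {"date": "2026-06-01"}},
    {"path": "date=2026-06-01/part-1.parquet", "partition": {"date": "2026-06-01"}},
    {"path": "date=2026-06-02/part-0.parquet", "partition": {"date": "2026-06-02"}},
]
selected = prune_by_partition(files, "date", "2026-06-02")

# Sorting rows by z_value keeps points that are close in both
# dimensions near each other in the resulting file order.
points = [(0, 0), (1, 1), (0, 1), (1, 0)]
ordered = sorted(points, key=lambda p: z_value(*p))
print(selected, ordered)
```

Partition pruning works well when a column has few distinct values; Z-ordering instead clusters rows so that range filters on several high-cardinality columns touch fewer files.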

Isolation Levels[10][16]

Serializable, Snapshot Isolation

Delta Lake provides serializable writes and snapshot isolation for reads. Delta Lake on Azure Databricks also supports Serializable, the strongest isolation level.

Joins[17][18]

Not Supported

Delta Lake does not support joins natively, but it exposes parameters and hints for range joins and skew joins that the upper layer can use for tuning. Delta Lake does support MERGE and uses partition pruning to optimize it.

Logging[10][06]

Physiological Logging

Delta Lake records operations in its transaction log, which is stored directly on disk. Writers are never allowed to overwrite an existing log file; to delete a data file, a writer appends a log entry containing a tombstone. The transaction log records the actions applied to a table along with the paths of the files that make up the table.
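
A minimal sketch of an append-only log with tombstones follows. The `add`/`remove` action names, `path`, and `deletionTimestamp` follow the general shape of the Delta transaction protocol, but everything else is a simplifying assumption: `ToyLogStore` and `vacuum_candidates` are hypothetical, timestamps are plain integers, and many action fields are omitted.

```python
class ToyLogStore:
    """Hypothetical append-only store: one immutable entry per version."""

    def __init__(self):
        self._entries = {}  # version -> list of action dicts

    def write(self, version, actions):
        # Existing log entries are never overwritten.
        if version in self._entries:
            raise FileExistsError(f"commit {version} already exists")
        self._entries[version] = list(actions)

    def replay(self):
        """Fold the log in order into (live files, tombstones)."""
        live, tombstones = set(), {}
        for version in sorted(self._entries):
            for action in self._entries[version]:
                if "add" in action:
                    path = action["add"]["path"]
                    live.add(path)
                    tombstones.pop(path, None)
                elif "remove" in action:
                    path = action["remove"]["path"]
                    live.discard(path)
                    tombstones[path] = action["remove"]["deletionTimestamp"]
        return live, tombstones


def vacuum_candidates(tombstones, now, retention):
    """Files whose tombstone is older than the retention window."""
    return sorted(p for p, ts in tombstones.items() if now - ts > retention)


store = ToyLogStore()
store.write(0, [{"add": {"path": "a.parquet"}}, {"add": {"path": "b.parquet"}}])
store.write(1, [{"remove": {"path": "a.parquet", "deletionTimestamp": 100}}])

live, tombstones = store.replay()
print(live, tombstones, vacuum_candidates(tombstones, now=250, retention=100))
```

The removed file stays on disk as long as its tombstone is inside the retention window, which is what lets older snapshots remain readable until the file is physically deleted.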

Query Compilation

Not Supported

Storage Architecture[19]

Disk-oriented

Delta Lake is disk-oriented. Even the Delta cache can be kept on disk with little performance penalty, given the high read speeds of modern SSDs. In contrast, the Spark cache uses memory.

Storage Format

Parquet

Storage Model[12]

Decomposition Storage Model (Columnar)

Delta Lake uses versioned Parquet files, a columnar storage format.

System Architecture[20]

Shared-Disk

Delta Lake provides a storage layer on top of AWS S3, Azure Data Lake Storage, or HDFS, while Spark serves as a scalable compute engine for batch and streaming workloads. The storage layer and the compute layer are decoupled.

Citations

23 sources

[01] Home | Delta Lake delta.io Accessed: 2026-06-04
[02] GitHub - delta-io/delta: An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs · GitHub github.com Accessed: 2026-06-04
[03] Welcome to the Delta Lake documentation | Delta Lake delta.io Accessed: 2026-06-05
[04] https://en.wikipedia.org/wiki/Delta_Lake_(Analytics) wikipedia.org (dead link) Accessed: 2026-06-04
[05] https://databricks.com/product/delta-lake-on-databricks?gclid=CjwKCAiA8ejuBRAaEiwAn-iJ3gOU0cZhij3lUhcKL113WpEKrEtm42_SeAcJpmR8NbVhPaICnEnUZBoCL1wQAvD_BwE&utm_campaign=6448987615&utm_content=delta&utm_medium=cpc&utm_offer=delta-lake-on-databricks&utm_source=google databricks.com (dead link) Accessed: 2026-05-29
[06] Understanding the Delta Lake Transaction Log - Databricks Blog databricks.com Modified: 2026-06-05 Accessed: 2026-06-07
[07] Lakehouse Storage | Databricks databricks.com Modified: 2026-06-05 Accessed: 2026-06-07
[08] Apache Parquet - Wikipedia wikipedia.org Modified: 2026-04-18 Accessed: 2026-06-04
[09] https://docs.azuredatabricks.net/delta/optimizations/delta-cache.html#configure-disk-usage azuredatabricks.net (dead link) Accessed: 2026-05-29
[10] delta/PROTOCOL.md at master · delta-io/delta · GitHub github.com Accessed: 2026-05-29
[11] https://docs.delta.io/delta-faq delta.io Accessed: 2026-06-07
[12] What is Delta Lake in Databricks? | Databricks on AWS databricks.com Modified: 2026-06-05 Accessed: 2026-06-07
[13] Schema Evolution & Enforcement on Delta Lake - Databricks databricks.com Modified: 2026-06-05 Accessed: 2026-06-07
[14] Best practices: Delta Lake | Databricks on AWS databricks.com Modified: 2026-06-05 Accessed: 2026-06-07
[15] What is Delta Lake in Databricks? | Databricks on AWS databricks.com Modified: 2026-06-05 Accessed: 2026-06-07
[16] Isolation levels and write conflicts - Azure Databricks | Microsoft Learn microsoft.com Modified: 2026-03-09 Accessed: 2026-06-07
[17] Optimization recommendations on Databricks | Databricks on AWS databricks.com Modified: 2026-06-05 Accessed: 2026-06-07
[18] How to improve performance of Delta Lake MERGE INTO queries using partition pruning - Databricks databricks.com Accessed: 2026-06-07
[19] Optimize performance with caching on Azure Databricks - Azure Databricks | Microsoft Learn microsoft.com Modified: 2026-01-17 Accessed: 2026-06-07
[20] Lakehouse Storage | Databricks databricks.com Modified: 2026-06-05 Accessed: 2026-06-07
[21] Project import generated by Copybara. GitOrigin-RevId: 62fbd592bead1a3592c258a3191e3e603b026377 github.com Modified: 2019-04-22 Accessed: 2026-05-27
[22] https://github.com/delta-io/delta/commit/89681ebe340ddb39a9fff20c98ac1febdb2f2e14 github.com Modified: 2026-04-29 Accessed: 2026-06-25
[23] https://github.com/delta-io/delta/commit/e6e525e5ada8ee0c5f20e007c0b48d84821ad2f5 github.com Modified: 2026-03-10 Accessed: 2026-06-25

Revision #8 Last Updated: 2026-06-16 06:15