Materialize builds on top of differential dataflow, which is a differential computation framework for timely dataflow, introduced in Naiad system from Microsoft Research.
Before the v0.27 release, Materialize mainly supports in-memory computation. Materialized views are only maintained in memory. It requires the data source to persist and replay the changes after restarting Materialize. On October 12, 2022, Materialize 0.26 LTS entered long-term-support mode and 0.27 debuted as the cloud-native release of Materialize. It added support for source data persistence, materialized views with data replicated and persisted within cluster, and other changes for the cloud platform.
Materialize is licensed under BSL license. Each version of Materialize will be transformed to Apache 2.0 License 4 years after its release.
Differential dataflow stores changes of a sequence of data (collection) by key and time in an in-memory index called arrangement, so that operators can query the changes of value for a given key in a period of time.
The index is maintained in-memory and provides multi-version access to the data.
Materialize supports strong consistency stream processing, and two isolation levels: serializable and strict serializable.
As a streaming system, Materialize ingests data with timestamps and do not directly handle transactions. Upstream systems like Postgres CDC will send data in transition commit order. Data are processed in order of the row's ingestion timestamp, or virtual time.
Materialize supports strong consistency in terms of consistency in streaming processing, which means the result stored in materialized views should be the same as executing a query over the collection of the source data at a given time. In contrary, an eventual consistency streaming system may produce a materialized view which does not correspond to a query over the source data collection at a given time.
If a user creates non-streaming source (i.e., a table) in Materialize, strict serializable isolation level guarantees reads from materialized views always reflect the insertions, by waiting the changes to propagate to all materialized views. Otherwise, serializable level only guarantees a user can read a consistent snapshot over all materialized views.
Differential dataflow only supports index joins. If a user requests to join two tables, differential dataflow will first create an index (arrangement) over the two tables using join columns as the index key, and then perform index joins.
Materialize supports joins by using index joins in differential dataflow.
If user has already created all required indexes for a multi-way join in advance, Materialize will use a special kind of join strategy called delta join to execute the query. Delta join supports joining multiple tables, does not maintain any additional data structures and therefore achieves no additional storage and memory cost.
Intra-Operator (Horizontal) Inter-Operator (Vertical)
Timely dataflow supports running different operators in the same query plan on different workers, therefore achieving inter query parallelism. At the same time, it also shards a logical operator to multiple physical ones, so as to operate on independent parts of the data concurrently.
Materialize supports Postgres SQL with custom APIs. For example, CREATE CONNECTION, CREATE SOURCE and CREATE SINK are used to connect Materialize to external sources and sinks (i.e., Kafka, Postgres CDC). SUBSCRIBE allows users to retrieve streaming updates from the Postgres wire protocol.
Materialize supports storing data to different storage backends, including S3, filesystem, and in-memory backend (for testing).
Materialize persists data in the form of key-value pairs.
As of Materialize 0.48, the storage component is only used for persisting sources and materialized views. Materialize has a distributed key-value storage service, and the compute workers can read data from and write data to the storage service. Indexes and internal states are only maintained in memory.
Virtual Views Materialized Views
Materialize supports two kinds of views: temporary views and materialized views.
Users can create temporary views by using
CREATE VIEW statement. It simply stores the SQL query in the system and can be used in other queries within the session.
Users can create materialized views by using
CREATE MATERIALIZED VIEW. The views are incrementally maintained and reflect real-time updates. The content of materialized views are persisted into the storage service and users can directly query the materialized views to get latest data. Users can also subscribe materialized views to get real-time updates or sink the data to third-party services.