etcd

etcd is a distributed key-value store which is highly available, strongly consistent, and watchable for changes. The name "etcd" was from a unix's configuration directory, "etc" and "d"istributed system. There are two major use cases: concurrency control in the distributed system and application configuration store. For example, CoreOS Container Linux uses etcd to achieve a global semaphore which avoids that all nodes in the cluster go to reboot at the same time. Also, Kubernetes use etcd for their configuration store.

History

CoreOS released etcd in 2013. Originally, the etcd was developed to manage a cluster of CoreOS Container Linux. In 2014, Google launched the Kubernetes project and used etcd for their configuration store. In 2016, CoreOS announced etcd3 and changed their data structure from a tree model to a flat key space. In 2018, RedHat announced to acquire CoreOS, and IBM announced to acquire RedHat in the same year.

Checkpoints

Blocking

etcd provides a snapshot to improve the recovery speed and save disk space by removing old logs. The etcd automatically creates a snapshot based on the number of committed transactions from the last snapshot, which is configurable, while a user can create the snapshot anytime via `etcdctl` command. The etcd acquires a global latch to produce a snapshot, so the high frequency for taking the snapshot degrades the performance of the database operation.

Concurrency Control

Multi-version Concurrency Control (MVCC)

etcd uses MVCC for the concurrency control. The etcd uses `revision` which corresponds to a version of MVCC. The revision works like a logical clock in the etcd cluster and is updated when any data modification happens. Each key-value contains two revisions which respectively represents the time when the key-value was created and the time when the key-value was updated. The etcd cluster maintains the current revision. When the mutative operation has arrived (e.g., `Put`, `DeleteRange`, `Txn`), the etcd assigns the revision to the data related to the operation and updates the current revision.

Data Model

Key/Value

Data Model of etcd is a key-value pair and both a key and a value are represented as a binary array. There is no fixed size limit for the key and the value, but since there is the limit for the request size (in default, 1.5 MiB), the acceptable size of the key and the value is determined by the limit. In addition to a key and a value, each data has following metadata: `create_revision`, `mod_revision`, `version`, and `lease`. The `creation_revision` stands for the data creation time, and the `mod_revision` stand for the data update time. A revision works like a global counter which is incremented when any data is changed, while a version works like a local counter which is incremented when the data is changed. The older version of the data can be retrieved by specifying the revision unless the version is not compacted. The `lease` is used for the data which has a specific lifetime, and after the lease time is reached, the data will be removed and not be accessible.

Indexes

B+Tree

etcd holds a secondary index on memory and provides getting and deleting key-value pairs within a range key. For the index, etcd uses btree whose key is a key of a key-value pair and whose value is a physical data location.

Isolation Levels

Serializable

etcd provides the serializable isolation by MVCC. Since each data contains a revision, the etcd aborts or reties a transaction which contains an older revision than the revision the etcd holds. For example, an etcd client 1 started a transaction and got `{"a": 1}` with a revision 1. After that, the etcd cluster updated the data with a revision 2 (`{"a": 2}`) due to the client 2's request. When the client 1 requests to update the data, the etcd cluster aborts or retries the request because the client 1 tries to modify the data which has an older revision than the cluster has.

Query Compilation

Not Supported

Query Interface

RPC

etcd has own interface and accepts a request via a gRPC remote procedure call. The interface supports the following operations: `Range`, `Put`, `DeleteRange`, `Txn`, `Compact`. The `Range` operation gets a key-value pair or key-value pairs within a range key. It supports a travel query, so the older version data is retrieved unless it is not compacted. The `Put` operation creates or updates the data. The `DeleteRange` operation removes a key-value pair or key-value pairs within a range key. The `Txn` operation provides an atomic operation for multiple operations. The `Compact` operation removes the older version of data up to a given revision.

Storage Architecture

Disk-oriented

etcd uses disk-oriented architecture and stores log files and snapshots on the disk. If the etcd cluster crashs or stops due to a failure, the cluster can restore the data from the log files and snapshots and restart the service.

Storage Model

N-ary Storage Model (Row/Record)

etcd physically stores data as a key-value pair. The key of the pair contains the revision, while the value of the pair contains a delta from a previous version.

Storage Organization

Log-structured Sorted Files

etcd stores a request in a log file first (write-ahead log) and periodically creates a snapshot file to avoid growing the log file. The snapshot file contains key-value pairs organized by the b+ tree structure, and the file is sorted by the order of keys.

Stored Procedures

Not Supported

System Architecture

Shared-Nothing

The etcd cluster is composed of shared-nothing nodes. The cluster has one leader node, and other nodes work as followers. The leader node is determined at run-time (Raft algorithm). When the leader node receives a request, the leader takes votes against followers. If the majority of nodes agrees on the request, the leader commits the request and ask followers to commit. An etcd client can send a request to any node in the cluster. A follower will forward the request to a leader node if the client sends a request to the follower.