etcd is a distributed key-value store that is highly available, strongly consistent, and watchable for changes. The name "etcd" combines "etc", the Unix configuration directory, with "d" for distributed systems.

etcd has two major use cases: concurrency control in a distributed system and storing application configuration. For example, CoreOS Container Linux uses etcd to implement a global semaphore that prevents all nodes in a cluster from rebooting at the same time, and Kubernetes uses etcd as its configuration store.


CoreOS released etcd in 2013. It was originally developed to manage clusters of CoreOS Container Linux. In 2014, Google launched the Kubernetes project, which uses etcd as its configuration store. In 2016, CoreOS announced etcd3, which changed the data model from a tree to a flat key space. In 2018, Red Hat announced its acquisition of CoreOS, and IBM announced its acquisition of Red Hat later the same year.


Command Logging

etcd appends committed commands to its log in the order determined by the Raft algorithm. Because etcd uses gRPC for its query interface, the commands recorded in the log are gRPC requests.

System Architecture


An etcd cluster is composed of shared-nothing nodes. One node acts as the leader and the others act as followers; the roles are determined at run time by the Raft algorithm. When the leader receives a request, it proposes the request to all followers and collects their responses. Once a majority of nodes agree on the request, the leader commits it and asks the followers to commit as well. A client does not need to know which node is the leader: it can send a request to any node in the cluster, and a follower forwards the request to the leader.
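The majority rule described above can be illustrated with a little arithmetic. This is a minimal sketch, not etcd's Raft implementation; the function names `quorum_size`, `is_committed`, and `tolerated_failures` are hypothetical and exist only for this example.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority of the cluster."""
    return cluster_size // 2 + 1


def is_committed(acks: int, cluster_size: int) -> bool:
    """True once enough nodes (the leader included) have accepted a request."""
    return acks >= quorum_size(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """Number of nodes that can fail while a majority is still reachable."""
    return cluster_size - quorum_size(cluster_size)
```

Under this rule a five-node cluster commits a request after three nodes accept it and keeps working with two nodes down, while a four-node cluster still needs three nodes and tolerates only one failure.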

Storage Model

N-ary Storage Model (Row/Record)

etcd physically stores data as key-value pairs. The key is a 3-tuple: major, sub, and type. Major holds the revision, a counter that is incremented each time a data modification is requested. Sub identifies a key within a revision, because a single transaction may produce one revision that covers multiple keys. Type is optional; one use is to mark a tombstone. The value contains a delta from the previous version.
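One way to picture the 3-tuple key is as a fixed-width byte string. The sketch below is illustrative only: the byte layout (two 8-byte big-endian integers joined by an underscore, plus a one-byte `t` marker for tombstones) is an assumption made for this example, and the function names are hypothetical.

```python
import struct

TOMBSTONE_MARK = b"t"  # assumed one-byte marker for the optional "type" field


def encode_revision_key(major: int, sub: int, tombstone: bool = False) -> bytes:
    """Pack (major, sub, type) into bytes.

    Big-endian unsigned integers are used so that comparing the encoded
    keys byte by byte gives the same order as comparing the revisions.
    """
    key = struct.pack(">Q", major) + b"_" + struct.pack(">Q", sub)
    return key + TOMBSTONE_MARK if tombstone else key


def decode_revision_key(key: bytes):
    """Unpack bytes produced by encode_revision_key into (major, sub, tombstone)."""
    major = struct.unpack(">Q", key[0:8])[0]
    sub = struct.unpack(">Q", key[9:17])[0]
    tombstone = len(key) == 18 and key[17:18] == TOMBSTONE_MARK
    return major, sub, tombstone
```

With an encoding like this, a store sorted by key keeps all entries in revision order, and the keys written by one transaction sit next to each other, distinguished by their sub values.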

Query Compilation

Not Supported

Isolation Levels


etcd provides serializable isolation with MVCC. Because every piece of data carries a revision, etcd aborts or retries a transaction that holds an older revision than the one stored with the data.

For example, client 1 starts a transaction and reads {"a": 1} at revision 1. Client 2 then updates the data, and the cluster stores {"a": 2} at revision 2. When client 1 requests an update, the cluster aborts or retries the request because client 1 is trying to modify data based on an older revision than the one the cluster holds.
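The scenario can be replayed against a toy in-memory store. This is a sketch of revision-checked writes in general, not etcd's transaction API; `ToyStore` and `put_if_unchanged` are hypothetical names, and the store runs in a single process with no networking or Raft.

```python
class ToyStore:
    """Single-process stand-in for the cluster (illustration only)."""

    def __init__(self):
        self.revision = 0
        self.data = {}  # key -> (value, revision of the last modification)

    def get(self, key):
        return self.data.get(key, (None, 0))

    def put(self, key, value):
        self.revision += 1
        self.data[key] = (value, self.revision)
        return self.revision

    def put_if_unchanged(self, key, value, expected_revision):
        """Write only if the key still has the revision the caller read."""
        _, current = self.get(key)
        if current != expected_revision:
            return False  # stale read: the caller must abort or retry
        self.put(key, value)
        return True


store = ToyStore()
store.put("a", 1)                         # revision 1: {"a": 1}
value, seen = store.get("a")              # client 1 reads at revision 1
store.put("a", 2)                         # client 2 writes, revision 2
first_try = store.put_if_unchanged("a", value + 10, seen)   # rejected
value, seen = store.get("a")              # client 1 re-reads at revision 2
second_try = store.put_if_unchanged("a", value + 10, seen)  # accepted
```

The first attempt fails because client 1 read revision 1 while the store already holds revision 2. After re-reading, the retry succeeds and the store moves to revision 3.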

Concurrency Control

Multi-version Concurrency Control (MVCC)

etcd uses MVCC for concurrency control. A revision in etcd corresponds to a version in MVCC, and each key-value pair carries two revisions: one recording when the pair was created and one recording when it was last updated. The cluster maintains the current revision. When a mutating operation arrives (e.g., Put, Delete, Txn), etcd increments the current revision and assigns it to the data affected by the operation.
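The bookkeeping can be sketched as follows. This is a simplified model, assuming one revision per mutating operation and ignoring persistence and history; `RevisionedStore` and `txn_put` are hypothetical names and not part of etcd.

```python
from dataclasses import dataclass


@dataclass
class KeyValue:
    value: bytes
    create_revision: int  # revision at which the key was created
    mod_revision: int     # revision of the most recent update


class RevisionedStore:
    def __init__(self):
        self.current_revision = 0
        self.kvs = {}

    def txn_put(self, items):
        """Apply several puts as one operation; they all share one new revision."""
        self.current_revision += 1
        rev = self.current_revision
        for key, value in items.items():
            old = self.kvs.get(key)
            created = old.create_revision if old else rev
            self.kvs[key] = KeyValue(value, created, rev)
        return rev

    def put(self, key, value):
        return self.txn_put({key: value})

    def delete(self, key):
        """Simplification: deleting a missing key leaves the revision unchanged."""
        if key not in self.kvs:
            return self.current_revision
        self.current_revision += 1
        del self.kvs[key]
        return self.current_revision
```

A key that is deleted and later written again starts over with a new creation revision, which is how a reader can tell a re-created key from one that has existed continuously.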

Storage Architecture


etcd stores key-value pairs on persistent disk in a B+tree sorted by key.

Data Model


The data model of etcd is a key-value pair, where both the key and the value are byte arrays. Neither has a fixed size limit of its own, but requests are limited in size (1.5 MiB by default), so the request size limit bounds the acceptable size of a key and its value.

In addition to the key and the value, each entry has the following metadata: create_revision, mod_revision, version, and lease. The create_revision records the revision at which the entry was created, and the mod_revision records the revision at which it was last updated. The revision works like a global counter that is incremented whenever any data changes, while the version works like a per-key counter that is incremented when that key changes. An older version of the data can be retrieved by specifying its revision, unless that revision has been compacted. The lease gives data a specific lifetime: after the lease expires, the data is removed and is no longer accessible.
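The difference between the global revision and the per-key version, and the effect of compaction on historical reads, can be shown with a small model. This is a sketch under simplifying assumptions (in-memory history lists, no leases, no deletes); `HistoryStore` and its methods are hypothetical names.

```python
class HistoryStore:
    def __init__(self):
        self.revision = 0    # global counter, bumped by every change
        self.compacted = 0   # reads below this revision are refused
        self.history = {}    # key -> list of (mod_revision, version, value)

    def put(self, key, value):
        self.revision += 1
        versions = self.history.setdefault(key, [])
        version = versions[-1][1] + 1 if versions else 1  # per-key counter
        versions.append((self.revision, version, value))

    def get(self, key, revision=None):
        """Return (value, mod_revision, version) as of the given revision."""
        rev = self.revision if revision is None else revision
        if rev < self.compacted:
            raise ValueError("requested revision has been compacted")
        found = None
        for mod_rev, version, value in self.history.get(key, []):
            if mod_rev <= rev:
                found = (value, mod_rev, version)
        return found

    def compact(self, revision):
        """Drop history that is only needed for reads below `revision`."""
        self.compacted = revision
        for key, versions in self.history.items():
            older = [v for v in versions if v[0] <= revision]
            newer = [v for v in versions if v[0] > revision]
            self.history[key] = older[-1:] + newer
```

After writing "a", then "b", then "a" again, key "a" has mod_revision 3 but version 2, because the write to "b" consumed revision 2 without touching "a". Reading "a" at revision 1 returns the first value until the store is compacted at a later revision, after which that read is rejected.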



etcd provides snapshots to speed up recovery and to keep the log from growing without bound. etcd automatically creates a snapshot once a configurable number of transactions have been committed since the last snapshot, and a user can also create a snapshot at any time with the etcdctl command. etcd acquires a global latch to produce a snapshot, so taking snapshots too frequently degrades the performance of database operations.
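The count-based trigger can be sketched as a counter compared against a threshold. This is an illustration of the idea only; `SnapshotTrigger`, its fields, and the choice to discard the whole log after a snapshot are assumptions for this example rather than a description of etcd's internals.

```python
class SnapshotTrigger:
    def __init__(self, snapshot_count):
        self.snapshot_count = snapshot_count  # configurable threshold
        self.applied_index = 0                # entries committed so far
        self.snapshot_index = 0               # index covered by the last snapshot
        self.log = []                         # entries not yet covered by a snapshot

    def apply(self, entry):
        """Record a committed entry and snapshot when the threshold is reached."""
        self.applied_index += 1
        self.log.append(entry)
        if self.applied_index - self.snapshot_index >= self.snapshot_count:
            self.take_snapshot()

    def take_snapshot(self):
        """Also callable directly, mirroring a user-requested snapshot."""
        self.snapshot_index = self.applied_index
        self.log.clear()  # entries covered by the snapshot are no longer needed
```

A smaller threshold keeps the log short and recovery fast, but snapshots then run more often, which is the trade-off the paragraph above describes.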



etcd keeps a secondary in-memory B-tree index over keys to accelerate range operations (e.g., GET and DELETE). Each index entry maps a data key to a pointer into the persistent B+tree.
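The two-level lookup can be modeled with standard-library structures. In this sketch a sorted list searched with `bisect` stands in for the in-memory B-tree and a dictionary keyed by revision stands in for the persistent B+tree; both substitutions, and the `IndexedStore` name, are assumptions made to keep the example short.

```python
import bisect


class IndexedStore:
    def __init__(self):
        self.revision = 0
        self.backend = {}     # stand-in for the persistent tree: revision -> (key, value)
        self.index_keys = []  # stand-in for the in-memory index: sorted data keys
        self.index_revs = {}  # data key -> revision of its latest entry in the backend

    def put(self, key, value):
        self.revision += 1
        self.backend[self.revision] = (key, value)
        if key not in self.index_revs:
            bisect.insort(self.index_keys, key)
        self.index_revs[key] = self.revision

    def range(self, start, end):
        """Return the latest (key, value) pairs with start <= key < end."""
        lo = bisect.bisect_left(self.index_keys, start)
        hi = bisect.bisect_left(self.index_keys, end)
        result = []
        for key in self.index_keys[lo:hi]:
            _, value = self.backend[self.index_revs[key]]
            result.append((key, value))
        return result
```

A range query first narrows the candidate keys in the sorted index and only then follows each pointer into the backend, so keys outside the range are never fetched.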

Stored Procedures

Not Supported