Espresso

Espresso is an internal distributed document-oriented database management system written by LinkedIn. It serves as the source-of-truth primary store for many downstream systems and applications such as Company Pages, Unified Social Content Platform, and InMail. Because of this role, its design principles focus on the requirements of production environments, which include but are not limited to operability, availability, scalability, and elasticity. For operability, it provides a flexible data model and APIs to support various applications, and it is designed to be highly compatible with the rest of the data ecosystem at LinkedIn. For availability, it has a carefully designed fault tolerance mechanism, e.g., there is always a warm standby of the Espresso clusters at a geographically remote disaster recovery data center. For scalability, most of the methods it adopts, including cluster management and data management, avoid centralized processing and synchronized operations. For elasticity, it supports online cluster expansion with little downtime. Beyond these design principles, the overall objective of Espresso is to find the sweet spot between SQL database systems and NoSQL systems, which is clearly reflected in the design of its data model.

Espresso is built on MySQL (with InnoDB as the storage engine). It has two system-level internal building blocks: a cluster management system and a change capture system. Apache Helix, which integrates Apache Zookeeper, carries out the first function. LinkedIn Databus was used as the first-generation change capture system but was later replaced by Kafka. The change capture system also plays an important role in data replication. In addition, Espresso has a library-level building block, Apache Lucene, which provides the basic support for its full-text inverted index.

History

The Espresso project was first planned and designed in early 2011. Its mission at that time was to fill a gap in LinkedIn's data infrastructure, which lacked a well-designed, highly consistent database system that was also scalable and agile. Drawing on its experience developing earlier non-relational database systems like Voldemort, the engineering team at LinkedIn spent one year writing Espresso and deployed it in production in June 2012.

Although LinkedIn once planned to open source Espresso, the plan was later shelved, so Espresso remains an internal system.

Indexes

Hash Table, Inverted Index (Full Text)

Espresso supports a hash index on the partition key as the primary index and a high-performance implementation of an inverted index for secondary indexes.

The inverted index of Espresso is adapted from Apache Lucene. There are two types of secondary indexes in Espresso based on their scope: 1) local secondary indexes, which are built within a document group; 2) global secondary indexes, which are built across multiple document groups. Local secondary indexes are optimized by prefixing the indexed terms with the collection key. This is called Prefix Index. An example of this is:

Suppose the original inverted index contains the term Andy with a long posting list. After applying Prefix Index, that list is split into several smaller lists, e.g., Mailbox_Andy and Profile_Andy.

Although this optimization increases the number of lists in the index, it reduces latency and memory footprint and bounds them by the size of the working set. The original inverted index of Lucene is implemented in a log-structured manner, which makes the index immutable. Espresso changes it to make it updatable.
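The effect of prefixing can be illustrated with a small in-memory sketch. This is not Espresso's or Lucene's implementation; the function names, the sample documents, and the choice of "_" as a separator (taken from the Mailbox_Andy example) are assumptions made for illustration only.

```python
from collections import defaultdict

def build_plain_index(docs):
    """Map each term to the IDs of all documents that contain it."""
    index = defaultdict(list)
    for doc_id, (_collection_key, text) in docs.items():
        for term in set(text.split()):
            index[term].append(doc_id)
    return index

def build_prefix_index(docs):
    """Same as above, but each term is prefixed with the collection key,
    so one long posting list becomes several short ones."""
    index = defaultdict(list)
    for doc_id, (collection_key, text) in docs.items():
        for term in set(text.split()):
            index[f"{collection_key}_{term}"].append(doc_id)
    return index

# Hypothetical documents: doc_id -> (collection key, text)
docs = {
    1: ("Mailbox", "Andy sent a message"),
    2: ("Mailbox", "reply to Andy"),
    3: ("Profile", "Andy is an engineer"),
}

plain = build_plain_index(docs)
prefixed = build_prefix_index(docs)

print(plain["Andy"])             # one list covering every collection
print(prefixed["Mailbox_Andy"])  # only the Mailbox documents
print(prefixed["Profile_Andy"])  # only the Profile documents
```

A lookup scoped to one collection only has to read that collection's short list, which is why the cost tracks the working set rather than the whole index.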

The posting lists of Prefix Index are stored in MySQL, which provides additional transactional guarantee. Terms in an inverted index are organized by a B+-Tree, and bitmap indexes are also used to speed up the searching process. But these two indexes are not used as data indexes.

Storage Model

N-ary Storage Model (Row/Record)

Espresso is a document store implemented on MySQL, which means its underlying storage model is still a row store. Although MySQL has provided native support for document stores since version 5.7.12, that release came much later than the start of the Espresso project.

Data Model

Document / XML

Espresso adopts a hierarchical document-oriented data model in which documents belong to both tables and document groups, and tables and document groups in turn belong to databases. Different levels of this hierarchy use different schema definition formats, e.g., database and table schemas are defined in JSON, while document schemas are defined in Avro, which supports online schema evolution. Since a document group is a logical concept with no explicit representation, it has no schema.

The database is the largest unit of the data model. Tables and document groups are the two parallel units below it in the hierarchy, but they are quite different from each other. A table explicitly contains documents that share the same schema, and the table schema defines the complete key structure of those documents. A document group, in contrast, does not actively own any document: any two documents with the same partitioning key belong to the same document group. A table can therefore span multiple document groups, and a document group can span multiple tables. The most basic unit of the data model is the document. A document holds rich, schema-defined content with various data structures. It is conceptually similar to a row in SQL database systems. Documents are identified by their primary keys.

There is another level in the hierarchy of Espresso called the collection. Like document groups, collections are not explicitly represented by schemas; instead, any documents that share the same partial key are in the same collection. To make things clear, below is an example of this data model:

Assume MailboxDB is a database name, Messages is a table name. The schema of the Messages table defines that all its documents have key structure <MailboxID>/<MessageID>, where <MailboxID> is partition key. Then we have:

  • The complete key /MailboxDB/Messages/100/1 uniquely identifies a document.
  • Document /MailboxDB/Messages/100/1 and document /MailboxDB/Messages/100/2 are in the same collection, /MailboxDB/Messages/100.
  • If MailboxStats is another table and has the same schema as Messages, then document /MailboxDB/Messages/100/1 and document /MailboxDB/MailboxStats/100/5 are in the same document group, because they share the partition key 100.
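The relationships in this example can be expressed as a short sketch. The helper names below are hypothetical and not part of any Espresso API; the sketch assumes the two-part key structure from the example, takes a collection to be the key path without its last component, and takes a document group to be identified by the database plus the partition key.

```python
from typing import NamedTuple, Tuple

class DocumentKey(NamedTuple):
    database: str
    table: str
    partition_key: str
    sub_keys: Tuple[str, ...]

def parse_key(path: str) -> DocumentKey:
    """Split a path such as /MailboxDB/Messages/100/1 into its parts."""
    parts = path.strip("/").split("/")
    if len(parts) < 3:
        raise ValueError(f"expected /<db>/<table>/<partition key>/..., got {path!r}")
    return DocumentKey(parts[0], parts[1], parts[2], tuple(parts[3:]))

def collection_of(key: DocumentKey) -> str:
    """Assumption: the collection is the key path minus its last component."""
    prefix = (key.database, key.table, key.partition_key) + key.sub_keys[:-1]
    return "/" + "/".join(prefix)

def document_group_of(key: DocumentKey) -> Tuple[str, str]:
    """Documents with the same partition key form one group, across tables."""
    return (key.database, key.partition_key)

a = parse_key("/MailboxDB/Messages/100/1")
b = parse_key("/MailboxDB/Messages/100/2")
c = parse_key("/MailboxDB/MailboxStats/100/5")

print(collection_of(a) == collection_of(b))          # same collection
print(document_group_of(a) == document_group_of(c))  # same document group
print(collection_of(a) == collection_of(c))          # different tables
```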

Although the document group is a passive component of the data model, it plays an important role in data management, e.g., secondary indexes are mainly built inside a document group, and the documents of the same document group are usually stored on the same node. Espresso tries to obtain the benefits of both relational database systems and NoSQL database systems by leveraging the hierarchical document data model. For example, a transaction can span any set of documents within a single document group, whereas document-oriented database systems usually support transactional operations only on a single document. On the other hand, a data model in which documents have only loose relationships keeps Espresso flexible.

System Architecture

Shared-Nothing

Concurrency Control

Multi-version Concurrency Control (MVCC), Two-Phase Locking (Deadlock Detection)

Espresso relies on MySQL to provide transaction support, so its concurrency control schemes are identical to MySQL's. Because Espresso only provides transactional actions within a partition (a.k.a. document group) and all the documents belonging to the same partition are stored on the same node, it does not support distributed transactions.

Espresso uses a master-slave model. The consistency model between the primary partition on the master node and its replicas on slave nodes is timeline consistency, which requires the commit order on slave nodes to be the same as on the master node. The order is defined by a partition-specific sequence number called the system change number (SCN).
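The ordering requirement can be sketched from the replica's side. The class below is a hypothetical illustration, not Espresso code, and it assumes SCNs for one partition are consecutive integers starting at 1 so that gaps can be detected; a change that arrives early is buffered until every earlier SCN has been applied.

```python
class ReplicaPartition:
    """Toy replica of one partition that applies changes strictly in SCN order."""

    def __init__(self):
        self.data = {}
        self.last_applied_scn = 0
        self._pending = {}  # scn -> (key, value), waiting for earlier SCNs

    def receive(self, scn, key, value):
        if scn <= self.last_applied_scn:
            return  # already applied; ignore the duplicate
        self._pending[scn] = (key, value)
        # Apply every change that is now contiguous with what was applied.
        while self.last_applied_scn + 1 in self._pending:
            k, v = self._pending.pop(self.last_applied_scn + 1)
            self.data[k] = v
            self.last_applied_scn += 1

replica = ReplicaPartition()
replica.receive(2, "msg", "edited")  # arrives early, so it is buffered
state_after_first = dict(replica.data)
replica.receive(1, "msg", "draft")   # fills the gap; SCN 1 then SCN 2 apply
print(state_after_first, replica.data, replica.last_applied_scn)
```

Because the replica never applies SCN 2 before SCN 1, it ends in the same state as the master even though the changes arrived out of order.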

Logging

Physical Logging

Espresso's logging scheme is a variant of MySQL's binary logging. It is essentially a kind of physical logging and is responsible for data replication. The original version of MySQL's binary logging does not support replicating only part of the log, which makes operations like online cluster expansion difficult. Espresso adds a partition-specific sequence number, the system change number (SCN), to each binary log record so that the log can be partially replicated and sent to different data shards.
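Why tagging each record helps can be seen in a minimal sketch. The record layout and function name here are hypothetical and do not reflect the actual binary log format; the sketch only assumes that each record carries its partition and an SCN that increases within that partition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogRecord:
    partition: int  # partition the change belongs to
    scn: int        # sequence number within that partition
    payload: str    # stand-in for the logged change

def records_for_partition(log, partition, after_scn=0):
    """Select one partition's records, optionally only those newer than after_scn."""
    return [r for r in log if r.partition == partition and r.scn > after_scn]

# One node's log, with changes from two partitions interleaved in commit order.
log = [
    LogRecord(0, 1, "p0-a"),
    LogRecord(1, 1, "p1-a"),
    LogRecord(0, 2, "p0-b"),
    LogRecord(1, 2, "p1-b"),
    LogRecord(1, 3, "p1-c"),
]

# A new node taking over partition 1 needs only that partition's records.
print([r.payload for r in records_for_partition(log, 1)])
# A replica that has already applied SCN 2 of partition 1 needs only the tail.
print([r.payload for r in records_for_partition(log, 1, after_scn=2)])
```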

Isolation Levels

Read Uncommitted, Read Committed, Serializable, Snapshot Isolation, Repeatable Read

Espresso relies on MySQL to provide transaction support, so its isolation levels are also identical to MySQL's.

Query Interface

HTTP / REST

Espresso adopts REST-style APIs for ease of use and ease of integration. It supports 1) reads via primary keys or secondary indexes; 2) partial updates, full updates, and inserts; 3) filter predicates based on timestamp and document CRC; 4) registration for monitoring the change stream; 5) bulk load and export for offline analysis.

All the operations above can be applied transactionally within the range of a partition, i.e. a document group.

Storage Architecture

Disk-oriented

Espresso's storage engine is MySQL's InnoDB. Therefore Espresso directly inherits the disk-oriented storage architecture from InnoDB, and relies on its functions to do buffer pool management, etc.