Compass, created by Shay Banon in 2004, is the precursor to Elasticsearch. For the third version of Compass, Banon rewrote large parts of it to "create a scalable search solution": one built from the ground up to be distributed and to use a common interface, JSON over HTTP. Banon released the first version of Elasticsearch in February 2010. Elasticsearch BV was founded in 2012 to provide commercial services and products around Elasticsearch and related software. In March 2015, the company changed its name from Elasticsearch to Elastic.
By default, Logstash buffers events in bounded in-memory queues. Persistent queues instead absorb bursts of events by buffering them on disk, which provides durability of data within Logstash for the Elastic Stack. When persistent queues are enabled, Logstash stores events on disk and commits them using checkpointing. The persistent queue has two kinds of pages: head pages and tail pages. There is only one head page; when it reaches a certain size, it becomes a tail page and a new head page is created. Tail pages are immutable and the head page is append-only. When recording a checkpoint, Logstash calls fsync on the head page and atomically writes the current state of the queue to disk. Checkpointing is atomic: an update to the checkpoint file is saved only if it completes successfully. If Logstash is terminated or a hardware-level failure occurs, any data buffered in the persistent queue but not yet checkpointed is lost.
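The page and checkpoint behavior described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not Logstash's implementation: the class `PagedQueue`, its `page_capacity` parameter, and the method names are hypothetical, and the checkpoint is simulated with a counter rather than a real fsync.

```python
class PagedQueue:
    """Toy model: one append-only head page plus immutable tail pages."""

    def __init__(self, page_capacity):
        self.page_capacity = page_capacity  # hypothetical size limit, in events
        self.tail_pages = []                # frozen pages, stored as tuples
        self.head_page = []                 # the single append-only page
        self.checkpointed = 0               # events covered by the last checkpoint

    def append(self, event):
        self.head_page.append(event)
        if len(self.head_page) >= self.page_capacity:
            # A full head page becomes an immutable tail page; a new head starts.
            self.tail_pages.append(tuple(self.head_page))
            self.head_page = []

    def events(self):
        out = [e for page in self.tail_pages for e in page]
        out.extend(self.head_page)
        return out

    def checkpoint(self):
        # Stand-in for "fsync the head page, then atomically record queue state".
        self.checkpointed = len(self.events())

    def recover_after_crash(self):
        # Only events covered by the last checkpoint are assumed to survive.
        return self.events()[: self.checkpointed]


queue = PagedQueue(page_capacity=2)
for event in ["e1", "e2", "e3"]:
    queue.append(event)
queue.checkpoint()
queue.append("e4")  # buffered, but never checkpointed
survivors = queue.recover_after_crash()
print(survivors)
```

In this model the event appended after the last checkpoint is the one that is lost, which mirrors the durability limit stated in the paragraph.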
Two-Phase Locking (Deadlock Detection)
Elasticsearch does not support ACID transactions for changes involving multiple documents, although changes to individual documents are ACID-compliant. If your main data store is a relational database, and Elasticsearch is used only as a search engine or as a way to improve performance, then ACID transactions are handled in the relational database. If you are not using a relational store, these concurrency issues must be handled at the Elasticsearch level. The three practical solutions used with Elasticsearch are global locking, document locking, and tree locking, in order of increasingly fine-grained locks. Each is a form of two-phase locking. A global lock blocks the entire storage system so that only one writer is active at a time. Document locking locks all of the files involved in a change. Tree locking locks only a directory.
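The locking idea can be sketched with create-only writes: a lock is a record that can be created only if it does not already exist, and a writer must hold every lock it needs before changing anything. The code below is a minimal in-memory stand-in, not the Elasticsearch client API; `LockStore`, `ConflictError`, `acquire_all`, and `release_all` are hypothetical names introduced for illustration.

```python
class ConflictError(Exception):
    """Raised when a lock record already exists."""


class LockStore:
    """In-memory stand-in for a store that supports create-only writes."""

    def __init__(self):
        self._docs = {}

    def create(self, doc_id, body):
        if doc_id in self._docs:
            raise ConflictError(doc_id)
        self._docs[doc_id] = body

    def delete(self, doc_id):
        self._docs.pop(doc_id, None)


def release_all(store, lock_ids):
    # Shrinking phase: give back every lock.
    for lock_id in lock_ids:
        store.delete(lock_id)


def acquire_all(store, lock_ids, owner):
    # Growing phase: take every lock, or roll back and report failure.
    held = []
    for lock_id in lock_ids:
        try:
            store.create(lock_id, {"owner": owner})
        except ConflictError:
            release_all(store, held)
            return False
        held.append(lock_id)
    return True


store = LockStore()
# A global lock is the special case of a single well-known lock id.
first = acquire_all(store, ["file-1", "file-2"], owner="writer-a")
second = acquire_all(store, ["file-3", "file-2"], owner="writer-b")  # file-2 is taken
third = acquire_all(store, ["file-3"], owner="writer-c")  # file-3 was rolled back
release_all(store, ["file-1", "file-2"])
fourth = acquire_all(store, ["file-2"], owner="writer-b")
print(first, second, third, fourth)
```

Using one lock id for the whole store gives the global variant; using one id per file or per directory gives the finer-grained variants, at the cost of more lock records to manage.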
Elasticsearch is a document-oriented distributed database. The entire object graph you want to search needs to be indexed, so documents must be denormalized before they are indexed. Elasticsearch designs mappings and stores documents in a way that is optimized for search and retrieval. It is well suited to write-once-read-many workloads. Like many other document-oriented databases, Elasticsearch does not enforce constraints on data.
Elasticsearch targets text search, so its indexing differs from most relational database index implementations. Elasticsearch uses an inverted index as its basic index structure. An index term is the unit of search, and many search problems are recast as string-prefix problems over the index terms. To favor search speed, Elasticsearch keeps the index compact: when searching over a smaller index, less data needs to be processed and more of it fits in memory. The trade-off is that a compact index cannot be updated efficiently. An Elasticsearch index is made up of one or more shards, each of which can have zero or more replicas. These are all individual Lucene indexes, which in turn are made up of index segments.
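To make the inverted index and the string-prefix idea concrete, here is a small sketch. It is a simplification under stated assumptions, not Lucene's data structure: terms are produced by lowercasing and splitting on whitespace, and the function names `build_inverted_index` and `prefix_search` are illustrative.

```python
import bisect
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def prefix_search(index, sorted_terms, prefix):
    """Find documents containing any term that starts with the prefix."""
    # Binary search finds the first candidate term; matches are contiguous.
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = set()
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.update(index[term])
    return sorted(matches)


docs = {
    1: "Winter is coming",
    2: "Ours is the fury",
    3: "The choice is yours",
}
index = build_inverted_index(docs)
sorted_terms = sorted(index)

print(index["the"])                             # documents containing "the"
print(prefix_search(index, sorted_terms, "c"))  # terms "choice" and "coming"
```

Because the term list is sorted, every term sharing a prefix sits in one contiguous run, which is why a prefix lookup needs only a binary search followed by a short scan.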
Performing full SQL-style joins in a distributed system like Elasticsearch is prohibitively expensive. Instead, Elasticsearch offers two forms of join that are designed to scale horizontally: the nested query, and the has_child and has_parent queries. The nested query uses an idea similar to a nested loop join. Documents may contain fields of type nested, which are used to index arrays of objects so that each object can be queried (with the nested query) as an independent document. The has_child and has_parent queries use a hash join within a single index: has_child returns parent documents that have matching children, and has_parent returns child documents whose parent matches.
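The request bodies for these queries have roughly the following shape, shown here as Python dictionaries that serialize to JSON. The field and type names (`comments`, `comments.author`, `comment`, `blog`, `title`) are hypothetical and assume a mapping with a nested field and a parent/child relation already in place.

```python
import json

# Nested query: match objects inside a field of type "nested".
nested_query = {
    "query": {
        "nested": {
            "path": "comments",  # hypothetical nested field
            "query": {"match": {"comments.author": "alice"}},
        }
    }
}

# has_child: return parent documents that have at least one matching child.
has_child_query = {
    "query": {
        "has_child": {
            "type": "comment",  # hypothetical child type
            "query": {"match": {"author": "alice"}},
        }
    }
}

# has_parent: return child documents whose parent matches.
has_parent_query = {
    "query": {
        "has_parent": {
            "parent_type": "blog",  # hypothetical parent type
            "query": {"match": {"title": "search"}},
        }
    }
}

for body in (nested_query, has_child_query, has_parent_query):
    print(json.dumps(body))
```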
Logging in Elasticsearch may differ from what relational database systems provide. It is event-based and implemented with Log4j, and the logging settings can be updated dynamically. The logs are used for analysis more than for recovery. For data resiliency, the Elastic Stack uses the checkpointing features described above.
Distributed search execution has to consult a copy of every shard in the indices of interest to see whether it holds any matching documents. After the matching documents are found, results from multiple shards must be combined into a single sorted list before the search API can return a "page" of results. For this reason, search is executed in a two-phase process called query then fetch.
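The two phases can be sketched as follows. This is a toy model rather than Elasticsearch's implementation: shards are plain dictionaries, the score is a simple term count, and the names `score`, `query_phase`, and `search` are assumptions made for the example.

```python
def score(text, term):
    """Toy relevance score: how many times the term occurs."""
    return text.lower().split().count(term)


def rank_key(hit):
    # Highest score first; ties broken by document id for determinism.
    return (-hit[0], hit[1])


def query_phase(shard, term, top_n):
    """Each shard returns only (score, doc_id) pairs for its local top_n."""
    hits = [(score(text, term), doc_id) for doc_id, text in shard.items()]
    hits = [hit for hit in hits if hit[0] > 0]
    return sorted(hits, key=rank_key)[:top_n]


def search(shards, term, from_=0, size=2):
    top_n = from_ + size
    # Query phase: gather lightweight candidates from every shard.
    candidates = []
    for shard_no, shard in enumerate(shards):
        for hit_score, doc_id in query_phase(shard, term, top_n):
            candidates.append((hit_score, doc_id, shard_no))
    # Merge into one globally sorted list and cut out the requested page.
    page = sorted(candidates, key=rank_key)[from_:from_ + size]
    # Fetch phase: load full documents only for the hits on the page.
    return [(doc_id, shards[shard_no][doc_id]) for _, doc_id, shard_no in page]


shards = [
    {"a1": "search search engine", "a2": "database"},
    {"b1": "search", "b2": "full text search search search"},
]
first_page = search(shards, "search", from_=0, size=2)
second_page = search(shards, "search", from_=2, size=2)
print(first_page)
print(second_page)
```

Each shard must return `from_ + size` candidates because the coordinating step cannot know in advance which shard holds the hits for the requested page.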
Elasticsearch provides a search API that lets you execute a search query and get back the search hits that match it. It is a RESTful service. The query can be provided either as a simple query string parameter (URI search) or in a request body (request body search). The search API also offers advanced features such as suggesters, the count API, the validate API, the explain API, and the profile API.
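The two request styles can be compared by constructing them without sending anything. The sketch assumes a node at `http://localhost:9200` and a hypothetical index named `articles` with a `title` field; only the `_search` endpoint, the `q` parameter, and the `match` query come from the search API itself.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://localhost:9200"  # assumed local node address
INDEX = "articles"                  # hypothetical index name

# URI search: the query travels in the "q" query-string parameter.
uri_search_url = f"{BASE_URL}/{INDEX}/_search?" + urlencode({"q": "title:search"})

# Request body search: the same endpoint, with the query as a JSON body.
request_body_url = f"{BASE_URL}/{INDEX}/_search"
request_body = json.dumps({"query": {"match": {"title": "search"}}})

print(uri_search_url)
print(request_body_url)
print(request_body)
```

URI search is convenient for quick tests, while the request body form can express the full query language.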
Elasticsearch does not rely on special hardware such as GPUs or FPGAs. Documents are stored on disk. Elasticsearch uses Lucene under the hood to handle indexing and querying at the shard level. The files in the data directory are written by both Elasticsearch and Lucene: Lucene is responsible for writing and maintaining the Lucene index files, while Elasticsearch writes metadata related to the features it builds on top of Lucene.
Elasticsearch is a document-based database. It stores each record as a whole document. The document is a JSON object, and all of its attributes are stored together in that object. In Elasticsearch, the term document has a specific meaning: it refers to the top-level, or root, object that is serialized into JSON and stored in Elasticsearch under a unique ID.
https://github.com/elastic/elasticsearch
https://www.elastic.co/guide/index.html
Elasticsearch BV
2004
Compass