Neptune

Amazon Neptune is a fully managed graph database service for working with highly connected datasets. It supports multiple graph models, including Property Graph and W3C's RDF, along with their respective query languages, Apache TinkerPop Gremlin and SPARQL. Neptune is highly available, with read-only replicas, point-in-time recovery, and continuous backup to Amazon S3.

History

Amazon Web Services announced Amazon Neptune on November 29, 2017 with a limited preview of the service. On May 30, 2018, Neptune became generally available. Because its use cases often involve private data, Neptune became HIPAA eligible on September 12, 2018 and compliant with the Payment Card Industry Data Security Standard on December 12, 2018.

Compression

Naïve (Page-Level)

Neptune supports compression of single files using the gzip format.

Concurrency Control

Multi-version Concurrency Control (MVCC)

Read-only queries are evaluated under snapshot isolation. That is, each read-only query operates on a single consistent snapshot of the database, taken when the query begins. Snapshot isolation is achieved via multiversion concurrency control and guarantees that dirty reads, non-repeatable reads, and phantom reads do not occur. Read-only queries may be run on read replicas, in which case a small replication lag can cause the results to trail slightly behind the latest state of the primary instance.
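The snapshot behavior described above can be illustrated with a toy multiversion store. This is a minimal sketch of the general MVCC idea, not Neptune's implementation; the class names `VersionedStore` and `Snapshot` and the integer commit clock are assumptions made for illustration.

```python
class VersionedStore:
    """Toy multiversion store: each write appends a new (commit_ts, value) version."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), oldest first
        self._clock = 0      # timestamp of the most recent commit

    def write(self, key, value):
        # Every write commits immediately with the next timestamp (hypothetical simplification).
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))

    def snapshot(self):
        # A reader remembers the commit timestamp at the moment its query begins.
        return Snapshot(self, self._clock)


class Snapshot:
    """Read-only view that ignores every version committed after it was taken."""

    def __init__(self, store, ts):
        self._store = store
        self._ts = ts

    def read(self, key):
        # Return the newest version committed at or before the snapshot timestamp.
        for commit_ts, value in reversed(self._store._versions.get(key, [])):
            if commit_ts <= self._ts:
                return value
        return None


store = VersionedStore()
store.write("alice.age", 30)
snap = store.snapshot()       # the read-only query begins here
store.write("alice.age", 31)  # a later write, invisible to snap
store.write("bob.age", 25)    # a later insert, also invisible to snap
```

Here `snap.read("alice.age")` keeps returning the value that was current when the snapshot was taken, and the later insert of `bob.age` is not visible to it, which is how repeated reads stay stable and phantoms are avoided in this sketch.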

For mutation queries (i.e. write queries), Neptune locks records and ranges of records when reading data, which ensures that the data remains consistent.

Data Model

Graph Triplestore / RDF

Neptune uses a four-position (quad) element called a Neptune quad. A Neptune quad is composed of a subject, a predicate, an object, and a graph identifier. A quad describes either a relationship between two resources or a property of a resource. For example, an edge is described by a quad, and so is each property of a node. A graph is the set of quad statements that share the same graph identifier.
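As a rough illustration of the quad model, the sketch below represents quads as named tuples and groups them into graphs by identifier. The `Quad` type, the `graph` helper, and all identifiers such as `person:alice` and `g1` are hypothetical and chosen only for this example; they do not reflect Neptune's internal encoding.

```python
from collections import namedtuple

# Hypothetical in-memory representation of a four-position statement.
Quad = namedtuple("Quad", ["subject", "predicate", "object", "graph"])

quads = [
    Quad("person:alice", "name", "Alice", "g1"),         # property of a node
    Quad("person:bob", "name", "Bob", "g1"),             # property of a node
    Quad("person:alice", "knows", "person:bob", "g1"),   # edge between two resources
    Quad("person:carol", "name", "Carol", "g2"),         # statement in another graph
]


def graph(all_quads, graph_id):
    """Return the quads that share one graph identifier, i.e. one graph."""
    return [q for q in all_quads if q.graph == graph_id]


g1 = graph(quads, "g1")
edges = [q for q in g1 if q.predicate == "knows"]
```

In this sketch the edge and the node properties have exactly the same shape; only the predicate and the kind of value in the object position distinguish them.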

Foreign Keys

Supported

The edges/relationships in the graph are foreign keys.

Indexes

B+Tree

Neptune maintains three indices on quads:

  • SPOG - Key composed of Subject + Predicate + Object + Graph
  • POGS - Key composed of Predicate + Object + Graph + Subject
  • GPSO - Key composed of Graph + Predicate + Subject + Object

In other words, there are three indexes whose keys are different orderings of the four quad positions. Each of the 16 access patterns has a corresponding index. For example, the access pattern ???? (no constraints, so every quad is returned) will use the index SPOG. On the other hand, ?P?G (the predicate and graph identifier are constrained, but the subject and object are not) will use the index GPSO.
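One simple way to picture the mapping from access patterns to indexes is to pick the index whose key starts with the longest run of constrained positions. The sketch below does that. It is an assumed heuristic for illustration only, not Neptune's documented selection logic, and the function names `bound_prefix_length` and `choose_index` are hypothetical. A pattern is written as four characters in S, P, O, G order, with `?` marking an unconstrained position.

```python
INDEXES = ("SPOG", "POGS", "GPSO")


def bound_prefix_length(index, pattern):
    """Count how many leading key positions of `index` are constrained in `pattern`."""
    bound = {c for c in pattern if c != "?"}
    length = 0
    for position in index:
        if position not in bound:
            break
        length += 1
    return length


def choose_index(pattern):
    """Pick the index with the longest constrained key prefix (ties go to the first index)."""
    return max(INDEXES, key=lambda idx: bound_prefix_length(idx, pattern))


no_constraints = choose_index("????")        # nothing bound, falls back to the first index
predicate_and_graph = choose_index("?P?G")   # G then P form a two-position prefix of GPSO
```

Under this heuristic the two examples from the text resolve to SPOG and GPSO respectively. Patterns where no index key begins with a constrained position simply fall back to the first index here; how Neptune handles those cases is not covered by this sketch.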

Amazon Neptune uses Hash Tables for its indexes.

Isolation Levels

Read Committed Snapshot Isolation

Read-only queries are evaluated under snapshot isolation. Mutation queries (i.e. write queries) are executed under read committed isolation.

Joins

Hash Join Sort-Merge Join

Neptune has four operators related to joins: HashIndexBuild, HashIndexJoin, MergeJoin, and PipelineJoin.

A HashIndexBuild creates a hash index from either a downstream operator or set of quads. HashIndexJoin takes incoming solutions from a downstream operator and joins them with the result of a specific previous HashIndexBuild.

A MergeJoin takes in multiple sets and outputs their collective join.

A PipelineJoin takes the output of a downstream operator and joins it against a specified pattern.
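The build-then-probe relationship between HashIndexBuild and HashIndexJoin resembles a classic hash join. The sketch below shows that general technique over lists of variable bindings. The functions `hash_index_build` and `hash_index_join`, the dictionary-based solutions, and the sample data are all assumptions for illustration and are not Neptune's operators.

```python
from collections import defaultdict


def hash_index_build(solutions, join_var):
    """Build phase: bucket solutions by the value of the join variable."""
    index = defaultdict(list)
    for solution in solutions:
        index[solution[join_var]].append(solution)
    return index


def hash_index_join(incoming, index, join_var):
    """Probe phase: merge each incoming solution with every matching indexed solution."""
    for solution in incoming:
        for match in index.get(solution[join_var], []):
            yield {**match, **solution}


# Hypothetical bindings produced by two earlier operators.
names = [{"p": "alice", "name": "Alice"}, {"p": "bob", "name": "Bob"}]
knows = [{"p": "alice", "friend": "bob"}, {"p": "alice", "friend": "carol"}]

index = hash_index_build(names, "p")
joined = list(hash_index_join(knows, index, "p"))
```

Building the index once lets each incoming solution find its matches with a single lookup instead of a scan over the other input.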

Query Compilation

Code Generation

Neptune can process Gremlin and SPARQL queries. A Gremlin query can be processed as a series of TinkerPop steps. While these TinkerPop steps produce the correct results, they are inefficient on large graphs. Instead, Neptune tries to convert these steps into custom NeptuneGraphQuerySteps. If a TinkerPop step cannot be converted, Neptune stops the conversion, and that step and all subsequent ones execute as ordinary TinkerPop steps. Neptune then uses query optimizers to rewrite the query plan using static analysis and estimated cardinalities. Finally, Neptune creates a pipeline of physical operators.

Query Interface

SPARQL Gremlin HTTP / REST

Neptune supports both Gremlin and SPARQL. Specifically, Neptune supports three Gremlin variants:

  • Gremlin-Groovy
  • Gremlin-Java
  • Gremlin-Python

Queries in both Gremlin and SPARQL can be sent over HTTP REST.
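As a hedged sketch of what such HTTP requests can look like, the code below only constructs the requests and does not send them. The host name `your-neptune-endpoint` is a placeholder, and the `/gremlin` and `/sparql` paths, port 8182, and body formats are stated here as assumptions about a typical setup; check them against the service documentation. The helper names are hypothetical, and authentication (for example request signing) is not shown.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint; replace with the address of a real cluster.
ENDPOINT = "https://your-neptune-endpoint:8182"


def gremlin_request(query):
    """Build (but do not send) a POST carrying a Gremlin query as JSON."""
    body = json.dumps({"gremlin": query}).encode("utf-8")
    return Request(
        ENDPOINT + "/gremlin",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def sparql_request(query):
    """Build (but do not send) a POST carrying a SPARQL query as form data."""
    body = urlencode({"query": query}).encode("utf-8")
    return Request(
        ENDPOINT + "/sparql",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


gremlin_req = gremlin_request("g.V().limit(1)")
sparql_req = sparql_request("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")
```

Passing either object to `urllib.request.urlopen` would send it, provided the endpoint is reachable from the client's network.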

Storage Architecture

Disk-oriented

Neptune utilizes Amazon S3 to store data.

Storage Model

Custom

Neptune stores data as a Neptune Quad.

System Architecture

Shared-Disk

Data is stored in a cluster volume, which is a single virtual volume backed by solid-state disks. Neptune allows up to fifteen read replicas. These replicas serve read-only queries, while the primary instance serves both read and write queries. If the primary instance fails, one of the replicas is promoted to be the new primary instance.
