Aerospike's Transaction Durability Guarantees


#1

Durability

Aerospike provides durability using multiple techniques:

  • Persistence by storing data in flash/SSD on every node and performing direct reads from flash
  • Replication within a cluster by storing copies of the data on multiple nodes
  • Replication across geographically separated clusters

Durability is achieved by SSD based storage and replication. The replication to replica (or prole) nodes is done synchronously. I.e., the client application will be informed only after the record is successfully written on the replica nodes. So, even if one node failed for any reason, we will have one more copy in another node thereby ensuring durability of the committed records. This is similar to the data + log file approach of traditional RDBMS where, If a data file is lost, the data can be recovered by replaying the log file. However, in case of Aerospike when one copy of the data is lost on a node, the data need not be recovered. Instead, the latest copy of the lost data is instantly available in one or more replica nodes in the same cluster as well as in nodes residing in remote clusters.

On top of this, Aerospike’s cross data center (XDR) support can be used to asynchronously replicate data to a geographically separated cluster providing an additional layer of durability. This will ensure that all of the data in the Aerospike database survives on a remote cluster even if an entire cluster fails and data is unrecoverable.

Resilience to simultaneous hardware failures

In the presence of node failures, applications written using Aerospike clients are able to seamlessly retrieve one of the copies of the data from the cluster with no special effort. This is because, in an Aerospike cluster, the virtual partitioning and distribution of data within the cluster is completely invisible to the application. Therefore, when application libraries make calls using the simple Client API to the Aerospike cluster, any node can take requests for any piece of data.

If a cluster node receives a request for a piece of data it does not have locally, it satisfies the request by generating a proxy request fetch the data from the real owner using the internal cluster interconnect and subsequently replying to the client directly. The Aerospike client-server protocol also implements caching of latest known locations of client requested data in the client library itself thus minimizing the number of network hops required to respond to a client request.

During the period immediately after a cluster node has been added or removed, the Aerospike cluster automatically transfers data between the nodes to rebalance and achieve data availability. During this time, Aerospike’s internal "proxy" transaction tracking allows high-consistency to be achieved by applying reads and writes to the cluster nodes which have the data, even if the data is in motion.

Offsite data storage and cross data center portability

Aerospike provides online backup and restore, which, as the name indicates, can be applied while the cluster is in operation. Even though data replication will solve most real-world data center availability issues, an essential tool of any database administrator is the ability to run backup and restore. An Aerospike cluster has the ability to iterate all data within a namespace (similar to a map/reduce). The backup and restore tools are typically run on a maintenance machine with a large amount of inexpensive, standard rotational disk.

Aerospike backup and restore tools are made available with full source. The file format in use is optimized for high speed but uses an ASCII format, allowing an operator to validate the data inserted into the cluster, and use standard scripts to move data from one data store to another. The backup tool splits the backup into multiple files. This allows restores to occur in parallel from multiple machines in the case of needing a very rapid response to a catastrophic failure event.


CPU unusually high on one node of 8 node cluster
Understanding aerospike server proxies
#2

Hi @kporter

I had a question about reads during node failure. You have mentioned above that

client-server protocol also implements caching of latest known locations of client requested data in the client library itself .

I guess this caching is the partition map stored at client lib side and which gets updated every 1 second(Please correct if I am wrong).

I am using Java client which as of now doesn’t support prole reads (a recently released version does). If partition map knows the master node of the shard to which that keys belongs and if that master is down and partition map is still not updated, it should have failed in my understanding. But I don’t see any read failures (Write failures are there)

There is no issue that we are facing, but If we could get answer to this question, this would just help in our understanding. Thanks.


#3

I just run the script to update the KB section :sunglasses:.

Yes, I would think so too.

Hm, I wonder if the previous behavior was to pick a node at random to read from, leaving it to the cluster to proxy the request to the appropriate node. If you repeat the test, see if proxy-initiate increases during the window where write failures are occurring.

I will try and get someone to respond with a more definitive answer in the mean time :smile:.


#4

If the master is down and the partition map is still not updated, the read from master will fail the first time. The default policy is to retry after a 500ms sleep. So, the retry will probably occur after the partition map has been updated, resulting in a successful read.


#5

Hi @kporter

Thanks for your detailed reply. Sorry for my delayed response.

I failed one node and checked the proxy_initiate values coming from asinfo command. I don’t see any increase in that. Also, I had read somewhere in AS docs that every request reaches to correct partition in a single hop. So, Java client sending to random node may not be correct. Thanks.

Hi @Brian Thanks for your reply. Sorry, I couldn’t understand that how partition map will get updated in 500ms. Cluster will take 1.5 seconds to realise a node down(with default Interval/Timeout) and then partition map will be updated in 1 seconds. So, worst case it will be 2.5 seconds and best case could be 1 seconds. How can it be < 1s ?Can you please help in my understanding here ? Thanks.


#6

I didn’t mean to suggest this was the general case. I was suggesting there may be a mechanism such as this when a node doesn’t respond. But a quick look at SyncCommand.java and this is not the case.

One possibility is that the thread.sleep call in Util.java is oversleeping. A bit of extra time is expected, this would be quite a bit of extra time though.

From the Java language specification:

Thread.sleep causes the currently executing thread to sleep
(temporarily cease execution) for the specified duration, subject to
the precision and accuracy of system timers and schedulers. The thread does not lose ownership of any monitors, and resumption of execution
will depend on scheduling and the availability of processors on which to execute the thread.

I am able to find evidence that running on a virtual machine may exacerbate this issue, specifically when using VMWare “configured badly”. This could easily be verified by adding another timeout check after we wake from sleeping and then rerun your test.