FAQ - General questions around transaction handling in Aerospike during cluster size changes



FAQ - General questions on write transaction handling in Aerospike

Detail

This knowledgebase article covers some details on how Aerospike handles transactions in a distributed cluster environment.

Background

In general (and by default[2]), Aerospike reads are executed against the node holding the master copy of the partition the record belongs to. Similarly, writes/updates are processed against the node holding the master copy of the partition the record belongs to, and a prole write is issued on the node holding the first replica copy of that partition.

Aerospike clients[1] maintain a partition map that captures which node owns each replica copy of each of the 4096 partitions in each namespace.
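As an illustration of what such a map looks like from the client's point of view, here is a minimal Python sketch. It is not the client library's code: `partition_id`, `PartitionMap` and the node names "A" and "B" are hypothetical, and SHA-1 is only a stand-in for the server's real digest function. The one detail taken from the text above is the count of 4096 partitions per namespace.

```python
import hashlib

N_PARTITIONS = 4096  # partitions per namespace, as stated above


def partition_id(set_name: str, user_key: str) -> int:
    """Map a record key to one of the 4096 partitions.

    Assumption: SHA-1 stands in for the real digest function; only the
    "hash the key, then keep 12 bits" idea is illustrated here.
    """
    digest = hashlib.sha1((set_name + user_key).encode("utf-8")).digest()
    # 4096 == 2**12, so 12 bits of the digest select the partition.
    return int.from_bytes(digest[:2], "little") & (N_PARTITIONS - 1)


class PartitionMap:
    """Hypothetical client-side view: namespace -> replica list per partition."""

    def __init__(self):
        # namespace -> list of [master, prole, ...], indexed by partition id
        self._owners = {}

    def update(self, namespace, owners):
        if len(owners) != N_PARTITIONS:
            raise ValueError("expected one replica list per partition")
        self._owners[namespace] = owners

    def master_for(self, namespace, set_name, user_key):
        # The first entry of the replica list is the master copy's node.
        return self._owners[namespace][partition_id(set_name, user_key)][0]


# Usage: a made-up two-node cluster where ownership alternates by partition.
pmap = PartitionMap()
pmap.update(
    "test",
    [["A", "B"] if p % 2 == 0 else ["B", "A"] for p in range(N_PARTITIONS)],
)
node = pmap.master_for("test", "users", "user-42")
```

When a node leaves or joins, the replica lists change, and a client holding the old lists keeps sending requests to the previous owner until it refreshes its map.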

Because the environment is distributed, Aerospike clients regularly encounter situations where the partition map has changed and needs to be re-fetched. During these windows, exceptions such as timeouts may occur.

Disclaimer: In general, it is recommended to wait for migrations to complete between planned node service restarts to avoid potential loss of data updates, unless write traffic is shut down during such periods.

Common case - Aerospike node restarts

For example, let’s consider the case of a rolling restart (say to update the Aerospike version). As described below, there are several situations that can occur:

  • A read is in progress to the node going down, the cluster has not detected the node is down and the client still has the old partition map.
  • A write to the master is in progress to the node going down, the cluster has not detected the node is down and the client still has the old partition map.
  • A write to the prole is in progress to the node going down, the cluster has not detected the node is down and the client still has the old partition map.
  • A read is in progress to the node going down, the cluster has detected the node is down but the client still has the old partition map.
  • A write to the master is in progress to the node going down, the cluster has detected the node is down but the client still has the old partition map.
  • A write to the prole is in progress to the node going down, the cluster has detected the node is down but the client still has the old partition map.

Aerospike behavior

A direct answer for any of the above situations is that any pending transaction to a node that is not responding (Aerospike service stopped, or a hardware/network failure that takes the node temporarily or permanently out of the cluster) will receive a TIMEOUT error/exception on the client side, irrespective of whether the cluster has already detected the node's departure. The socket used for those transactions will also be closed.

In general and by default, when a write transaction times out, the record is left in an UNKNOWN state. The write may have succeeded on neither copy, on the master only, or on both master and prole. There is no way of telling whether the write succeeded on the node before it went down. The only guarantee is the ordering: a record's master copy is updated first, followed by its prole copy.

More information on understanding timeout exceptions:

Usefulness of retries

Q: Which of the above scenarios will be helped if there is a retry in the client policy?

A: In general, it is best not to retry immediately but to use some exponential back-off, to avoid unnecessarily overwhelming the network and the other nodes in the cluster.
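A minimal Python sketch of such a back-off is shown below. It is an application-level wrapper and makes no claim about any client library's built-in retry policy: `TransactionTimeout`, `call_with_backoff` and all parameter names and defaults are hypothetical, chosen only for illustration.

```python
import random
import time


class TransactionTimeout(Exception):
    """Hypothetical stand-in for the client library's timeout exception."""


def call_with_backoff(operation, max_attempts=4, base_delay=0.05,
                      max_delay=1.0, sleep=time.sleep):
    """Run operation(), retrying on timeout with exponential back-off."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransactionTimeout:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the timeout to the caller
            # The delay ceiling doubles on each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
```

Since a timed-out write leaves the record in an unknown state, a wrapper like this is safest around reads or writes that are idempotent; retrying a non-idempotent update could apply it twice.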

Node joining a cluster

Q: Will any exceptions be thrown when a node comes back up and re-joins a cluster?

A: No. When the node comes back, new socket connections are established, and new transactions on the new sockets will succeed.

Details on flow of events during cluster size changes on server and clients

A little more detail on steps and timing when a node departs a cluster:

On server side -

  • The cluster sees the node departure (typically heartbeat failures).
  • Paxos is restarted to agree on a new succession list (ordered list of live nodes in the cluster).
  • Paxos principal node sends the succession list to all non-principal nodes.
  • Upon receiving the message, each node independently updates its partition map, then starts rebalancing (migrations). Note that the timing of this can differ from node to node, but the algorithm is the same, so every node generates the same map because they all start from the same succession list.
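The last step above depends on the map being a deterministic function of the succession list. The Python sketch below illustrates that property only. The ranking rule (sorting nodes by a hash of the partition id and node id) is an assumption made for this example and is not the server's actual algorithm; `build_partition_map` and the node names are hypothetical.

```python
import hashlib

N_PARTITIONS = 4096


def build_partition_map(succession_list, replication_factor=2):
    """Return, for each partition, its replica nodes with the master first.

    Assumption: ranking nodes by a hash of (partition id, node id) is an
    illustrative rule, not the rule the server really uses.
    """
    replicas = min(replication_factor, len(succession_list))
    pmap = []
    for pid in range(N_PARTITIONS):
        # No randomness and no local state: the same input list always
        # produces the same ranking on every node.
        ranked = sorted(
            succession_list,
            key=lambda n: hashlib.sha1(f"{pid}:{n}".encode()).digest(),
        )
        pmap.append(ranked[:replicas])
    return pmap


# Two nodes computing independently from the same list agree on the map.
map_on_node_a = build_partition_map(["A", "B", "C"])
map_on_node_b = build_partition_map(["A", "B", "C"])
same_map = map_on_node_a == map_on_node_b

# After node "C" departs, the new map no longer references it.
map_after_departure = build_partition_map(["A", "B"])
```

Because no coordination is needed beyond agreeing on the succession list, each node can update its map at its own pace and still converge on the same ownership.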

Independently on the client side -

  • Every second, the client serially pings each of the server nodes and checks whether the partition map has changed. The ‘cluster_key’ is what the clients use today to check for cluster state changes.
  • If a node confirms a cluster state change, the client fetches a new partition map from that node.
  • The new partition map is returned, with the new partition ownerships. Note that the map does not have to come from the new owner; it comes from whichever node happens to be pinged first.
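The client-side steps above can be sketched as a single polling pass in Python. This is a simplified model, not the client library's implementation: `FakeNode`, `ClientView`, `tend_once` and their methods are hypothetical names, and a real client would make these calls over the network about once per second.

```python
class FakeNode:
    """Hypothetical node handle; a real client would query the server."""

    def __init__(self, name, cluster_key, partition_map):
        self.name = name
        self.cluster_key = cluster_key
        self.partition_map = partition_map

    def get_cluster_key(self):
        return self.cluster_key

    def get_partition_map(self):
        return dict(self.partition_map)


class ClientView:
    """Hypothetical client state: last seen cluster_key and partition map."""

    def __init__(self):
        self.cluster_key = None
        self.partition_map = {}

    def tend_once(self, nodes):
        """One polling pass; returns True if the partition map was refreshed."""
        for node in nodes:  # nodes are pinged serially, one after another
            try:
                key = node.get_cluster_key()
            except ConnectionError:
                continue  # node unreachable: move on to the next one
            if key != self.cluster_key:
                # Whichever node reports the change first serves the map;
                # it does not have to be the new owner of any partition.
                self.partition_map = node.get_partition_map()
                self.cluster_key = key
                return True
        return False


# Usage: the first pass learns the map, the second sees no change.
nodes = [FakeNode("A", "key-1", {0: "A"}), FakeNode("B", "key-1", {0: "A"})]
client = ClientView()
first_pass = client.tend_once(nodes)
second_pass = client.tend_once(nodes)
```

Between a cluster change and the next polling pass, the client still routes requests with the old map, which is the window in which the timeouts described earlier can occur.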

References

[1] Data Partitions: http://www.aerospike.com/docs/architecture/data-distribution.html

[2] Transaction Consistency: http://www.aerospike.com/docs/architecture/consistency.html

Keywords

Write read transaction

Timestamp

06/14/2016