How does Aerospike client find a node


#1

Synopsis

Customer often ask the process behind a client connecting to a particular node.

Resolution

Data Distribution via a Hash (Aerospike Smart Partitions)

In order to understand how a client knows about the nodes of a cluster we need to understand how data and traffic are distributed.

Aerospike uses a very random hash (the RIPEMD 160) to make sure that both data volume and traffic are evenly distributed. This occurs automatically and without the need for manual intervention.

To determine where a record should go, the record key (of any size) is hashed into a 20-byte fixed length string using RIPEMD160, and the first 12 bits form a partition ID which determines which of the partitions should contain this record. The partitions are distributed equally among the nodes in the cluster, so if there are N nodes in the cluster, each node stores approximately 1/N of the data.

“Partitions” are buckets of records that have been grouped together for the purpose of distribution.

In order to evenly divide data between different nodes, all data is mapped to one of 4096 partitions, based on the hash (digest) value. In turn each of these partitions is mapped to the different nodes. If the number of nodes change, the partitions will get remapped and transferred to the appropriate location in a process called “migration.”

Partition Map Every record in an Aerospike cluster will be mapped to one of 4096 partitions. The basis for this mapping is the hash of the key. 4 bytes of the hash are used to determine to which partition any given key belongs. These partitions are then divided among the nodes in the cluster in a partition map.

Because data is distributed evenly (and randomly) across nodes, there are no hot-spots or bottlenecks where one node handles significantly more requests than another node. To determine where a record should go, the record key (of any size) is hashed into a 20-byte fixed length string using RIPEMD160, and the first 12 bits form a partition ID which determines which of the partitions should contain this record. The partitions are distributed equally among the nodes in the cluster, so if there are N nodes in the cluster, each node stores approximately 1/N of the data.

There is no need for manual sharding. The nodes in the cluster coordinate among themselves to divide the partitions. The client detects cluster changes and sends requests to the correct node. When nodes are added or removed, the cluster automatically re-balances. All of the nodes in the cluster are peers – there is no single database master node that can fail and take the whole database down.

When the database creates a record, a hash of the record key is used to assign the record to a partition. Hashing is deterministic – that is, the hashing process always maps a given record to the same partition. Data records stay in the same partition for their entire life. Partitions may move from one server to another, but partitions would not normally split or reassign a record to another partition.

Info Protocol

Aerospike Client APIs will track cluster state changes. (addition, removal of nodes) At any given instant, the client uses the info protocol to communicate periodically with the cluster and maintain a list of nodes that form the cluster. It also uses the Aerospike Smart Partitions™ algorithm to determine which node stores a particular partition of data. Any changes to the cluster size are tracked automatically by the Client, and such changes are entirely transparent to the Application. In practice, this means that transactions will not fail during the transition, and the Application does not need to be restarted during node arrival and departure.


Primary Index
Accessing Forwarding Records From UDF
#3

As we know, the first 12 bits form a partition ID. So I wonder why it says '4 bytes of the hash are used to determine to which partition any given key belongs. ’


#4

Yes, @Kai_Guo is correct. Thanks for letting us know about this error, will see that this is corrected.


#5

In fact, we can say there are two mapping between a record and a node. The first is from record key to partition id by a very random hash (the RIPEMD 160) while the second is from partitions to nodes which puzzles me.

In Aerospike documents(Data Distribution), it says

Data is distributed evenly across nodes in a cluster using the Aerospike Smart Partitions™ algorithm.

In my view, this Smart Partition seems like this : if there are 3 nodes in cluster, then maybe node 1 has partition 0, 3, 6, …, node 2 has partition 1, 4, 7,…, node 3 has partition 2, 5, 8, … In this way, partitions are distributed relatively evenly.


#6

Kai,

Your understanding is correct. The part of record key hash (RIPEMD160) determines the partition id where the record should be written/read. The partition map is looked up to determine which node is master and replica for the specific partition id so that the read/write operation could be performed on the appropriate node & partition.

The record key hash is calculated for every read/write operation. The partition map could change with every cluster state change (node being added or removed from cluster) and hence the partition map is referred to find out which node the partition id belongs to.

-samir