Mechanism for distribution of partitions among nodes


#1

Hi,

I want to ask how the DHT is distributed among nodes.What is the mechanism by which all the partitionsID are evenly distributed among nodes ?


#2

The documentation has answers:

http://www.aerospike.com/docs/architecture/data-distribution.html


#3

In addition to Manigandham’s link, you could as well look at the source code @ https://github.com/aerospike/aerospike-server/blob/master/as/src/fabric/partition.c which implements the partition map and master/replica relationship for nodes. The code is fairly detailed comments as well for easy understanding.

-samir


#4

I have looked at the source code implementing the partition map and master/replica relationship for nodes. I also wrote a few lines of codes to output the number of partitions distributed on every nodes.

I noticed the secret of distribution evenly is the two hash algorithm of fnv-1a and one-at-a-time. Their random hash values map partitions to nodes magically. What is the magic?


#5

Excatly what does these two hash algorithm do in order to map partitions to nodes ?


#6

First, fnv-1a hashes the node id and partition id separately to two values. Then, one-at-a-time hashes the two values together to a new value. You could look at the details at the source code @ https://github.com/aerospike/aerospike-server/blob/master/as/src/fabric/partition.c (line 3138 to line 3177).


#7

You can select line numbers on github to highlight the exact code. Also this forum will include a preview.


#8

Hi,

Now, I convinced myself with a explanation on the secret.

In brief, the hash algorithm fnv-1a map partition id and node id(in succession list) into a value space(0~2^64-1), just like the method in consistent hash. Then, the one-at-a-time combines the two hash values above and map the pair into a new value space(0~2^64-1).

For the pairs consisting of 4096 partitions and n(the number of nodes in cluster) nodes, each hash value after two hash algorithm is similar in randomness and fairness.

Randomness means each pair corresponds to a value in value space. The probability of value conflict(the hash advantage) is very very low.

Fairness is for each partition row. I name a partition’s all replica nodes as a partition row. Assume that there are 4 nodes(101, 102, 103, 104) in cluster. For partition 0, its replicas order may be (103, 101, 102 ,104). For partition 100, the order may be (102, 104, 101, 103). In both rows, the order is fair, which means the probability of node 101 becoming the first replica node of partition 0 is the same with the probability of node 102. It happens that partition 0’s first replica is node 101 while partition 100’s first is node 102. In a word, for 4096 partitions, every node is likely to be the first replica. In fact, we also found every node nearly shares the same number of partitions.

The experiment result is as follows.

In this figure, there are 4 nodes (303, 201, 101, 100) in cluster. The replica factor is 2. For the first replica of 4096 partitions, the share is (1022, 1106, 971, 997). For the second replica of 4096 partitions, the share is (1023, 1009, 1045, 1019).

After the two hash, the sort process produces random but fair order for each partition’s all replica nodes.

I hope anybody familiar with it could give any advice on it. Thank you!