Mechanism for distribution of partitions among nodes

nOOb · September 12, 2015, 2:50pm

Hi,

I want to ask how the DHT is distributed among nodes.What is the mechanism by which all the partitionsID are evenly distributed among nodes ?

manigandham · September 13, 2015, 4:47am

The documentation has answers:

http://www.aerospike.com/docs/architecture/data-distribution.html

samir · September 14, 2015, 4:34am

In addition to Manigandham’s link, you could as well look at the source code @ aerospike-server/partition.c at master · aerospike/aerospike-server · GitHub which implements the partition map and master/replica relationship for nodes. The code is fairly detailed comments as well for easy understanding.

-samir

Kai_Guo · September 14, 2015, 8:42am

I have looked at the source code implementing the partition map and master/replica relationship for nodes. I also wrote a few lines of codes to output the number of partitions distributed on every nodes.

I noticed the secret of distribution evenly is the two hash algorithm of fnv-1a and one-at-a-time. Their random hash values map partitions to nodes magically. What is the magic?

nOOb · September 14, 2015, 9:15am

Excatly what does these two hash algorithm do in order to map partitions to nodes ?

Kai_Guo · September 14, 2015, 9:33am

First, fnv-1a hashes the node id and partition id separately to two values. Then, one-at-a-time hashes the two values together to a new value. You could look at the details at the source code @ https://github.com/aerospike/aerospike-server/blob/master/as/src/fabric/partition.c (line 3138 to line 3177).

manigandham · September 17, 2015, 5:59pm

github.com

aerospike/aerospike-server/blob/master/as/src/fabric/partition.c#L3138-L3177


      
          /*
           * partition.c
           *
           * Copyright (C) 2008-2021 Aerospike, Inc.
           *
           * Portions may be licensed to Aerospike, Inc. under one or more contributor
           * license agreements.
           *
           * This program is free software: you can redistribute it and/or modify it under
           * the terms of the GNU Affero General Public License as published by the Free
           * Software Foundation, either version 3 of the License, or (at your option) any
           * later version.
           *
           * This program is distributed in the hope that it will be useful, but WITHOUT
           * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
           * FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more
           * details.
           *
           * You should have received a copy of the GNU Affero General Public License
           * along with this program.  If not, see http://www.gnu.org/licenses/

This file has been truncated. show original

You can select line numbers on github to highlight the exact code. Also this forum will include a preview.

Kai_Guo · September 23, 2015, 5:24am

Hi,

Now, I convinced myself with a explanation on the secret.

In brief, the hash algorithm fnv-1a map partition id and node id(in succession list) into a value space(0~2^64-1), just like the method in consistent hash. Then, the one-at-a-time combines the two hash values above and map the pair into a new value space(0~2^64-1).

For the pairs consisting of 4096 partitions and n(the number of nodes in cluster) nodes, each hash value after two hash algorithm is similar in randomness and fairness.

Randomness means each pair corresponds to a value in value space. The probability of value conflict(the hash advantage) is very very low.

Fairness is for each partition row. I name a partition’s all replica nodes as a partition row. Assume that there are 4 nodes(101, 102, 103, 104) in cluster. For partition 0, its replicas order may be (103, 101, 102 ,104). For partition 100, the order may be (102, 104, 101, 103). In both rows, the order is fair, which means the probability of node 101 becoming the first replica node of partition 0 is the same with the probability of node 102. It happens that partition 0’s first replica is node 101 while partition 100’s first is node 102. In a word, for 4096 partitions, every node is likely to be the first replica. In fact, we also found every node nearly shares the same number of partitions.

The experiment result is as follows.

In this figure, there are 4 nodes (303, 201, 101, 100) in cluster. The replica factor is 2. For the first replica of 4096 partitions, the share is (1022, 1106, 971, 997). For the second replica of 4096 partitions, the share is (1023, 1009, 1045, 1019).

After the two hash, the sort process produces random but fair order for each partition’s all replica nodes.

I hope anybody familiar with it could give any advice on it. Thank you!

Topic		Replies	Views
Partition map logic & partitioning algorithm How Aerospike Works index	6	4472	April 27, 2020
Replication Factor Question How Aerospike Works	1	3311	December 31, 2014
How does Aerospike distribute data/traffic? Feature Discussion	2	2248	October 16, 2014
Uneven partition distribution Operations	3	1745	June 13, 2019
How Aerospike manage partition map? Migration	6	1945	November 11, 2019

Mechanism for distribution of partitions among nodes

Related topics