How Data is Distributed

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Synopsis

A user wants to know how evenly their data is distributed across the nodes of their cluster.

To determine how data is distributed across the cluster, we start with the master object count that each node holds for the namespace. The following example is from a 12-node cluster:

=== NAMESPACE ===               
ip/namespace                                       Master Objects
aerospike1-12/test-ns                              34,254,650
aerospike1-11/test-ns                              32,121,480
aerospike1-1/test-ns                               31,079,740
aerospike1-7/test-ns                               31,075,307
aerospike1-9/test-ns                               29,874,205
aerospike1-8/test-ns                               29,120,584
aerospike1-10/test-ns                              29,112,686
aerospike1-5/test-ns                               28,685,034
aerospike1-6/test-ns                               27,043,474
aerospike1-2/test-ns                               26,793,969
aerospike1-3/test-ns                               26,620,033
aerospike1-4/test-ns                               25,924,147

To find out whether the data in this cluster is distributed normally, we need three values: the mean (average), the variance (the average of the squared differences from the mean), and the standard deviation (a measure of the amount of variation or dispersion in a set of data values).
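For reference, the three quantities can be written out as follows. The symbols are introduced here for illustration and are not from the original article: x_i is the master object count on node i and n is the number of nodes. The sample form of the variance (dividing by n - 1) is shown because that is the form that reproduces the variance figure quoted in this article.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
\qquad
s = \sqrt{s^2}
```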

To find the mean (average), we take the total number of master objects in the cluster, 351,705,309, and divide it by the number of nodes, which is 12 in this example. This gives 29,308,776. Next, we subtract the mean from each node's object count, square each difference, sum the squares, and divide by 11 (one less than the number of nodes, which gives the sample variance). The result is 6,323,040,910,769.

The standard deviation is then simply the square root of the variance, which is 2,514,566.
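The arithmetic above can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the list name `master_objects` is our own, and the values are copied by hand from the namespace output earlier in the article. `statistics.variance` computes the sample variance (dividing by n - 1).

```python
import math
import statistics

# Master object counts per node, copied from the namespace output above.
master_objects = [
    34_254_650, 32_121_480, 31_079_740, 31_075_307,
    29_874_205, 29_120_584, 29_112_686, 28_685_034,
    27_043_474, 26_793_969, 26_620_033, 25_924_147,
]

total = sum(master_objects)
mean = total / len(master_objects)

# Sample variance: sum of squared differences from the mean, divided by n - 1.
variance = statistics.variance(master_objects)

# Standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(f"total    = {total:,}")
print(f"mean     = {mean:,.0f}")
print(f"variance = {variance:,.0f}")
print(f"std dev  = {std_dev:,.0f}")
```

The printed values should agree with the figures quoted above, up to rounding.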

Now we can work out the ranges that lie within one standard deviation (2,514,566), two standard deviations (5,029,131), and three standard deviations (7,543,697) of the mean, and see which nodes fall inside each range.

Within 1 standard deviation (2,514,566), about 68% of a normal distribution: 26,794,210 - 31,823,341
Within 2 standard deviations (2,514,566 * 2 = 5,029,131), about 95% of a normal distribution: 24,279,644 - 34,337,907
Within 3 standard deviations (2,514,566 * 3 = 7,543,697), about 99.7% of a normal distribution: 21,765,079 - 36,852,473
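To check which nodes actually fall inside each range, the bands can be computed and the nodes counted. This is a sketch under the same assumptions as the figures above: the dictionary `master_objects` is our own name, its values are copied from the namespace output, and `statistics.stdev` uses the sample (n - 1) form.

```python
import statistics

# Master object counts per node, copied from the namespace output above.
master_objects = {
    "aerospike1-12": 34_254_650, "aerospike1-11": 32_121_480,
    "aerospike1-1": 31_079_740, "aerospike1-7": 31_075_307,
    "aerospike1-9": 29_874_205, "aerospike1-8": 29_120_584,
    "aerospike1-10": 29_112_686, "aerospike1-5": 28_685_034,
    "aerospike1-6": 27_043_474, "aerospike1-2": 26_793_969,
    "aerospike1-3": 26_620_033, "aerospike1-4": 25_924_147,
}

counts = list(master_objects.values())
mean = statistics.mean(counts)
std_dev = statistics.stdev(counts)  # sample standard deviation (n - 1)

within = {}
for k in (1, 2, 3):
    low, high = mean - k * std_dev, mean + k * std_dev
    # Nodes whose master object count lies inside mean +/- k standard deviations.
    inside = [node for node, count in master_objects.items() if low <= count <= high]
    within[k] = len(inside)
    print(f"{k} SD: {low:,.0f} - {high:,.0f} -> {len(inside)} of {len(counts)} nodes")
```

With these figures, 7 of the 12 nodes fall within one standard deviation of the mean, and all 12 fall within two.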

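As the table below shows, grouping the nodes by master object count (rounded up to the next million) gives a distribution that roughly follows a normal, bell-shaped curve: most nodes sit near the mean of about 29.3 M, and every node is within two standard deviations of it.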

# of RECORDS          # of NODES
26 M                  1             #
27 M                  2             ##
28 M                  1             #
29 M                  1             #
30 M                  3             ###
32 M                  2             ##
33 M                  1             #
35 M                  1             #
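The table above can be regenerated from the per-node counts. The sketch below assumes the bins were formed by rounding each count up to the next million, which is the grouping that reproduces the rows shown; the original article does not state how the bins were chosen. The name `master_objects` is ours.

```python
import math
from collections import Counter

# Master object counts per node, copied from the namespace output above.
master_objects = [
    34_254_650, 32_121_480, 31_079_740, 31_075_307,
    29_874_205, 29_120_584, 29_112_686, 28_685_034,
    27_043_474, 26_793_969, 26_620_033, 25_924_147,
]

# Assumption: each count is rounded up to the next million to pick its bin.
bins = Counter(math.ceil(count / 1_000_000) for count in master_objects)

print("# of RECORDS".ljust(22) + "# of NODES")
for millions in sorted(bins):
    nodes = bins[millions]
    # One '#' per node in the bin, giving a simple text histogram.
    print(f"{millions} M".ljust(22) + str(nodes).ljust(14) + "#" * nodes)
```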

Additional info on standard deviation: Standard Deviation and Variance
