I have a cluster of three nodes running Aerospike 3.6.0. Last week I upgraded the memory on each node, using the following process:
- Stop Aerospike on one node
- Speed up the migration process on the other nodes
- As soon as the migration has finished, start the Aerospike service on the upgraded node
- Slow the migration back down by restoring the default settings
When the service starts, the cluster has 1-2 minutes of downtime. This downtime recurred each time I upgraded a node.
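To make the "migration has finished" check in step 3 concrete, here is a minimal sketch of how it could be automated from the `info service` output shown below. The helper names `parse_migrates` and `cluster_settled` are hypothetical, and the rule (every node reports the expected cluster size and a Migrates pair of `(0,0)`) is my assumption about what "finished" means, not an official Aerospike check.

```python
import re

def parse_migrates(cell):
    """Parse a Migrates cell such as "(2798,0)" into a pair of ints.

    Returns None for anything else, e.g. "N/E" when a node did not answer.
    """
    match = re.fullmatch(r"\((\d+),(\d+)\)", cell.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def cluster_settled(rows, expected_size):
    """rows: list of (node, cluster_size, migrates_cell) tuples.

    Assumption: the cluster is settled only when every node reports the
    expected cluster size and has no migrations pending.
    """
    for _node, size, cell in rows:
        counters = parse_migrates(cell)
        if counters is None or size != expected_size or counters != (0, 0):
            return False
    return True

# Values copied from the 'info service' snapshots in this post.
before = [("v-5", 2, "(0,0)"), ("v-6", 2, "(0,0)")]
after_restore = [
    ("v-5", 3, "(1208,1)"),
    ("v-6-07-2", 3, "(1160,0)"),
    ("v-6-07-4", 3, "(3123,2)"),
]
print(cluster_settled(before, 2))
print(cluster_settled(after_restore, 3))
```

With the values above, the two-node cluster counts as settled before the restart, while the three-node cluster after the restore is still migrating.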
Here is the namespace info:
Before the node under maintenance starts
[ 2016-07-20 09:56:57 'info namespace' sleep: 5.0s iteration: 2928 ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Namespace Avail% Evictions Master Replica Repl Stop Disk Disk HWM Mem Mem HWM Stop
. . . . Objects Objects Factor Writes Used Used% Disk% Used Used% Mem% Writes%
<node ip addr> N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E
v-5 ssd0 24 146319233 682.371 M 726.184 M 2 false 244.619 GB 66 50 83.956 GB 58 60 90
v-6 ssd0 26 56304299 726.151 M 682.376 M 2 false 244.615 GB 66 50 83.955 GB 54 60 90
when node has started
[ 2016-07-20 09:57:02 'info namespace' sleep: 5.0s iteration: 2929 ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Namespace Avail% Evictions Master Replica Repl Stop Disk Disk HWM Mem Mem HWM Stop
. . . . Objects Objects Factor Writes Used Used% Disk% Used Used% Mem% Writes%
<node ip addr> ssd0 29 0 430.650 M 0.000 2 false 118.855 GB 32 50 40.712 GB 27 60 90
v-5 ssd0 24 146319233 682.360 M 726.184 M 2 false 244.617 GB 66 50 83.956 GB 58 60 90
v-6 ssd0 26 56304299 726.151 M 682.364 M 2 false 244.613 GB 66 50 83.954 GB 54 60 90
when downtime has started
[ 2016-07-20 09:57:08 'info namespace' sleep: 5.0s iteration: 2930 ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Namespace Avail% Evictions Master Replica Repl Stop Disk Disk HWM Mem Mem HWM Stop
. . . . Objects Objects Factor Writes Used Used% Disk% Used Used% Mem% Writes%
v-5 ssd0 N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E
v-6-07-2 ssd0 N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E
v-6-07-4 ssd0 29 0 430.650 M 0.000 2 false 118.855 GB 32 50 40.712 GB 27 60 90
during downtime
[ 2016-07-20 09:57:16 'info namespace' sleep: 5.0s iteration: 2931 ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Namespace Avail% Evictions Master Replica Repl Stop Disk Disk HWM Mem Mem HWM Stop
. . . . Objects Objects Factor Writes Used Used% Disk% Used Used% Mem% Writes%
v-5 N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E
v-6-07-2 N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E
v-6-07-4 ssd0 29 0 430.650 M 0.000 2 false 118.855 GB 32 50 40.712 GB 27 60 90
when the cluster has been restored
[ 2016-07-20 09:59:39 'info namespace' sleep: 5.0s iteration: 2949 ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Namespace Avail% Evictions Master Replica Repl Stop Disk Disk HWM Mem Mem HWM Stop
. . . . Objects Objects Factor Writes Used Used% Disk% Used Used% Mem% Writes%
v-5 ssd0 28 146319233 463.015 M 245.189 M 2 false 161.356 GB 44 50 55.234 GB 38 60 90
v-6-07-2 ssd0 27 56304299 492.549 M 222.326 M 2 false 164.149 GB 45 50 56.534 GB 37 60 90
v-6-07-4 ssd0 31 0 430.658 M 0.000 2 false 118.857 GB 32 50 40.713 GB 27 60 90
Here is the service info:
Before:
[ 2016-07-20 09:56:59 'info service' sleep: 5.0s iteration: 2897 ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Build Cluster Cluster Cluster Free Free Migrates Principal Objects Uptime
. . Size Visibility Integrity Disk% Mem% . . . .
<node ip addr> N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E
v-5 3.6.0 2 True True 34 42 (0,0) v-5 1.432 G 145:58:53
v-6 3.6.0 2 True True 34 45 (0,0) v-5 1.432 G 49:04:24
Number of rows: 3
when node has started
[ 2016-07-20 09:57:04 'info service' sleep: 5.0s iteration: 2898 ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Build Cluster Cluster Cluster Free Free Migrates Principal Objects Uptime
. . Size Visibility Integrity Disk% Mem% . . . .
<node ip addr> 3.6.0 3 False True 67 73 (2798,0) BB9C5910842A844 701.995 M 01:12:39
v-5 3.6.0 3 True True 35 42 (0,0) BB9C5910842A844 1.424 G 145:58:58
v-6 3.6.0 3 True True 35 46 (0,0) BB9C5910842A844 1.425 G 49:04:29
during downtime
[ 2016-07-20 09:57:15 'info service' sleep: 5.0s iteration: 2899 ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Build Cluster Cluster Cluster Free Free Migrates Principal Objects Uptime
. . Size Visibility Integrity Disk% Mem% . . . .
v-5 N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E
v-6-07-2 N/E N/E N/E N/E N/E N/E N/E N/E N/E N/E
v-6-07-4 3.6.0 3 True True 67 73 (2808,2) v-6-07-4 701.995 M 01:12:45
Number of rows: 3
after restore
[ 2016-07-20 09:59:49 'info service' sleep: 5.0s iteration: 2912 ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Build Cluster Cluster Cluster Free Free Migrates Principal Objects Uptime
. . Size Visibility Integrity Disk% Mem% . . . .
v-5 3.6.0 3 False True 57 61 (1208,1) v-6-07-4 942.102 M 146:01:43
v-6-07-2 3.6.0 3 False True 56 63 (1160,0) v-6-07-4 964.572 M 49:07:15
v-6-07-4 3.6.0 3 True True 67 73 (3123,2) v-6-07-4 702.005 M 01:15:24
Number of rows: 3
My questions are:
- What is going wrong with my cluster?
- What should I do to increase cluster availability when adding nodes?