Aerospike Partition Migration Internals

Is there any documentation available to understand in detail how Aerospike handles migrations internally? We have an Aerospike cluster (v3.8.1, replication factor 2) in Google Cloud, and every few hours rebalancing and migrations are triggered, likely due to network fluctuations. Afterwards, AMC shows the replica object count as significantly lower than the master object count.
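
For reference, the same counts can also be checked directly on a node with asinfo; the namespace name "test" below is just a placeholder, and the exact statistic names may differ slightly between 3.x versions:

```
# Master vs. replica (prole) object counts for one namespace on this node.
asinfo -v "namespace/test" | tr ';' '\n' | grep -E "master-objects|prole-objects"

# Check whether partition migrations are still in progress on this node.
asinfo -v "statistics" | tr ';' '\n' | grep -i migrate
```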

Since there is no disk or node loss, I want to understand whether the replication factor is still maintained for all partitions during the migrations in this scenario.

Also, if a node actually goes down while migrations triggered by intermittent network fluctuations are still in progress, is there a possibility of permanent data loss (both master and replica copies lost)?

Thanks

Replication factor was not always maintained during migrations prior to the paxos-protocol switch in 3.13, so there is a potential for data loss in the scenario you are experiencing. The cluster instability may also be attributable in part to the older clustering algorithms used before the protocol change.

The new protocols do maintain the replication factor during migrations and also address many other concerns.
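
Once every node is upgraded to 3.13, the switch to the new clustering protocol is applied dynamically; the command below is only a sketch of that step, so follow the official 3.13 "jump version" documentation for the full procedure:

```
# Run once after all nodes are on 3.13: switches clustering from the old
# paxos protocol (v3) to the new protocol (v5). Sketch only - follow the
# official 3.13 jump-version upgrade documentation.
asadm -e "asinfo -v 'set-config:context=service;paxos-protocol=v5'"
```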


By the way, we have seen Google Cloud live migrations cause this sort of issue. You may want to consider bumping your heartbeat.timeout a bit. The default heartbeat interval is 150 ms and the timeout is 10 intervals, so if a live migration pauses a node for longer than 1.5 seconds the cluster will break. Consider increasing the timeout to 20 (see the config sketch below); doing so will also increase the duration of write timeouts experienced by clients when restarting nodes.
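
For reference, a minimal aerospike.conf fragment with those two parameters might look like the following; only the heartbeat sub-stanza is shown, and mesh mode plus the surrounding addresses and ports are assumed to match your existing configuration:

```
network {
    heartbeat {
        mode mesh        # or multicast, depending on your deployment
        interval 150     # ms between heartbeats (default)
        timeout 20       # missed heartbeats before a node is considered gone
                         # (default 10; 20 * 150 ms = 3 s of tolerance)
    }
}
```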

Also, for Enterprise users, there are scripts to drain Aerospike ahead of a Google maintenance event and prevent timeouts during live migrations; see the aerospike/aerospike-google-maintenance repository on GitHub.

Thanks @kporter. I increased the timeout and it completely stopped the migrations. In parallel, I tried to map the live-migration events reported by Google Cloud logging to the migration spikes in Aerospike, but they did not match one-to-one. I'll check further on this.