Aerospike Partition Migration Internals

Is there any documentation available to understand in detail how Aerospike handles migrations internally? We have an Aerospike cluster (v3.8.1, replication factor 2) in Google Cloud, and every few hours rebalancing and migrations are triggered, likely due to network fluctuations. Afterwards, AMC shows the replica object count as significantly lower than the master object count.
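
For reference, the same counts can also be checked directly on a node with asinfo; the namespace name "test" below is just a placeholder, and the exact statistic names may differ slightly between 3.x versions:

```
# Master vs. replica (prole) object counts for one namespace on this node.
asinfo -v "namespace/test" | tr ';' '\n' | grep -E "master-objects|prole-objects"

# Check whether partition migrations are still in progress on this node.
asinfo -v "statistics" | tr ';' '\n' | grep -i migrate
```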

Since there is no disk or node loss, I want to understand whether the replication factor is still maintained for all partitions during the migrations in this scenario.

Also, if a node actually goes down while migrations triggered by intermittent network fluctuations are still in progress, is there a possibility of permanent data loss (both master and replica copies lost)?

Thanks

Replication factor was not always maintained during migrations prior to the paxos-protocol switch in 3.13, so there is a potential for data loss in the scenario you are experiencing. The cluster instability may also be attributable in part to the older clustering algorithms used before the protocol change.

The new protocols do maintain the replication factor during migrations and also address many other concerns.
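
Once every node is upgraded to 3.13, the switch to the new clustering protocol is applied dynamically; the command below is only a sketch of that step, so follow the official 3.13 "jump version" documentation for the full procedure:

```
# Run once after all nodes are on 3.13: switches clustering from the old
# paxos protocol (v3) to the new protocol (v5). Sketch only - follow the
# official 3.13 jump-version upgrade documentation.
asadm -e "asinfo -v 'set-config:context=service;paxos-protocol=v5'"
```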


By the way, we have seen Google Cloud live migrations cause this sort of issue. You may want to consider bumping your heartbeat.timeout a bit. The default heartbeat interval is 150 ms and the timeout is 10 intervals, so if a live migration pauses a node for longer than 1.5 seconds the cluster will break. Consider increasing the timeout to 20 (see the config sketch below); doing so will also increase the duration of write timeouts experienced by clients when restarting nodes.
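
For reference, a minimal aerospike.conf fragment with those two parameters might look like the following; only the heartbeat sub-stanza is shown, and mesh mode plus the surrounding addresses and ports are assumed to match your existing configuration:

```
network {
    heartbeat {
        mode mesh        # or multicast, depending on your deployment
        interval 150     # ms between heartbeats (default)
        timeout 20       # missed heartbeats before a node is considered gone
                         # (default 10; 20 * 150 ms = 3 s of tolerance)
    }
}
```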

Also, for Enterprise users, there are scripts to drain Aerospike ahead of a Google maintenance event and prevent timeouts during live migrations; see the aerospike/aerospike-google-maintenance repository on GitHub.

Thanks @kporter. I increased the timeout and it completely stopped the migrations. In parallel, I tried to map the live-migration events reported by Google Cloud logging to the migration spikes in Aerospike, but they did not match one-to-one. I'll check further on this.