What is the delay between a node dying and rebalancing starting?


#1

Hello, I am trying to tune Aerospike's parameters. I understand that with a replication factor of 2, for example, each record is written to 2 different nodes. But what if one node dies? The documentation says that the partitions the dead node held will be migrated to other nodes. Here is the case: the partitions are large, say a few GB, but the dead node may come back after a restart within a short time, say a few minutes. So the trade-off is this: if we wait too long for the dead node to return, the replication factor cannot be guaranteed. If we migrate the (large) partitions to another node right away, it consumes a lot of time and bandwidth. The dead node might also come back during the migration, so what happens then? I would like to set how long the cluster waits for the dead node to return. How do I set that? Thank you.


#2

Is there any way to set a cluster migration delay / timeout? Thank you.


#3

#4

I see we can speed up or slow down (or even pause) the migration process. But suppose a node becomes unresponsive for just a few seconds or minutes because of a network issue or a problem on the node itself: it leaves the cluster, a migration happens, then it joins the cluster again, and another migration happens. Rebalancing takes time and causes a high error rate on the API side. So my question is: is there any way to delay the start of migration in this case, e.g. wait for a timeout/threshold of a few minutes before starting the migration when a node leaves the cluster? Thank you.


#5

There isn’t a way to delay rebalance.

What errors are you seeing?

Which client?

Which version of Aerospike Server?


#6

Mostly execution timeout errors on the client when doing batch gets during migration. Our client is Go 1.29/1.30 and the server is C-3.14. Thanks.


#7

Go 1.30 added the newer batch API; the older API had issues during migration.

With the new batch (and other transactions), I would expect a spike in timeouts for the few seconds it takes Aerospike to discover the cluster change (typically 2-4 but may differ depending on heartbeat settings). Afterward, there may be a slight increase in timeouts over normal with the default migration settings, as the defaults try not to be overly aggressive.


#8

So are the timeouts during migration caused by the client continuing to send requests to the dead node, or by the cluster being unstable at that time? I ask because I also see timeout errors for requests to alive nodes.

@kporter With the new batch (and other transactions), I would expect a spike in timeouts for the few seconds it takes Aerospike to discover the cluster change (typically 2-4 but may differ depending on heartbeat settings).

About the heartbeat settings: can we try increasing the timeout value? Which is better for the timeout issue? It comes back to the question above: do we accept an unstable cluster for longer (slower discovery of the cluster change, which delays the start of migration), or should the dead node be detected quickly so that the rebalance happens?

@kporter Afterward, there may be a slight increase in timeouts over normal with the default migration settings, as the defaults try not to be overly aggressive.

Should we speed up migration (increase migrate-threads) to shorten the rebalance time at the cost of more timeouts, or keep the default settings and accept a slight increase in timeouts over a longer duration?

Thanks.


#9

Timeouts to either the dead or alive nodes are expected during the window between the node leaving the cluster and the cluster discovering that the node is gone. This is because requests to alive nodes may be trying to replicate to the dead node.

The heartbeat interval * timeout determines how quickly the cluster can detect a change. Increasing these values will cause a larger window of increased timeouts. The settings need to be balanced with your environment. The default values (interval: 150, timeout: 10) assume a reliable network; we often find cloud environments less reliable and suggest increasing the interval to 250, but this also increases the window where your clients will see a spike in timeouts from 1.5 seconds to 2.5 seconds.
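To make the arithmetic concrete, here is a minimal Python sketch of that window calculation. It assumes the interval is expressed in milliseconds and the timeout is a count of missed heartbeats, which is consistent with the 1.5 and 2.5 second figures above. The function name detection_window_seconds is made up for illustration and is not part of any Aerospike tool or client.

```python
def detection_window_seconds(interval_ms: int, timeout: int) -> float:
    """Approximate time for the cluster to notice a node is gone.

    Assumption: interval_ms is the heartbeat interval in milliseconds and
    timeout is the number of missed heartbeats tolerated, so the window is
    simply their product.
    """
    return interval_ms * timeout / 1000.0


if __name__ == "__main__":
    # Default settings versus the interval suggested for less reliable networks.
    for interval_ms, timeout in [(150, 10), (250, 10)]:
        window = detection_window_seconds(interval_ms, timeout)
        print(f"interval={interval_ms}, timeout={timeout} -> {window} s")
```

With the two pairs above this gives 1.5 s and 2.5 s. Plugging in your own candidate values shows how long clients would likely see the timeout spike before the cluster reacts.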

This is dependent on what is acceptable for your use case. Note that migrations/rebalance typically result in far less load on Aerospike Enterprise since only the data which has diverged is migrated.