Aerospike multisite cluster - resiliency

shubham3582 · May 15, 2024, 5:01pm

In multisite cluster (3 regions - 3 nodes in each region with replica factor 3). One node went offline and it causes failures in the write workload of the cluster for approx 12 seconds, then aerospike created temp replica and make another replica as master partition and workload processed successfully.

Please advise how to mitigate the number of failures in multisite cluster.

pgupta · May 16, 2024, 6:24pm

If a node just dies, not much can be done about the latency involved in a) cluster detecting a node just went out for good (heartbeat count related) and then the client rediscovering the partition map - 1 sec tend interval.

Without going into too much detail, if you used default heartbeat count (timeout) and interval settings (10, 150 ms), cluster will detect a node has left in somewhere around 2.3 to 3.3 seconds. (Cluster does not kick a node out just because a few heartbeats were missed). Then, worst case client will detect new partition map after 1 second (tend interval). Having said that, 12 seconds seems very long. I could buy 4.5 seconds worst case. What are your heartbeat settings for interval and timeout?

 heartbeat {
    mode mesh
    port 3002
    address any
    # mesh-seed-address-port 172.xx.yy.zz 3002
    interval 150
    timeout 10
  }

However, if you are bringing a node down for maintenance, you can use the quiesce feature to remove it gracefully without any client side issues.

shubham3582 · May 16, 2024, 7:59pm

@pgupta currently we have timeout 10 and interval 250. Also aerospike client have 5000 for timeout.

pgupta · May 16, 2024, 8:11pm

10 / 250 puts you into more like 3+ to 4+ seconds for cluster to detect a node died. So that combined with the timeout perhaps explains the higher number you are seeing but nodes shouldn’t be just dying suddenly.

If you are bringing them down for maintenance or testing, then use quiescing method and clients will not see any transaction timeout. So which is one is your issue - sudden node death or operator bringing the node down?

shubham3582 · May 17, 2024, 1:06pm

thanks @pgupta for explanation. We updated settings as 100 intervals and 10 timeout, we will test it again and keep you posted.

pgupta · May 17, 2024, 3:30pm

Please note, making the numbers too tight is also counter-productive as it increases the probability of the cluster falling apart due to network latencies, glitches etc. of the heartbeat signal.

This is a good FAQ on the topic: FAQ What is the Quantum Interval and how does it affect cluster reformation time?

Ali.ambere · May 23, 2024, 10:22am

Hello, As per my knowledge To mitigate failures in a multisite cluster like yours, ensure to:

Implement proactive monitoring to detect node failures promptly.
Optimize network configurations to reduce latency between regions.
Fine-tune replica and partition configurations to balance workload distribution.
Implement automatic failover mechanisms to quickly recover from node failures.
Regularly test failover procedures to ensure smooth transition during failures. I hope this will help you, Thank you

pgupta · May 23, 2024, 4:51pm

@Ali.ambere are you currently using Aerospike?

shubham3582 · May 24, 2024, 4:38pm

@pgupta please advise how to avoid lock while performing read operation. We are running a POT where there are no updates in set, only insert and reads are performed. Read locking is killing the performance in this POT. It’s multisite cluster with strong consistency as mentioned in this thread.

pgupta · May 24, 2024, 5:15pm

In the entire read transaction, the only time read takes a sprig lock is during reading data from storage device.
Flow diagram, bit outdated - (no tt - transaction thread anymore, all handled by st - service thread, regardless) but still relevant to your question:

For inserts, the internal flow is (again no tt, just st now and some other changes - but still relevant to your question):

And the sprig lock is taken briefly during write-master:

For reads and inserts to contend with each other, they have to be happening simultaneously on the same sprig. We have fixed 256 sprig-locks per partition, 4K partitions - so a million locks. Cannot change that even if you increase partition-tree-sprigs. First 8 bits of sprig id (8 to 28 max) are that sprig’s sprig lock id. I am really surprised that you are having lock contention between reads and inserts. Something else may be going on. Perhaps open a support ticket and they can figure out from the logs etc.

shubham3582 · May 24, 2024, 5:30pm

Thanks @pgupta

We are getting issue between multiple threads are reading same rows. Some threads are able to process within 5-8 ms but others are taking 1200-2000ms (approx 7000 rows has been read) by 30 concurrent threads.

Please advise how to avoid it. If read have lock, same bottleneck will be there.

This config seems invalid for SC Mode:

May 24 17:02:39 ip-10-3-0-134 asd[1788]: May 24 2024 17:02:39 GMT: WARNING (info): (cfg_info.c:1681) {ASPOTNS} 'read-consistency-level-override' is not applicable with 'strong-consistency'

Thanks Shubham

pgupta · May 24, 2024, 8:20pm

Configuration parameter read-consistency-level-override does not apply to to Strong Consistency Mode - so that is a configuration error on your server. It is applicable in AP mode only.

strong-consistency enabled namespaces always duplicate resolve when migrations are ongoing and consult the different potential versions of a record before returning to the client. This configuration is therefore not available for strong-consistency enabled namespaces.

pgupta · May 24, 2024, 8:27pm

If you are reading the same record from multiple threads - i.e. you have hotkey read situation, turning read-page-cache on may help.

shubham3582 · May 25, 2024, 1:09am

@pgupta read-page-cache is already enabled but still not getting the response time for reads under 250 ms with 30 threads. Please advise if there are any other way to handle this.

pgupta · May 25, 2024, 5:31pm

Please see the read microbenchmarks slide that I shared above. You will have to enable read micro benchmarks and investigate where you are losing time. The microbenchmarks will give a breakdown of time being used in various read phases internal to the server. That will give an insight into the root cause of the latency.

This docs page explains the read micro benchmarks and how to enable them - Histograms from Aerospike Logs | Aerospike Documentation

What you are currently measuring the application is the total response time … see analogy below:

Auto enabled read macro benchmark tells you what reads are actually taking at the server – the analogous “service time”. (Ignore transaction threads in fig. now all work done by svc threads.)

Threading at application level does not help with hotkeys - its like multiple clients connecting to the server. The same key (record) is served - first come first serve. (Hot key problem.) By enabling micro benchmarks you can identify which phase of the total read path is the dominant source of latency. First we have to do that, then one can think of making it better.

In a stable sc cluster, where nodes have not gone in and out, you will not have any duplicate-resolution unless you are doing Linearized read (read transaction sc mode policy - I doubt if you have set that true, default is false).

These are the default histograms you can see in your logs … first see what the read histogram looks like, and kindly share.

Understanding this output …

You can use asadm to analyze the log file:

Then, once you enable readm microbenchmarks, you can dive deeper and analyze as below:

Look at read-restart - that indicates potention hotkey issue.

Also look at read-local

Then if needed, enable storage-benchmarks. May be a slow device?

Topic		Replies	Views
Handling node failure on client	4	3812	September 23, 2024
0 downtime configuration Tuning	3	1831	September 8, 2014
Minimal Heartbeat Configuration	2	2277	February 21, 2016
Java client return timeout once one of the nodes is down Client Libraries java	2	860	May 4, 2022
Observing steady increase timeouts from aerospike server error	0	1735	March 10, 2016

Aerospike multisite cluster - resiliency

Related topics