In a multisite cluster (3 regions, 3 nodes in each region, replication factor 3), one node went offline, which caused failures in the cluster's write workload for approximately 12 seconds. Aerospike then created a temporary replica, promoted another replica to partition master, and the workload processed successfully.
Please advise how to mitigate the number of failures in a multisite cluster.
If a node just dies, not much can be done about the latency involved in (a) the cluster detecting that the node has gone out for good (heartbeat count related) and then (b) the client rediscovering the partition map (1-second tend interval).
Without going into too much detail, if you used the default heartbeat timeout and interval settings (10 and 150 ms), the cluster will detect that a node has left in somewhere around 2.3 to 3.3 seconds. (The cluster does not kick a node out just because a few heartbeats were missed.) Then, worst case, the client will detect the new partition map after 1 second (tend interval). Having said that, 12 seconds seems very long; I could buy 4.5 seconds worst case. What are your heartbeat settings for interval and timeout?
heartbeat {
    mode mesh
    port 3002
    address any
    # mesh-seed-address-port 172.xx.yy.zz 3002
    interval 150
    timeout 10
}
However, if you are bringing a node down for maintenance, you can use the quiesce feature to remove it gracefully without any client-side issues.
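For reference, a minimal sketch of the quiesce flow using asinfo (illustrative; verify the exact steps against the documentation for your server version):

    # On the node being taken down for maintenance:
    asinfo -v 'quiesce:'

    # On any node, to make the cluster rebalance around the quiesced node:
    asinfo -v 'recluster:'

    # Wait for migrations to complete and client traffic to drain off the
    # quiesced node, then stop asd on it. To cancel before reclustering:
    asinfo -v 'quiesce-undo:'

After maintenance, restart the node; it rejoins the cluster and takes back its partitions.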
10 / 250 puts you into more like 3+ to 4+ seconds for the cluster to detect that a node died. So that, combined with the timeout, perhaps explains the higher number you are seeing, but nodes shouldn't just be dying suddenly.
If you are bringing them down for maintenance or testing, then use the quiescing method and clients will not see any transaction timeouts. So which one is your issue: sudden node death, or an operator bringing the node down?
Please note that making the numbers too tight is also counterproductive, as it increases the probability of the cluster falling apart due to network latencies, glitches, etc. affecting the heartbeat signal.
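To make the arithmetic concrete, here is a hypothetical tightening of the stanza above; the values are illustrative only, not a recommendation:

    heartbeat {
        mode mesh
        port 3002
        address any
        # 10 missed intervals of 100 ms = 1 second of silence before the
        # node is suspect; actual eviction takes somewhat longer, and the
        # client may need up to one more tend interval (~1 s) to pick up
        # the new partition map.
        interval 100
        timeout 10
    }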
@pgupta please advise how to avoid locking while performing read operations. We are running a POT where there are no updates in the set; only inserts and reads are performed.
Read locking is killing the performance in this POT.
It's a multisite cluster with strong consistency, as mentioned in this thread.
In the entire read transaction, the only time a read takes a sprig lock is while reading data from the storage device.
Flow diagram, a bit outdated (there is no tt, transaction thread, anymore; everything is handled by st, the service thread, regardless), but still relevant to your question: [flow diagram image]
And the sprig lock is taken briefly during write-master: [diagram image]
For reads and inserts to contend with each other, they have to be happening simultaneously on the same sprig. We have a fixed 256 sprig locks per partition and 4K partitions, so about a million locks. That cannot change even if you increase partition-tree-sprigs. The first 8 bits of the sprig ID (sprig IDs are 8 to 28 bits max) are that sprig's sprig-lock ID. I am really surprised that you are having lock contention between reads and inserts; something else may be going on. Perhaps open a support ticket and they can figure it out from the logs, etc.
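For illustration, a hypothetical namespace stanza bumping partition-tree-sprigs (the value shown is an example, not a recommendation); note the lock count stays fixed regardless:

    namespace ASPOTNS {
        # 256 sprig locks per partition x 4096 partitions = 1,048,576 locks,
        # fixed regardless of this setting. Raising sprigs only spreads
        # records across more, shallower sprigs (must be a power of 2).
        partition-tree-sprigs 4096
        # ... rest of namespace config ...
    }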
We are getting this issue when multiple threads are reading the same rows. Some threads are able to process within 5-8 ms, but others are taking 1200-2000 ms (approximately 7,000 rows read) with 30 concurrent threads.
Please advise how to avoid it. If reads take a lock, the same bottleneck will be there.
This config seems invalid for SC Mode:
May 24 17:02:39 ip-10-3-0-134 asd[1788]: May 24 2024 17:02:39 GMT: WARNING (info): (cfg_info.c:1681) {ASPOTNS} 'read-consistency-level-override' is not applicable with 'strong-consistency'
Configuration parameter read-consistency-level-override does not apply to Strong Consistency mode, so that is a configuration error on your server. It is applicable in AP mode only.
Namespaces with strong-consistency enabled always duplicate-resolve while migrations are ongoing, consulting the different potential versions of a record before returning to the client. This configuration is therefore not available for strong-consistency enabled namespaces.
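Concretely, the offending line should simply be removed from the SC namespace; a sketch (namespace name taken from your warning log):

    namespace ASPOTNS {
        strong-consistency true
        # read-consistency-level-override all   # AP mode only -- remove this line
        # ... rest of namespace config ...
    }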
@pgupta read-page-cache is already enabled, but we are still not getting read response times under 250 ms with 30 threads.
Please advise if there is any other way to handle this.
Please see the read microbenchmarks slide that I shared above. You will have to enable read microbenchmarks and investigate where you are losing time. The microbenchmarks give a breakdown of the time spent in the various read phases internal to the server, which will give insight into the root cause of the latency.
The auto-enabled read macrobenchmark tells you what reads are actually taking at the server, the analogous "service time". (Ignore the transaction threads in the figure; all work is now done by service threads.)
Threading at the application level does not help with hot keys; it is like multiple clients connecting to the server. The same key (record) is served first come, first served (the hot-key problem). By enabling microbenchmarks you can identify which phase of the total read path is the dominant source of latency. First we have to do that; then one can think about making it better.
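A minimal sketch of enabling them dynamically with asinfo (namespace name taken from your log; double-check the parameter name against your server version's docs):

    # Enable read microbenchmark histograms on the namespace:
    asinfo -v 'set-config:context=namespace;id=ASPOTNS;enable-benchmarks-read=true'

    # The log then carries per-phase histograms such as {ASPOTNS}-read-start,
    # {ASPOTNS}-read-local and {ASPOTNS}-read-response, alongside the
    # always-on {ASPOTNS}-read macro histogram.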
In a stable SC cluster, where nodes have not gone in and out, you will not have any duplicate resolution unless you are doing linearized reads (a read-transaction SC-mode policy; I doubt you have set that to true, as the default is false).
These are the default histograms you can see in your logs. First see what the read histogram looks like, and kindly share it.
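One way to pull that histogram out of the log is the bundled asloglatency tool; a sketch (the log path is illustrative):

    # Summarize the {ASPOTNS}-read histogram from the server log:
    asloglatency -l /var/log/aerospike/aerospike.log -h '{ASPOTNS}-read'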