The migration plan was to take an asbackup, restore it into the new cluster, and start using the new cluster. Since the new cluster went into use, we have been seeing a 40% performance degradation on our POST calls.
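For reference, the backup/restore step looked roughly like this (host names and paths are placeholders; add credentials/TLS flags as your cluster requires):

```shell
# Back up the old cluster's namespace to local files, then restore into the new cluster.
asbackup --host old-cluster-node --namespace test --directory /backup/test
asrestore --host new-cluster-node --directory /backup/test
```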
user root
group root
paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
pidfile /var/run/aerospike/asd.pid
service-threads 16
transaction-queues 16
transaction-threads-per-queue 4
migrate-threads 32
migrate-xmit-priority 0
migrate-read-priority 0
migrate-xmit-hwm 150
migrate-xmit-lwm 100
proto-fd-max 15000
proto-fd-idle-ms 7200000
New node config:
user root
group root
paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
pidfile /var/run/aerospike/asd.pid
service-threads 16
transaction-queues 16
transaction-threads-per-queue 4
migrate-threads 16
proto-fd-max 15000
proto-fd-idle-ms 7200000
No clear reason from the information you have provided.
A possible candidate is that you are often reading/updating records from the write cache via the post-write-queue. In this scenario, you would have lost your write cache in the migration and would observe higher latencies until your more frequently updated/read keys were back in the post-write-queue. You can check this in your logs by looking for cache-read-pct:
Feb 12 2018 22:04:18 GMT: INFO (info): (ticker.c:536) {test} device-usage: used-bytes 881152 avail-pct 99 cache-read-pct 100.00
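A quick grep pulls the value out of the ticker lines; the sample log line is inlined here for illustration, but in practice point grep at your aerospike.log (commonly /var/log/aerospike/aerospike.log, adjust to your setup):

```shell
# Extract cache-read-pct from a device-usage ticker line.
echo 'Feb 12 2018 22:04:18 GMT: INFO (info): (ticker.c:536) {test} device-usage: used-bytes 881152 avail-pct 99 cache-read-pct 100.00' \
  | grep -o 'cache-read-pct [0-9.]*'
```

A value near 100 means nearly all reads are being served from the post-write-queue; a sharp drop after the migration would support this theory.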
If your use-case is benefiting significantly from the post-write-queue, you may want to consider increasing its size.
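As a sketch, post-write-queue is set per namespace under the device storage engine; the namespace name and value below are illustrative (I'm assuming the default of 256 write blocks):

```
namespace test {
    storage-engine device {
        # post-write-queue is sized in write blocks; default is 256.
        # Larger values keep more recently written records in RAM.
        post-write-queue 1024
    }
}
```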
Could you share your namespace config? I'm specifically interested in data-in-memory, but if that is false, other details could be useful. If data-in-memory is true, then the post-write-queue is unnecessary and disabled.
Ok, so data-in-memory was true, so the cache is disabled.
Two things are abnormal in this config:
Typically, NVMe deployments don’t use data-in-memory. You should try benchmarking without it; it would save on RAM costs. You could also bump your post-write-queue setting to gain more performance on frequently accessed keys.
Typically, NVMe deployments don’t use a file for persistence. Normally the disk or partition is configured with device, which bypasses the filesystem altogether. Benchmarking this is trickier because a small dataset could be fully cached in the filesystem’s cache (by the way, this could be the cause of the performance reduction you observed). Not using the filesystem cache will provide a more predictable latency profile. We have also seen network problems caused by the filesystem cache using all of the RAM: the kernel will free cache as needed, but it cannot do so within a network interrupt.
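A raw-device setup replaces the file/filesize pair with a device line; this is only a sketch, and the namespace name, device path, and sizes are placeholders for your hardware:

```
namespace test {
    replication-factor 2
    memory-size 8G
    storage-engine device {
        # Raw partition instead of file + filesize: bypasses the filesystem and its cache.
        device /dev/nvme0n1p1
        write-block-size 128K
    }
}
```

With data-in-memory false and a raw device, the post-write-queue then becomes the relevant knob for hot keys.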