Perf degradation after Aerospike migration


#1

We are facing performance degradation after a migration:

From Aerospike 3.11 to 3.15

From non-NVMe to NVMe

From IDE to VirtIO

The migration plan was to take an asbackup and restore it into the new cluster, then start using the new cluster. Since the new cluster went into use, we have seen a 40% performance degradation on our POST calls.

Before Migration Response:

After Migration Response:

old node config:

    user root

    group root

    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.

    pidfile /var/run/aerospike/asd.pid

    service-threads 16

    transaction-queues 16

    transaction-threads-per-queue 4

    migrate-threads 32

    migrate-xmit-priority 0

    migrate-read-priority 0

    migrate-xmit-hwm 150

    migrate-xmit-lwm 100

    proto-fd-max 15000

    proto-fd-idle-ms 7200000

New node config:

    user root

    group root

    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.

    pidfile /var/run/aerospike/asd.pid

    service-threads 16

    transaction-queues 16

    transaction-threads-per-queue 4

    migrate-threads 16

    proto-fd-max 15000

    proto-fd-idle-ms 7200000

#2

Looks like the latency has gone back to normal. Any reason or explanation for this?


#3

No clear reason from the information you have provided.

A possible candidate is that you were often reading/updating records from the write cache via the post-write-queue. In this scenario, you would have lost your write cache in the migration and would observe higher latencies until your more frequently updated/read keys were back in the post-write-queue. You can check this in your logs by looking for cache-read-pct:

Feb 12 2018 22:04:18 GMT: INFO (info): (ticker.c:536) {test} device-usage: used-bytes 881152 avail-pct 99 cache-read-pct 100.00

If your use case benefits significantly from the post-write-queue, you may want to consider increasing its size.
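As a quick sketch of that log check, you can pull the cache-read-pct values out with grep and awk. The log file name here is an assumption; point it at wherever your Aerospike log lives:

```shell
# Sketch: extract cache-read-pct values from an Aerospike log.
# Log path is an assumption; the line format matches the ticker example above.
LOG=aerospike.log.sample

# For demonstration, write a sample ticker line like the one quoted above.
printf '%s\n' 'Feb 12 2018 22:04:18 GMT: INFO (info): (ticker.c:536) {test} device-usage: used-bytes 881152 avail-pct 99 cache-read-pct 100.00' > "$LOG"

# Print just the cache-read-pct values, one per ticker line.
grep -o 'cache-read-pct [0-9.]*' "$LOG" | awk '{print $2}'
```

A value near 100 means most reads are being served from the post-write-queue; a drop after the migration would support this theory.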


#4

Nope. I didn’t find anything like that in any of the logs on the new cluster.


#5

Could you share your namespace config? I’m specifically interested in data-in-memory, but even if that is false, the other details could be useful. If data-in-memory is true, then the post-write-queue is unnecessary and disabled.


#6
namespace roger {
    memory-size 28G
    replication-factor 2
    default-ttl 0
    storage-engine device {
        file /var/lib/aerospike/roger.dat
        filesize 116G
        data-in-memory true
        write-block-size 128K
        defrag-lwm-pct 50
        defrag-startup-minimum 10
    }
}

#7

OK, so data-in-memory is true, so the cache is disabled.

Two things are abnormal in this config:

  1. Typically, NVMe deployments don’t use data-in-memory. You should try benchmarking without it; that would save on RAM costs. You could also bump your post-write-queue setting to gain more performance on frequently accessed keys.
  2. Typically, NVMe deployments don’t use a file for persistence. Normally the disk or partition is configured with device, which bypasses the filesystem altogether. Benchmarking this is trickier because a small dataset could be fully cached in the filesystem’s cache (by the way, this could be the cause of the performance reduction you observed, since that cache would have started cold after the restore). Not using the filesystem cache will provide a more predictable latency profile. We have also seen network problems caused by the filesystem cache using all of the RAM: the kernel will free cache as needed, but it cannot do so within a network interrupt.
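A raw-device variant of that namespace config might look like the sketch below. The device path and the post-write-queue value are illustrative assumptions, not tested recommendations for this hardware:

```
namespace roger {
    memory-size 28G
    replication-factor 2
    default-ttl 0
    storage-engine device {
        # Raw NVMe partition instead of a file, bypassing the filesystem cache.
        device /dev/nvme0n1p1       # illustrative device path
        data-in-memory false
        write-block-size 128K
        defrag-lwm-pct 50
        # A larger post-write-queue keeps more recently written blocks cached.
        post-write-queue 1024       # illustrative value; the server default is 256
    }
}
```

With data-in-memory false, reads are served from the device (or the post-write-queue), so it is worth benchmarking this against your current file-backed setup before switching.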