Aerospike latency issues


We have Aerospike running on SoftLayer bare-metal machines in a 2-node cluster. Our average profile size is 1.5 KB, and at peak each node handles around 6,000 ops/sec. The latencies are fine: at peak, around 5% of operations exceed 1 ms.

Now we planned to migrate to AWS, so we booted two i3.xlarge machines. We ran the benchmark with the 1.5 KB object size at 3x the load; results were satisfactory, around 4-5% of operations over 1 ms. But when we started actual processing, latencies at peak jumped to 25-30% over 1 ms, and the maximum each node could accommodate was about 5K ops/sec. So we added one more node and ran the benchmark again (4.5 KB object size, 3x load); the results were 2-4% over 1 ms. After adding it to the cluster, the peak came down to 16-22%. We added one more node, and the peak is now at 10-15%.
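For reference, a benchmark run approximating this workload can be sketched with the Aerospike Java client's benchmark tool. This is an assumption about how the load was generated (the post doesn't say which tool was used); the host, port, namespace, and key count below are placeholders:

```shell
# Hypothetical invocation of the Aerospike Java client benchmark script.
# -o S:1500 requests ~1.5 KB string objects; -w RU,50 is a 50/50 read/update mix.
./run_benchmarks -h h1 -p 13000 -n XXXX \
  -k 10000000 \
  -o S:1500 \
  -w RU,50 \
  -z 64 \
  -latency 7,1
```

Sharing the exact invocation used would make the benchmark results easier to compare with the production numbers.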

The version on AWS is aerospike-server-community-; the version on SoftLayer is Aerospike Enterprise Edition 3.6.3.

Our config is as follows:

# Aerospike database configuration file.

service {
  user xxxxx
  group xxxxx
  paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
  pidfile /var/run/aerospike/
  service-threads 8
  transaction-queues 8
  transaction-threads-per-queue 8
  proto-fd-max 15000
}

logging {
  # Log file must be an absolute path.
  file /var/log/aerospike/aerospike.log {
    context any info
  }
}

network {
  service {
    port 13000
    address h1
  }

  heartbeat {
    mode mesh
    port 13001
    address h1
    mesh-seed-address-port h1 13001
    mesh-seed-address-port h2 13001
    mesh-seed-address-port h3 13001
    mesh-seed-address-port h4 13001
    interval 150
    timeout 10
  }

  fabric {
    port 13002
    address h1
  }

  info {
    port 13003
    address h1

namespace XXXX {
  replication-factor 2
  memory-size 27G
  default-ttl 10d
  high-water-memory-pct 70
  high-water-disk-pct 60
  stop-writes-pct 90
  storage-engine device {
    device /dev/nvme0n1
    scheduler-mode noop
    write-block-size 128K
  }
}

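One variable worth isolating is the raw NVMe device itself, since the i3 instance storage may behave differently from the bare-metal drives. A destructive raw-device check with fio (run only on a device with no data you need; the job parameters below are illustrative, chosen to mirror the 128K write-block-size) might look like:

```shell
# WARNING: writes directly to the device and destroys its contents.
fio --name=aerospike-sim --filename=/dev/nvme0n1 --direct=1 \
    --rw=randrw --rwmixread=50 --bs=128k --iodepth=32 \
    --runtime=60 --time_based --group_reporting
```

If the device-level latency distribution already looks worse than on bare metal, the problem is below Aerospike; if it looks clean, the focus shifts to the server and network configuration.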
What should be done to bring down latencies in AWS?

I think the histograms would be a good place to start, but from what you've described I'm not quite sure. One thing I did notice is that you have these thread settings defined; any reason you specifically wanted 8 transaction-threads-per-queue? How are you benchmarking, and can you share more of those results? How does the benchmark test differ from the other latency you reported? Is the app's distance from the cluster being considered?
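To pull those histograms, the Aerospike tools package is the usual route. Exact command syntax and output format vary by server/tools version, so treat these as a sketch:

```shell
# Server-side latency buckets per node (read/write, >1ms / >8ms / >64ms columns).
asadm -e "show latency"

# Replay a named histogram from the server log; here the 'reads' histogram
# over the most recent window (flags may differ between tools versions).
asloglatency -h reads
```

Posting the output of these for a peak-traffic window, alongside the benchmark-run numbers, would make it much easier to see whether the extra latency is in the storage layer, the service threads, or the network.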