Intermittent high latency


#1

I have a 6-node cluster. Each node is a bare-metal machine with 24 cores, 256 GB RAM and a 10 Gbps network, running CentOS 7.4.1708 (Core) (kernel 4.14.0-1.el7.elrepo.x86_64).

Aerospike config:

service {
        user root
        group root
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        pidfile /var/run/aerospike/asd.pid
        #service-threads 24
        #transaction-queues 24
        transaction-threads-per-queue 4
        proto-fd-max 30000
        transaction-pending-limit 0
        auto-pin cpu
}

logging {
        # Log file must be an absolute path.
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}

network {
        service {
                address int4
                port 3000
                access-address int4
        }

        heartbeat {


                # mode multicast
                # address 224.0.0.116 # 239.1.99.222
                # port 9918

                mode mesh
                port 3002

                mesh-seed-address-port ....
                mesh-seed-address-port ....
                ...
                ...
                ...
                ...

                interval 150
                timeout 10
        }

        fabric {
                address int4
                port 3001
        }

        info {
                port 3003
        }
}


namespace test {
        replication-factor 2
        memory-size 1G
        default-ttl 30d # 30 days, use 0 to never expire/evict.

        storage-engine memory
}

#production namespace
namespace Production {
  replication-factor 2
  memory-size 242G
  default-ttl 0 # 0 means records never expire/evict.
        high-water-disk-pct 50 # How full may the disk become before the
                               # server begins eviction (expiring records
                               # early)
        high-water-memory-pct 85 # How full may the memory become before the
                                 # server begins eviction (expiring records
                                 # early)
        stop-writes-pct 90  # How full may the memory become before
                            # we disallow new writes
        partition-tree-sprigs 4096
        partition-tree-locks 256
  # storage-engine memory
  storage-engine device {
                #device /dev/sdb1
                #data-in-memory false

    file /opt/aerospike/data/prod.data
    filesize 1000G # about 4 times RAM
    data-in-memory true

                #write-block-size 128K   # adjust block size to make it efficient for SSDs.
                # largest size of any object
        }
} 

Total load (in TPS) -

Read: 120-250K

Batch_read : 300-400K

Write: 4-10K (peak 80K)

UDF: 500-1000

Queries: 60-80

Cluster Info:

Aerospike server version: 3.12.1.3

Master Object count: ~ 2B

RAM usage per node: ~60%

There are multiple clients (mostly the latest Go client) that read and write. I have noticed that latency sometimes goes high (seen both in the asloglatency tool and in client-side stats) and then, after some time (a few hours), comes back down to normal without any change on our side. I checked TPS during those periods, but the latency seems independent of it. I couldn't find anything in the logs.

How should I find the root cause? This happens a couple of times every week with no time or load pattern (at least I have not observed one so far). Please suggest.
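
To line up client-side stats with what asloglatency reports, it can help to bucket client latencies the same way. Below is a minimal sketch, assuming the default asloglatency columns (% > 1ms, % > 8ms, % > 64ms). It is written in Python for brevity even though the clients here are Go. The function name and the sample window are hypothetical, not part of any Aerospike tool.

```python
from typing import Dict, Iterable

# Thresholds in milliseconds, chosen to mirror the default asloglatency
# columns (an assumption; adjust if the tool is run with other settings).
THRESHOLDS_MS = (1, 8, 64)


def pct_over_thresholds(latencies_ms: Iterable[float]) -> Dict[int, float]:
    """Return, per threshold, the percentage of samples slower than it."""
    samples = list(latencies_ms)
    if not samples:
        # No traffic in this window: report 0% rather than dividing by zero.
        return {t: 0.0 for t in THRESHOLDS_MS}
    total = len(samples)
    return {
        t: 100.0 * sum(1 for v in samples if v > t) / total
        for t in THRESHOLDS_MS
    }


if __name__ == "__main__":
    # Hypothetical latencies (ms) collected by a client over one time window.
    window = [0.4, 0.7, 1.5, 2.0, 9.0, 0.3, 70.0, 0.5]
    print(pct_over_thresholds(window))
```

Logging these percentages per time window, with a timestamp, makes it easier to tell whether a spike shows up on both the server and the client or only on the client (which would point at the network or the client host).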


#2

The microbenchmarks should help narrow it down: https://www.aerospike.com/docs/operations/monitor/latency/index.html
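
As a rough sketch of how the microbenchmarks could be switched on for the Production namespace during a spike, the helper below only builds the asinfo commands as strings and does not run them. The `enable-benchmarks-*` parameter names are an assumption based on the dynamic namespace settings in server 3.9 and later, so verify them against the page linked above for 3.12.1.3. The helper name `benchmark_commands` is hypothetical.

```python
from typing import List

# Assumed namespace-level microbenchmark switches (server 3.9+);
# check the latency documentation before relying on these names.
BENCHMARKS = (
    "enable-benchmarks-read",
    "enable-benchmarks-write",
    "enable-benchmarks-batch-sub",
    "enable-benchmarks-udf",
    "enable-benchmarks-storage",
)


def benchmark_commands(namespace: str, enable: bool = True) -> List[str]:
    """Build one asinfo set-config command per microbenchmark switch."""
    value = "true" if enable else "false"
    return [
        f'asinfo -v "set-config:context=namespace;id={namespace};{name}={value}"'
        for name in BENCHMARKS
    ]


if __name__ == "__main__":
    # Print the commands to run on each node while latency is high.
    for cmd in benchmark_commands("Production"):
        print(cmd)
```

Calling `benchmark_commands("Production", enable=False)` produces the matching commands to turn the extra histograms off again once enough data has been collected, since they add some logging overhead.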