I have a 6-node cluster. Each node is a bare-metal machine with 24 cores, 256 GB RAM, and a 10 Gbps network, running CentOS 7.4.1708 (Core) (kernel 4.14.0-1.el7.elrepo.x86_64).
Aerospike config:
service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    pidfile /var/run/aerospike/asd.pid
    #service-threads 24
    #transaction-queues 24
    transaction-threads-per-queue 4
    proto-fd-max 30000
    transaction-pending-limit 0
    auto-pin cpu
}
logging {
    # Log file must be an absolute path.
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}
network {
    service {
        address int4
        port 3000
        access-address int4
    }
    heartbeat {
        # mode multicast
        # address 224.0.0.116 # 239.1.99.222
        # port 9918
        mode mesh
        port 3002
        mesh-seed-address-port ....
        mesh-seed-address-port ....
        ...
        ...
        ...
        ...
        interval 150
        timeout 10
    }
    fabric {
        address int4
        port 3001
    }
    info {
        port 3003
    }
}
namespace test {
    replication-factor 2
    memory-size 1G
    default-ttl 30d # 30 days, use 0 to never expire/evict.
    storage-engine memory
}
#production namespace
namespace Production {
    replication-factor 2
    memory-size 242G
    default-ttl 0 # use 0 to never expire/evict.
    high-water-disk-pct 50 # How full may the disk become before the
                           # server begins eviction (expiring records early)
    high-water-memory-pct 85 # How full may the memory become before the
                             # server begins eviction (expiring records early)
    stop-writes-pct 90 # How full may the memory become before
                       # we disallow new writes
    partition-tree-sprigs 4096
    partition-tree-locks 256
    # storage-engine memory
    storage-engine device {
        #device /dev/sdb1
        #data-in-memory false
        file /opt/aerospike/data/prod.data
        filesize 1000G # roughly 4x RAM
        data-in-memory true
        #write-block-size 128K # adjust block size to make it efficient for SSDs;
                               # must be at least the largest size of any object
    }
}
Total load (in TPS):
Read: 120-250K
Batch_read: 300-400K
Write: 4-10K (peak 80K)
UDF: 500-1000
Queries: 60-80
Cluster Info:
Aerospike server version: 3.12.1.3
Master object count: ~2B
RAM usage per node: ~60%
There are multiple clients (mostly the latest Go client) that read and write. I have noticed that latency sometimes spikes (both in the asloglatency tool and in client-side stats) and after some time (a few hours) it comes back to normal without any change on my side. I have checked TPS during those periods, but the spikes seem independent of it. I couldn't find anything in the logs.
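For context, the client-side stats I mentioned come from timing individual calls, roughly like the sketch below (a minimal example against aerospike-client-go; the seed address, set name, key, thresholds, and the policy Timeout field are placeholders/assumptions and may differ with the client version):

// Minimal sketch of how the Go clients time individual reads so that slow
// calls get logged with timestamps. Package path and the policy Timeout
// field are from aerospike-client-go v1.x; newer client versions may name
// these differently.
package main

import (
    "log"
    "time"

    as "github.com/aerospike/aerospike-client-go"
)

func main() {
    // Any cluster node works as a seed; the address here is a placeholder.
    client, err := as.NewClient("10.0.0.1", 3000)
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    policy := as.NewPolicy()
    policy.Timeout = 50 * time.Millisecond // overall transaction timeout

    // Namespace matches the config above; set and key are placeholders.
    key, err := as.NewKey("Production", "mySet", "some-key")
    if err != nil {
        log.Fatal(err)
    }

    start := time.Now()
    _, err = client.Get(policy, key)
    elapsed := time.Since(start)
    if err != nil {
        log.Printf("read failed after %v: %v", elapsed, err)
    } else if elapsed > 10*time.Millisecond {
        // Anything slower than 10 ms is logged so spikes show up in client logs.
        log.Printf("slow read: %v", elapsed)
    }
}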
How should I find the root cause? This happens a couple of times every week with no time/load pattern (at least I have not observed one so far). Please suggest.