I have a six-node Aerospike 3.11.1.1 cluster in production. The memory footprint of the asd process is steadily growing, but only on one of the nodes. A second node is also showing growth, but less dramatically.
The namespace is configured as follows:
namespace intent {
    replication-factor 2
    memory-size 50G
    default-ttl 90d
    high-water-memory-pct 80

    storage-engine device {
        file /mnt/aerospike-data/intent.dat
        filesize 120G
        data-in-memory true
    }
}
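For what it's worth, my reading of this config is that the data lives on the file-backed device and is also kept fully in RAM (data-in-memory true), and with high-water-memory-pct 80 eviction should only begin at 80% of the 50 GB memory-size, i.e. around 40 GB, which is well above the ~24 GB each node currently reports.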
The application uses Aerospike as a simple key-value store. Each node has about 250 client connections. All clients are written in Clojure and use the Java API. The writers (a Storm cluster) call get and put with a 90-day expiration; the readers call get. All clients initialize AerospikeClient with an array of all six hosts, so I can't think of anything special about the node whose memory usage is growing. A minimal sketch of what the clients do is below.
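Here is a rough Java sketch of the client usage (host names, set, and bin names are made up for illustration; our real clients are Clojure wrappers around these same Java API calls):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Host;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.WritePolicy;

public class IntentClientSketch {
    public static void main(String[] args) {
        // All six cluster nodes are passed to the client (host names are placeholders).
        Host[] hosts = new Host[] {
            new Host("aerospike-1", 3000), new Host("aerospike-2", 3000),
            new Host("aerospike-3", 3000), new Host("aerospike-4", 3000),
            new Host("aerospike-5", 3000), new Host("aerospike-6", 3000)
        };
        AerospikeClient client = new AerospikeClient(null, hosts);

        // Writers (Storm): put with a 90-day expiration, expressed in seconds.
        WritePolicy writePolicy = new WritePolicy();
        writePolicy.expiration = 90 * 24 * 60 * 60;
        Key key = new Key("intent", null, "example-user-id");
        client.put(writePolicy, key, new Bin("value", "example-payload"));

        // Readers: plain get with the default read policy.
        Record record = client.get(null, key);
        System.out.println(record);

        client.close();
    }
}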
Output of asadm -e "info": asadm -e "info" · GitHub
As you can see, Aerospike reports using about the same amount of memory, around 24 GB, on all the nodes.
Some memory stats from the node where memory consumption is growing:
matching-aerospike-r3-2xl-0:~$ ps up $(pidof asd)
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 9975 8.4 87.4 57901740 54996076 ? Ssl 2017 9885:38 /usr/bin/asd --config-file /etc/aerospike/aerospike.conf --fgdaemon
matching-aerospike-r3-2xl-0:~$ free -m
total used free shared buff/cache available
Mem: 61440 54220 854 1598 6365 5456
Swap: 0 0 0
matching-aerospike-r3-2xl-0:~$ ipcs -m
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
matching-aerospike-r3-2xl-0:~$
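In other words, the asd RSS of 54,996,076 KiB is about 52.4 GiB, more than double the ~24 GB that Aerospike itself reports for this node, so roughly 28 GB of the process footprint is unaccounted for.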
Can someone please help me debug this? It has happened around three times in the last year, and each time we have "solved" it by restarting asd on the affected node. This time, though, I'd like to get to the bottom of it. Are there any known bugs in version 3.11.1.1 that are relevant, or anything about our usage pattern that could trigger this behavior? Thanks!