Growing asd process on only one node


#1

I have a six-node Aerospike 3.11.1.1 cluster in production. The memory usage of the asd process is growing steadily, but only on one of the nodes. A second node is also showing growth, but less dramatically:

[Image: aerospike-memory-growth]

The namespace is configured as follows:

namespace intent {
  replication-factor 2
  memory-size 50G
  default-ttl 90d
  high-water-memory-pct 80
  storage-engine device { 
    file /mnt/aerospike-data/intent.dat
    filesize 120G
    data-in-memory true
  }
}
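For reference, the eviction threshold implied by this configuration can be worked out with a small sketch. It assumes that `50G` means 50 binary gigabytes (GiB) and that `high-water-memory-pct` is applied to `memory-size`; the helper name `high_water_bytes` is hypothetical and not part of any Aerospike tool.

```python
GIB = 1024 ** 3  # assumption: the "G" suffix in the config means 2**30 bytes


def high_water_bytes(memory_size_bytes: int, high_water_pct: int) -> int:
    """Return the memory level at which evictions would begin (hypothetical helper)."""
    return memory_size_bytes * high_water_pct // 100


memory_size = 50 * GIB                    # memory-size 50G
hwm = high_water_bytes(memory_size, 80)   # high-water-memory-pct 80

print(f"memory-size:        {memory_size / GIB:.1f} GiB")
print(f"eviction threshold: {hwm / GIB:.1f} GiB")
```

Under these assumptions the namespace should start evicting at 40 GiB of namespace data, well below the 60 GB of RAM on the node.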

The application is using Aerospike as a simple key-value store. Each node has about 250 client connections. All clients are written in Clojure and use the Java API. The writers (a Storm cluster) call get and put with a 90-day expiration. The readers call get. All clients initialize AerospikeClient with an array of all six hosts, so I can’t think of anything special about the node whose memory usage is growing.

Output of asadm -e "info": https://gist.github.com/ljosa/9b3fb5327b266b405d5c5a6760c0f176

As you can see, Aerospike reports using about the same amount of memory, around 24 GB, on all the nodes.

Some memory stats from the node where memory consumption is growing:

matching-aerospike-r3-2xl-0:~$ ps up $(pidof asd)
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      9975  8.4 87.4 57901740 54996076 ?   Ssl   2017 9885:38 /usr/bin/asd --config-file /etc/aerospike/aerospike.conf --fgdaemon
matching-aerospike-r3-2xl-0:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:          61440       54220         854        1598        6365        5456
Swap:             0           0           0
matching-aerospike-r3-2xl-0:~$ ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
matching-aerospike-r3-2xl-0:~$ 
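To make the gap concrete, here is a rough sketch that converts the RSS column from the `ps` output above into GiB and compares it with the figures mentioned in this post. It assumes the roughly 24 GB reported by asadm can be treated as 24 GiB and that `memory-size 50G` means 50 GiB; the function name `rss_bytes` is made up for illustration.

```python
KIB = 1024
GIB = 1024 ** 3

# Data row copied from the `ps up $(pidof asd)` output above.
PS_LINE = (
    "root      9975  8.4 87.4 57901740 54996076 ?   Ssl   2017 9885:38 "
    "/usr/bin/asd --config-file /etc/aerospike/aerospike.conf --fgdaemon"
)


def rss_bytes(ps_line: str) -> int:
    """Return the resident set size in bytes; RSS is the 6th column, in KiB."""
    fields = ps_line.split()
    return int(fields[5]) * KIB


rss = rss_bytes(PS_LINE)
reported_used = 24 * GIB   # assumption: "around 24 GB" from asadm, taken as GiB
configured = 50 * GIB      # assumption: memory-size 50G, taken as GiB

print(f"asd RSS:               {rss / GIB:.2f} GiB")
print(f"above Aerospike usage: {(rss - reported_used) / GIB:.2f} GiB")
print(f"above memory-size:     {(rss - configured) / GIB:.2f} GiB")
```

With these numbers the process holds more than twice what Aerospike itself reports for the namespace, and it already exceeds the configured memory-size.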

Can someone please help me debug this? This has happened around three times in the last year, and we have “solved” it by restarting asd on the affected node. But this time, I’d like to get to the bottom of it. Are there any known bugs in version 3.11.1.1 that are relevant, or anything about our usage pattern that could trigger the funny behavior? Thanks!


#2

I scanned our release notes and don’t see any mention of memory leaks discovered since the one that was fixed in the build you are currently running.

If this is a leak, and the leak is still present in the latest versions of the server, the server has since been equipped with a new set of tools for determining the source of the leak.

Released in 3.14.1.1: [AER-5683] - (KVS) New memory subsystem with configurable allocation debugging.


#3

Thanks! I also see this in the changelog for 3.12.0: “[AER-5526] - (SMD) Memory leak on principal during merge.” Do you think that one is irrelevant to my symptoms?

I tried to look at https://github.com/aerospike/aerospike-server/commit/4becb02d2e514ff5f1a663d8b854e2f65f9a75bb, but it appears that the repo has been force-pushed and all history removed. Is that intentional?

I will try to upgrade to 3.16.0.6. I’ll report back, but may have to watch it for quite some time first since the problem takes a good while to manifest.


#4

A leak in SMD would be quite slow and would only appear after a rebalance. This particular bug only affected the principal node, and your cluster’s principal reports an uptime of 8497:40:55 (354 days).
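The uptime conversion can be checked with a short sketch. It assumes the uptime string is in hours:minutes:seconds form; `uptime_days` is a hypothetical helper, not an Aerospike function.

```python
def uptime_days(hms: str) -> float:
    """Convert an 'H:MM:SS' uptime string into days (hypothetical helper)."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return total_seconds / 86400  # 86400 seconds per day


days = uptime_days("8497:40:55")
print(f"{days:.2f} days")
```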


#5

Hey Ljosa.

Sorry about the force push, but it was intentional. We took the liberty at 4.0, because we were dragging around years of history, and the Strong Consistency available in Enterprise is a big deal.

Although this creates an annoyance for you at 3.16, we only did it at the 4.0 break. Hopefully you’ll move not just to 3.16 but to 4.0. The CE changes are slight, and full compatibility is there. The water’s fine in 4.0! Come on in.