Growing asd process on only one node

ljosa · March 19, 2018, 7:09pm

I have a six-node Aerospike 3.11.1.1 cluster in production. The asd process is steadily growing, but only on one of the nodes. A second node is also showing growth, but less dramatic:

[Image: aerospike-memory-growth]

The namespace is configured as follows:

namespace intent {
  replication-factor 2
  memory-size 50G
  default-ttl 90d
  high-water-memory-pct 80
  storage-engine device { 
    file /mnt/aerospike-data/intent.dat
    filesize 120G
    data-in-memory true
  }
}
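As a rough illustration of what these settings imply, the sketch below computes the memory level at which the high-water mark would be crossed. It assumes that the `G` suffix in `memory-size` means GiB (2^30 bytes); the function name is made up for this example.

```python
GIB = 1024 ** 3  # assumption: the config suffix "G" is binary (2^30 bytes)


def high_water_bytes(memory_size_bytes, high_water_pct):
    """Return the byte count at which high-water-memory-pct is reached."""
    return memory_size_bytes * high_water_pct // 100


memory_size = 50 * GIB                         # memory-size 50G
threshold = high_water_bytes(memory_size, 80)  # high-water-memory-pct 80

print(threshold / GIB)  # 40.0
```

Under that assumption, the namespace would cross its high-water mark at 40 GiB, well above the roughly 24 GB that Aerospike reports per node.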

The application is using Aerospike as a simple key-value store. Each node has about 250 client connections. All clients are written in Clojure and use the Java API. The writers (a Storm cluster) call get and put with a 90-day expiration. The readers call get. All clients initialize AerospikeClient with an array of all six hosts, so I can’t think of anything that is special about the node whose memory usage is growing.

Output of asadm -e "info": asadm -e "info" · GitHub. As you can see, Aerospike reports using about the same amount of memory, around 24 GB, on all the nodes.

Some memory stats from the node where memory consumption is growing:

matching-aerospike-r3-2xl-0:~$ ps up $(pidof asd)
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      9975  8.4 87.4 57901740 54996076 ?   Ssl   2017 9885:38 /usr/bin/asd --config-file /etc/aerospike/aerospike.conf --fgdaemon
matching-aerospike-r3-2xl-0:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:          61440       54220         854        1598        6365        5456
Swap:             0           0           0
matching-aerospike-r3-2xl-0:~$ ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
matching-aerospike-r3-2xl-0:~$
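To put the numbers above on one scale, here is a small sketch that converts the `ps` and `free -m` figures to GiB. It assumes `ps` reports VSZ and RSS in KiB and `free -m` reports MiB; the variable names are only for this example.

```python
KIB_PER_GIB = 1024 * 1024

rss_kib = 54996076  # RSS column from ps
vsz_kib = 57901740  # VSZ column from ps
total_mib = 61440   # "total" column from free -m

rss_gib = rss_kib / KIB_PER_GIB
vsz_gib = vsz_kib / KIB_PER_GIB
rss_share = rss_kib / (total_mib * 1024)  # fraction of physical memory

print(f"RSS {rss_gib:.2f} GiB, VSZ {vsz_gib:.2f} GiB, {rss_share:.1%} of RAM")
# RSS 52.45 GiB, VSZ 55.22 GiB, 87.4% of RAM
```

The computed share matches the 87.4 in the %MEM column. The resident set of about 52 GiB is more than twice the roughly 24 GB that Aerospike itself reports for this node.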

Can someone please help me debug this? This has happened around three times in the last year, and we have “solved” it by restarting asd on the affected node. But this time, I’d like to get to the bottom of it. Are there any known bugs in version 3.11.1.1 that are relevant, or anything about our usage pattern that could trigger the funny behavior? Thanks!

kporter · March 20, 2018, 12:59am

I scanned our release notes and don’t see any mention of memory leaks being discovered since one was fixed in the build you are presently using.

If this is a leak and it is still present in the latest versions of the server, the server has since been equipped with a new set of tools to determine the source of the leak.

Released in 3.14.1.1: [AER-5683] - (KVS) New memory subsystem with configurable allocation debugging.

ljosa · March 21, 2018, 4:14pm

Thanks! I also see this in the changelog for 3.12.0: “[AER-5526] - (SMD) Memory leak on principal during merge.” Do you think that one is irrelevant to my symptoms?

I tried to look at https://github.com/aerospike/aerospike-server/commit/4becb02d2e514ff5f1a663d8b854e2f65f9a75bb, but it appears that the repo has been force pushed and all history removed. Is that intentional?

I will try to upgrade to 3.16.0.6. I’ll report back, but may have to watch it for quite some time first since the problem takes a good while to manifest.
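One way to watch for slow growth over a long period is to log the resident set size of asd at intervals. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/status` contains a `VmRSS` line; the function names are hypothetical and the demo samples the script's own process rather than asd.

```python
import os
import re
import time
from pathlib import Path


def parse_vm_rss_kib(status_text):
    """Return VmRSS in KiB from the text of /proc/<pid>/status, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB\s*$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read the current RSS of a process; None if /proc is unavailable."""
    try:
        return parse_vm_rss_kib(Path(f"/proc/{pid}/status").read_text())
    except OSError:
        return None


def log_rss(pid, samples, interval_s):
    """Print one timestamped RSS reading per sample."""
    for i in range(samples):
        print(f"{int(time.time())} pid={pid} rss_kib={read_rss_kib(pid)}")
        if i + 1 < samples:
            time.sleep(interval_s)


if __name__ == "__main__":
    # Demo on this script's own PID; substitute the asd PID in practice.
    log_rss(os.getpid(), samples=1, interval_s=0)
```

Running something like this from cron or a long-lived loop on each node, and comparing the series across nodes, would show whether growth resumes on only one node after the upgrade.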

kporter · March 22, 2018, 8:32pm

A leak in SMD would be quite slow and only appear after a rebalance. This particular bug only affected the principal node, and your cluster’s principal reports an uptime of 8497:40:55 (354 days).
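The uptime conversion can be checked with a short sketch. It assumes the uptime string is in hours:minutes:seconds, as the figure quoted above suggests; `parse_uptime` is a name invented for this example.

```python
from datetime import timedelta


def parse_uptime(text):
    """Convert an 'H:M:S' uptime string (hours may exceed 24) to a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


uptime = parse_uptime("8497:40:55")
print(uptime.days)  # 354
print(uptime)       # 354 days, 1:40:55
```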

bbulkow · March 22, 2018, 8:47pm

Hey Ljosa.

Sorry about the force push, but it was intentional. We took the liberty at 4.0, because we were dragging around years of history, and the Strong Consistency available in Enterprise is a big deal.

Although this creates an annoyance for you at 3.16, we just did it at the 4.0 break. Hopefully you’ll move not just to 3.16 but to 4.0; the CE changes are slight and full compatibility is there. The water’s fine in 4.0! Come on in.
