System_free_mem_pct below what AMC shows


Hi all,

I'm a new Aerospike user here in the community, and I need some help! I've been running into a memory issue on my cluster: Aerospike reports that the service is using about 50% of RAM, but system-level metrics (system_free_mem_pct) suggest usage is in the realm of 60%+.

Has anyone run into this before? The only way I've been able to get memory usage back down is to cycle the Aerospike service on the nodes about once every 15 days. If I'm doing something drastically wrong here, please let me know!


Aerospike reports that it is using 50% of the RAM you have given it. The system (your actual server, not Aerospike) still has 60% available. These are separate metrics; I think you're treating them as parts of the same measurement and wondering why they don't add up.
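To see where the OS-side number comes from, here's a rough shell sketch of a free-memory percentage computed from /proc/meminfo. The exact fields Aerospike's system_free_mem_pct uses may differ by version; the point is only that this is OS-level accounting, separate from the namespace's own memory stats:

```shell
# Rough approximation of the OS-level view: free system memory as a
# percent of total, read from /proc/meminfo. Aerospike's exact formula
# for system_free_mem_pct may differ -- this just shows it is derived
# from the OS, not from the namespace accounting AMC displays.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
avail_kb=$(awk '/^MemFree:|^Buffers:|^Cached:/ {sum += $2} END {print sum}' /proc/meminfo)
pct=$(( 100 * avail_kb / total_kb ))
echo "approx system free mem: ${pct}%"
```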


I see what you are saying. However, when I check how much memory the asd process is actually using on the server, it's closer to the 60% mark, not 50%.


The 50% mark is data stored in memory. There is some overhead to running Aerospike that isn't allocated directly to storing data; that would be my guess. Unfortunately, they don't expose metrics on how much overhead the daemon uses beyond the memory data-allocation stats.


That is unfortunate. Currently, "data stored in memory" is showing 58% while the system is showing 65%. Do you know if Aerospike cleans up overhead when it reaches a dangerous level (i.e., approaching a service crash)? Also, my HWM for memory is set to 70%; this is usually about the point where I'd cycle the service on the node to free up memory.


What you tell Aerospike in terms of RAM for a namespace is what Aerospike goes by for percent of memory used, HWM, and so on for that namespace. For example, if you have 2 GB of actual RAM on your server and you configure 8 GB of RAM in a namespace, Aerospike will not check your config against your hardware. In that case, if you are storing data in RAM, Aerospike will keep consuming RAM believing it has 8 GB available, and you will crash the node. The burden is on you to allocate namespace RAM that is physically available above and beyond what the Aerospike process and other system processes consume. The namespace RAM must account for the data (if stored in RAM), the primary indexes, and any secondary indexes. See the capacity-planning page on the Aerospike website.
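For reference, here is a minimal, purely illustrative aerospike.conf fragment showing where that namespace RAM budget is declared; the namespace name, sizes, and device path are made up:

```
namespace test {
    replication-factor 2
    memory-size 8G        # Aerospike budgets against this, not physical RAM
    default-ttl 30d

    storage-engine device {
        device /dev/sdb   # data on SSD; primary indexes stay in RAM
        write-block-size 128K
    }
}
```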


Right, I understand that. However, my concern is that the system memory Aerospike is actually using (including overhead) might go beyond where my HWM is set. Currently we only store indexes in RAM; data is on SSD.


HWM applies to the namespace. Once you hit the HWM on a namespace on a node, Aerospike will start evicting master data closest to expiration on that node (and the corresponding prole, i.e. replica, copies on other nodes). If that does not help, you will eventually hit stop-writes: Aerospike will put that namespace's data into read-only mode.
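For what it's worth, both thresholds are per-namespace settings in aerospike.conf. A sketch with illustrative values (70% matches the HWM you described; 90% is the usual stop-writes default, but check your version's docs):

```
namespace test {
    memory-size 4G
    high-water-memory-pct 70   # evictions of soonest-to-expire data begin here
    stop-writes-pct 90         # past this, the namespace refuses writes
}
```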


Okay, we only have one namespace. So what you are saying is that as long as we don't allocate more RAM in our Aerospike config than we physically have, we should never actually crash. We may hit read-only mode due to capacity restrictions, but never a full-on crash.


Physically, your server should have RAM for the OS, all other applications/daemons you may run on it, at least 2 GB for the Aerospike daemon, plus whatever RAM you allocate in the Aerospike config file for each namespace. If you have a single namespace with, say, 4 GB of RAM specified (the default), then that much RAM should be available on top of the OS and daemons. Give yourself some headroom, too. So if nothing else is running on the server besides Aerospike, I would say you could run a single 4 GB namespace defined in the config file on an 8 GB machine.
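That sizing rule is just arithmetic; here is a throwaway shell sketch of the 8 GB example above (all numbers illustrative):

```shell
# Hypothetical sizing check: physical RAM minus the namespace allocation
# minus a rough OS + Aerospike daemon budget leaves your headroom.
total_gb=8       # physical RAM on the box
ns_gb=4          # memory-size configured for the single namespace
os_daemon_gb=2   # rough budget for OS + Aerospike daemon overhead
headroom_gb=$(( total_gb - ns_gb - os_daemon_gb ))
echo "headroom: ${headroom_gb} GB"   # → headroom: 2 GB
```

If headroom comes out at or below zero, the namespace is oversubscribed against physical RAM, which is the crash scenario described above.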

Regarding the other part, about crashing: the one thing you must also tune is the namespace supervisor (nsup) thread period. By default it runs every two minutes, round-robin over all namespaces; that is the thread that detects HWM violations. Grep for thr_nsup in /var/log/aerospike/aerospike.log, check the total time it takes to run over each namespace, and make sure your nsup period is longer than that. Also make sure your rate of record updates cannot blow past the HWM within one nsup period, before nsup has a chance to run.
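The nsup period is a service-context setting in aerospike.conf; an illustrative fragment (the parameter name and default vary by server version, so check your version's docs):

```
service {
    # grep thr_nsup /var/log/aerospike/aerospike.log to see how long each
    # pass actually takes, then keep this comfortably above that time.
    nsup-period 120   # seconds between namespace supervisor passes (default)
}
```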