Very High memory usage of Aerospike Server while performing repetitive MAP operations via UDF

Aerospike Server - 3.16.0.6 Aerospike Client (C) - 4.3.6 Storage used - In Memory

Record - One integer bin and one MAP bin

RUN 1 - Using UDF, adding new MAP entries. Each map entry is of type string, having 500 length. This UDF was called for 1500 times, so at the end, MAP has 1500 entries.

I am observing following -

  • TOP command reports memory utilisation for Aerospike to be about 850 MB. Initial (just before starting UDF operations) memory utilisation of Aerospike was about 104MB.
  • STAT NAMESPACE command from AQL reports

±--------------------------------------±------------+ | name | value | ±--------------------------------------±------------+ | “ns_cluster_size” | “1” | | “effective_replication_factor” | “1” | | “objects” | “1” | | “tombstones” | “0” | | “master_objects” | “1” | | “master_tombstones” | “0” | | “prole_objects” | “0” | | “prole_tombstones” | “0” | | “non_replica_objects” | “0” | | “non_replica_tombstones” | “0” | | “stop_writes” | “false” | | “hwm_breached” | “false” | | “current_time” | “277395427” | | “non_expirable_objects” | “0” | | “expired_objects” | “0” | | “evicted_objects” | “0” | | “evict_ttl” | “0” | | “nsup_cycle_duration” | “0” | | “nsup_cycle_sleep_pct” | “0” | | “truncate_lut” | “0” | | “truncated_records” | “0” | | “memory_used_bytes” | “766988” | | “memory_used_data_bytes” | “766924” | | “memory_used_index_bytes” | “64” | | “memory_used_sindex_bytes” | “0” |

Other observations

  • On record deletion, memory utilisation reported by TOP command shows very little variation
  • Memory utilisation reported by TOP doesn’t go down even after waiting for more than 5 min.

RUN 2 - When same record is created using a single write operation with MAP containing 1500 entries

  • TOP command reports memory utilisation for Aerospike to be about 110 MB. Initial (just before starting UDF operations) memory utilisation of Aerospike was about 105MB.
  • STAT NAMESPACE command from AQL reports ±--------------------------------------±------------+ | name | value | ±--------------------------------------±------------+ | “ns_cluster_size” | “1” | | “effective_replication_factor” | “1” | | “objects” | “1” | | “tombstones” | “0” | | “master_objects” | “1” | | “master_tombstones” | “0” | | “prole_objects” | “0” | | “prole_tombstones” | “0” | | “non_replica_objects” | “0” | | “non_replica_tombstones” | “0” | | “stop_writes” | “false” | | “hwm_breached” | “false” | | “current_time” | “277396540” | | “non_expirable_objects” | “0” | | “expired_objects” | “0” | | “evicted_objects” | “0” | | “evict_ttl” | “0” | | “nsup_cycle_duration” | “0” | | “nsup_cycle_sleep_pct” | “0” | | “truncate_lut” | “0” | | “truncated_records” | “0” | | “memory_used_bytes” | “766988” | | “memory_used_data_bytes” | “766924” | | “memory_used_index_bytes” | “64” | | “memory_used_sindex_bytes” | “0” |

The same behaviour was also observed on Server version 3.7.4

Can someone provide explanation for this?

It’s possible that the Lua cache is consuming a lot of memory. The server will cache up to 128 Lua states if enough copies of the same UDF are executing concurrently on the same node.

Un-registering (removing) the relevant UDF module should destroy any cache entries for it. Then wait 5 or 10 minutes for the Lua garbage collection to complete.

As a test, you might also try running without the UDF cache, though this will impact performance. In the mod-lua {} section of the server config file set cache-enabled false, and restart the server.

Hello Brad,

I tried by setting udf-cache off. And, Memory utilisation seems to be in control. But I have some questions on udf cache

  • With UDF cache on, how come memory utilisation goes by such a high value? (difference in memory utilisation between cache off and cache on is about 700M, without cache, aerospike consumes only about 120MB, with cache consumption goes up drastically, 850MB for 1500 udf invocations, and it goes in GBs for even higher number of UDF invocations)

  • Another observation is, memory utilisation with caching depends on size of the data getting handled (stored in the record) by that UDF. For 1500 UDF invocation, for 20 bytes of string size, Aerospike utilisation is around 280MB, for 200 bytes of string size, Aerospike utilisation is around 500MB, for 500 bytes of string size, Aerospike utilisation is around 850MB. In any of the scenarios described above, actual memory used for data never crosses 2-3 MB as per namespace stat.

Does udf cache also contains actual data getting handled or is it just pure byte code of UDF, cached for optimisation?

What is the recommended approach when a particular UDF is going to be called very large number of times (like 1000s of invocations per sec)?

Hi Brad, I will try this out, but I have doubts if this is because of Lua cache. Firstly, map additions are happening at the rate of 20 per second, i.e there is 50ms interval between 2 successive map operation. So, I don’t expect lot of parallel Lua executions. Another observation is that, memory utilisation is directly proportional to number of map entries, when map entries are more than 3300, it consumes 2GB+ memory as per top output. These entries are again created with constant rate of 20 per second. So, there is some correlation with number of actual write operations performed on that record.

Regards, Ajinkya Joshi

Hello,

The cache entries can be surprisingly large, as Lua states contain libraries and other overhead. The state does contain data being used, but the fact that your memory usage remains high after the UDFs have completed, and the dependence of memory usage on number of UDF calls and size of strings, suggests that garbage collection is very slow, or that there may be a leak somewhere.

Have you tried unregistering the module (“remove” from AQL) after running the UDFs, and waiting a while? That should return the memory.

The caching system is a tradeoff between performance (speed) and memory usage. If speed is not important when creating this record, you could disable the cache. Or, you could use the cache but reduce the UDF invocation rate so that the cache is less likely to reach its maximum of 128 entries.

Depending on your needs, it could also be worth considering using the CDT functionality to manage your maps, instead of using UDFs.

Hello Brad,

I can confirm that after removing UDF from the server, memory utilisation drops down significantly. I have following questions on UDF usage -

  1. If there is limit of 128 entries in the UDF cache, memory increase that I am observing should also stop once 128 cache entires are created. But, what I am observing is, memory utilisation depends on number of udf invocations and I have seen it grow to 2 GB+ for 2000+ udf invocations.
  2. Why don’t those cache entries get flush? I don’t see any GC setting related to udf cache.
  3. Is there any way to change this number of 128 entries?
  4. Could you explain what is the real benefit of storing actual data in the cache? Because, invoking one udf function for 1500 times, for creating 1 record of size ~750KB , end up consuming ~700 MB of cache. This doesn’t sound correct to me.
  5. Does memory consumed by cache depend of how many records are getting accessed? Like, if same udf is invoked by different records vs just 1 record.

You mentioned about using CDT operations for MAP management, but UDF provides a unique feature which other operations don’t provide, i.e. reading and updating record in atomic operation. So, in that way UDFs are very useful and may not be replaceable by other equivalent operations.

One more observations -

On executing same number of UDF operations on different records didn’t cause any spike in memory utilisation. Is the cache related to udf/record combination?

Hi Ajinkya,

  1. Once the UDFs have all completed and there are 128 cache entries, yes, the memory used by the cache should not grow if more UDFs are subsequently run and completed. If memory does continue to grow, it suggests that memory is leaking somewhere, or that Lua’s garbage collection is not working properly.
  2. Every UDF invocation needs a Lua state to execute in. If there’s a cached state available, the UDF takes it and uses it. Otherwise, a new Lua state is generated (which allocates memory, and of course takes a little bit of time) and the UDF uses that one. When the UDF completes, memory associated with the record it operated on internally should be freed, partly by calling Lua’s garbage collector, which should free the storage used for manipulating map data within the UDF. Then the Lua state is returned to the cache, unless there are already 128 entries in the cache, in which case the Lua state is simply closed, which frees up everything used by it (and this may also depend partly on garbage collection in Lua).
  3. The 128 entry limit is currently hardcoded, though it may become configurable in a future release.
  4. The cache should not store record data. A cached Lua state should only contain basic structural items, essential internal libraries, and any libraries needed by the UDFs contained in the Lua module associated with the cache entry (i.e. the Lua source file). The size of the cache entry varies of course, but 700 MB does seem too large even for 128 entries, and suggests a likely problem somewhere.
  5. Nothing specific to a particular record should be cached, so it shouldn’t matter how many records are accessed by the UDF calls.

Since you did not observe memory bloat when varying the records used by the UDFs, there must be a problem in the single-record case. If you can provide some UDF code and describe how you are calling the UDF, I can try to reproduce the situation and investigate. [Are you running a single-node cluster?] Or if you’d prefer, I can help you use some builtin memory accounting server functionality that has proven useful for identifying leaks.

Hello Brad,

You can provide me with commands/setting to enable builtin memory accounting. I am currently running single node cluster. Though this was seen in multi-node cluster as well.

Hi Ajinkya,

Make sure you have the “asparsemem” tool, which is a Python script. I don’t think it is part of the standard community download, but if not you can find it on github at asparsemem

In the service {} section of the server config file, set “debug-allocations all” and restart the server.

Then run the UDFs until the memory usage is very high, as you have done before. When the memory usage is high and the UDFs are not running, issue the following info command to the server:

asinfo -v 'jem-stats:file=/tmp/jem.txt;sites=/tmp/sites.txt'

Note you can specify any path for the file and sites output files. The first file is readable and contains high-level allocation summaries, but the information about where memory was allocated but not (yet) deallocated is contained in the second file, which is less readable. Use the asparsemem script on it like this:

asparsemem --in=/tmp/sites.txt --out=/tmp/parsemem.txt

The resulting output file should contain entries that begin with a code location (file:line number) and end with a byte count. The biggest byte counts are the more interesting ones, so you could sort by the last number:

sort -n -t\| -k 3 /tmp/parsemem.txt

Hopefully this will show where those hundreds of megabytes were allocated and lead us to possible leaks. Good luck!

Hello Brad,

I have enabled memory allocation debugger and used the asparsemem tool. Detailed result along with the UDF code can be found over here.

In a nutshell I see following two as culprit - /work/source/modules/common/src/main/aerospike/as_string.c:27|0x55543d|108072000 ??:0|0x7f40b967c77a|593484440

There are few instances like this where allocation details (filename) is not getting displayed.

Client side logic -

Single record - Call the same UDF periodically (every 50 ms) using C client async API. It passes an integer argument, a unique string for map key and a string of size 500 bytes as argument to the UDF. This UDF is called for 1500 times. This accumulates total of 850+ MB (post experiment) for Aerospike memory, with initial memory of around 110MB (pre experiment).

Multiple records - Follow same procedure for single record, except that for every UDF execution new record is created. So essentially, there are 1500 records instead of 1 records having 1500 MAP entries. Aerospike memory goes upto 125MB (post experiment), with initial memory of around 110MB (pre experiment)

Hi Ajinkya,

Thanks for providing that information. It does look like there could be a leak, and I’m trying to create a good reproducible case here… could you provide your server config file? Also, are the map values unique 500 byte strings, or are they all the same? What size are the map key strings?

Thanks!

Hello Brad,

Server side config is what comes by default with 3.16.0.6 distribution. I am using test namespace for the experiments.

on the actual data, 500 bytes of string could be anything, I have seen it with unique strings as well as same string for all the entires. MAP key is around 20 bytes.

Hi Ajinkya,

We have a good reproducible case now. It does appear to be a Lua garbage collection issue. We sometimes rely on Lua’s garbage collector to report Lua memory usage when deciding whether to request garbage collection, and it seems that memory usage is being under-reported (that is, Lua does not claim responsibility for much of the map data), and thus we don’t request garbage collection frequently enough.

We’re still investigating this, and will make changes to the server as needed to improve Lua garbage collection in an upcoming release. Meanwhile, for a workaround you could consider using Lua’s collectgarbage() function. For example when I use collectgarbage(“step”, 40) in your add_request() function it keeps the memory usage low (and takes perhaps 1 or 2 ms per call). Note also that you can use collectgarbage(“count”) to get Lua’s reported memory usage in KB (it’s this value we think is too low).

Hi Brad,

Great that you can reproduce the issue locally. I have few questions based on what you have shared so far.

  1. If lua is under reporting memory utilisation, why does this issue not observed on disabling lua cache?

  2. Why is this issue not seen when MAP entries are spread over multiple records ?

Hi Ajinkya,

  1. When the cache is disabled, after a UDF completes we close the Lua state, which initiates Lua’s garbage collection. We only depend on Lua’s reported memory usage for initiating garbage collection when the cache is enabled (and thus we cache states rather than close them).

  2. We suspect that the underreporting may be related to reallocations associated with expanding a map as it gets more and more entries. We’re still investigating and trying to confirm this. When multiple records are used, any particular record’s map will be small and won’t need to be expanded and reallocated.