PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22141 root 20 0 22.3g 15g 2180 S 330 48.2 34870:28 asd
I have a cluster of 7 nodes and one namespace containing 6 GB of data, with a replication factor of 2 (14 GB distributed across the cluster). The situation is identical on all nodes.
The only solution I have found to recover the memory is to perform a rolling restart of the Aerospike daemons.
Which features are you using? Secondary Indexes, UDFs, LDTs, lists and maps?
If you are using Secondary Indexes, are you indexing lists and maps (indexing these types hasn’t yet been officially announced)? If so, there was a memory leak fixed for this type of index in 3.5.14.
We also noticed you have ~7,000–11,000 client connections per server. This isn’t necessarily a problem, as we have users who drive more than this, but if you aren’t expecting this number of client connections, it may help us narrow the search.
How many client instances are you running?
Per client instance, how many threads are they running?
Are your clients running in asynchronous or synchronous mode?
Also, while the memory is growing, could you try the following:
run:
for i in {1..12}; do asinfo -v "jem-stats:"; sleep 600; done
This should take 2 hours to complete and will dump additional memory usage statistics to a special console every 10 minutes.
After the above command completes, could you provide the output of the following via pastebin?
The Aerospike cluster has 7 nodes. There are 30 nodes running PHP code, with around 100 live PHP workers on each node, though probably around 400 PHP workers launched on each. I do not know how the PHP Aerospike driver handles connections, but I’m using a standard configuration. The cluster handles around 20k HTTP req/s.
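For what it’s worth, the “standard configuration” is basically the documented constructor, as in the minimal sketch below (the host address is a placeholder, and as far as I understand the second argument enables persistent connections and defaults to true, so each PHP worker keeps its own sockets to the cluster):

<?php
// Minimal sketch of the "standard configuration"; the host address is a placeholder.
// The second constructor argument enables persistent connections (reportedly the
// default), so every PHP worker process holds its own sockets to the cluster.
$config = ["hosts" => [["addr" => "10.0.0.1", "port" => 3000]]];
$client = new Aerospike($config, true);
if (!$client->isConnected()) {
    error_log("Aerospike connection failed");
    exit(1);
}

With 30 nodes × ~400 workers, each holding persistent connections to every server, the 7,000–11,000 connections per server you observed seem to be in the right ballpark.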
Per client instance, how many threads are they running?
I do not know. The PHP Aerospike driver should use a pool of connections.
Are your clients running in asynchronous or synchronous mode?
This is looking more like a memory leak; we would like to try to reproduce it locally. We are seeing that the leak seems to be in 64-byte allocations and is growing at a rate of about 20 MB per minute (roughly 5,000 allocations per second). Is there an access pattern on the client side that may correlate to this allocation?
Could you describe your client app a bit more?
Would you be able to share application/UDF code to help us reproduce this issue locally?
We are seeing that the leak seems to be in 64-byte allocations and is growing at a rate of about 20 MB per minute. Is there an access pattern on the client side that may correlate to this allocation?
The sum of used memory across all Aerospike nodes is pretty stable over time. The data changes, but not the amount of data.
Could you describe your client app a bit more?
We have three kinds of access:
standard access with get / put
increments with increment
Redis list emulation with UDFs: rpush, lpush, rpop, lpop.
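Roughly, in PHP those three patterns look like the sketch below (namespace, set, bin, and UDF module/function names are placeholders, not our real ones):

<?php
// Placeholder names throughout; not the real namespace, set, bins, or UDF module.
$key = $client->initKey("test", "demo", "user:1234");

// 1) standard access with get / put
$client->put($key, ["payload" => "value", "counter" => 0]);
if ($client->get($key, $record) === Aerospike::OK) {
    // use $record["bins"]
}

// 2) increments
$client->increment($key, "counter", 1);

// 3) redis list emulation through Lua UDFs
$client->apply($key, "redis_list", "rpush", ["mylist", "some-value"], $returned);
$client->apply($key, "redis_list", "lpop", ["mylist"], $returned);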
Hello. Thanks for sending those answers and especially the UDFs; they really help us understand your use case. We have used them in our own test application and have seen some phenomena that may replicate what you are seeing. We’ll keep working on it and keep you up to date with our results.
One question we have is: If the client load is suspended, does the memory use ever go down, or does it simply stay at the same level? Is there any way you can do that sort of test? (If it does go down, then it might indicate some sort of garbage collection latency, rather than an actual memory leak.)
Do you have any updates on this issue, especially on whether removing the client load leads to decreased memory use? We have been accumulating evidence that this is a GC latency issue rather than an actual memory leak. If you let us know what you see in your environment, and it fits the pattern we suspect, we may be able to provide you with a way to address the GC latency.
Thanks for the info. Glad it’s working for you now.
Currently, the Lua GC parameters are hard-coded “#define” constants in the code. In the Aerospike Server Community Edition open source code, you can find them using:
There is a comment block at that location in the source code describing what is known about controlling Lua GC. At this point, it would necessitate a custom build to alter these parameters, e.g., to do GC more frequently to reduce GC latency and avoid out-of-memory situations. If 3.5.15 is working for you now, however, perhaps this is not necessary?
Please let us know if you have more questions or comments or updates about your experiences using Aerospike. Thanks!
A few questions:
1). What version of the PHP client are you using?
2). Are you using AS_POLICY_COMMIT_LEVEL_MASTER? (If so, this could cause 100% “writes_master” latency to be erroneously reported.)
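For reference, and with the caveat that the PHP constant names below are quoted from memory (please double-check them against your client version), the commit level is typically set through the per-call options, along these lines:

<?php
// Hedged sketch; verify the constant names against your PHP client version.
// POLICY_COMMIT_LEVEL_MASTER acknowledges the write once the master has applied
// it, instead of waiting for the replica (POLICY_COMMIT_LEVEL_ALL, the default).
$status = $client->put($key, ["bin1" => "value"], 0, [
    Aerospike::OPT_POLICY_COMMIT_LEVEL => Aerospike::POLICY_COMMIT_LEVEL_MASTER,
]);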
Thanks.