We have 5 nodes running Aerospike Community Edition build 6.4.0.15 on the aarch64 architecture.
All nodes are set to data-in-memory mode, and we rely heavily on Lua UDFs to speed up data computations by running them where the data is stored.
However, I am experiencing terrible latency issues. All nodes show low CPU utilization, acceptable network performance, and low I/O utilization, yet I am unable to increase the TPS, and the load average is high.
Based on this information, I suspect that the transactions might be blocked, which could be a result of hot keys. However, when I captured a performance profile on the same node and generated a flame graph from it, it seems that the majority of the CPU time and blocked wait time is spent in the Lua code called from udf_master, which contradicts the previous conclusion.
fasd-fire.svg.tar.gz (210.5 KB)
I have spent a significant amount of time on this issue, and now I feel that I should seek help from experienced developers here. Anyone who can provide me with some guidance and suggestions would be greatly appreciated.
Hi pgupta.
Thank you for your reply on this issue. My friend had previously provided a similar suggestion, and I’ve already tried it. I changed the logging level for the UDF using the following command:
Based on the logs, the Lua state cache appears to be functioning adequately for our current needs. The cache hit rate is consistently high, and the miss count is not growing.
I appreciate your help on this. If you have any additional suggestions, I would be grateful to know.
It's possible to do a lot of weird things in a UDF. Would you mind sharing the Lua code for your function? Is it possible to reproduce the slow behavior in a test cluster with conditions you can share?
Sorry, I can’t share the Lua code - my boss doesn’t allow it. But I can tell you that the Lua code is mostly about creating lots of bins in the record, and doing some data processing and math calculations inside it. Apart from the record data, there aren’t any other dependencies.
I can’t always reproduce this problem in other environments. As the first monitoring graph shows, not all the nodes are as badly affected as the problematic one. This seems to be related to the data being processed - I’m guessing it might be some hot keys or large records, but my analysis so far hasn’t been able to confirm that.
I’ve tried printing out the PK and latency for every access on the client side. Then I used client.get_key_partition_id to figure out which partition each PK is in, and matched that up with the partition-to-node mapping in replica-all to see which node each PK is on. I was hoping to use that to track the slow queries and hot keys across the nodes, but the results showed a uniform distribution across the nodes, so I’m still kind of stumped.
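For anyone who wants to repeat this kind of check, here is a minimal sketch of the aggregation step in Python. It is not the actual script used above, only an illustration under stated assumptions: `partition_of` is a hypothetical callable standing in for something like client.get_key_partition_id, `node_of_partition` is a plain dict built by hand from the replica-all output, and the 50 ms slow threshold is arbitrary. Besides grouping latencies per node, it also counts accesses per PK, since an even spread across nodes does not by itself rule out a few hot keys.

```python
import math
from collections import Counter, defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(
    samples: Iterable[Tuple[Hashable, float]],
    partition_of: Callable[[Hashable], int],   # hypothetical: PK -> partition id
    node_of_partition: Dict[int, str],         # assumed: partition id -> master node
    slow_ms: float = 50.0,                     # arbitrary threshold for "slow"
):
    """Group client-side (pk, latency_ms) samples by node and by key."""
    per_node = defaultdict(list)
    key_hits = Counter()
    slow_keys = Counter()

    for pk, latency_ms in samples:
        node = node_of_partition[partition_of(pk)]
        per_node[node].append(latency_ms)
        key_hits[pk] += 1
        if latency_ms >= slow_ms:
            slow_keys[pk] += 1

    node_stats = {
        node: {
            "count": len(latencies),
            "p99_ms": percentile(latencies, 99),
            "slow": sum(1 for x in latencies if x >= slow_ms),
        }
        for node, latencies in per_node.items()
    }
    # The most frequently accessed keys and the keys most often slow
    # are the first candidates to inspect for hot keys or large records.
    return node_stats, key_hits.most_common(5), slow_keys.most_common(5)
```

If the per-node counts look uniform but a handful of PKs dominate the slow-key list, that would point toward specific records rather than a specific node.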