Latency issues: High latency despite idle system resources - Seeking troubleshooting guidance

We have a 5-node cluster running Aerospike Community Edition build 6.4.0.15 on aarch64. All nodes run with data-in-memory, and we rely heavily on Lua UDFs so that data computations run on the node where the data is stored. However, we are experiencing severe latency issues: CPU utilization is low on every node, network performance looks fine, and I/O utilization is low, yet I cannot push TPS any higher and the load average is high.

I used the asloglatency tool to get the following results:

Latency is high for udf-start and udf-restart, but low for udf-master. It's the same for both read and write operations.
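
For reference, I pulled these numbers with invocations along these lines, one per histogram (default log path; our namespace substituted for test; exact flags may differ slightly by tool version):

asloglatency -l /var/log/aerospike/aerospike.log -h '{test}-udf-start' -t 10
asloglatency -l /var/log/aerospike/aerospike.log -h '{test}-udf-restart' -t 10
asloglatency -l /var/log/aerospike/aerospike.log -h '{test}-udf-master' -t 10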

Based on this information, I suspect the transactions are being blocked before they execute: as I understand it, a high udf-restart count means transactions keep getting parked and re-queued behind another transaction in flight on the same key, which points at hot keys. However, when I captured a perf record on the same node and generated a flame graph:

it seems that the majority of the CPU time and blocked wait is in the Lua code called from udf_master, which contradicts the previous conclusion. The flame graph is attached: fasd-fire.svg.tar.gz (210.5 KB)
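
For reference, the flame graph came from the usual perf + FlameGraph pipeline, roughly along these lines (sampling the asd process for 60 seconds; Brendan Gregg's FlameGraph scripts assumed to be checked out locally):

perf record -F 99 -g -p $(pgrep -x asd) -- sleep 60
perf script > out.perf
./stackcollapse-perf.pl out.perf | ./flamegraph.pl > fasd-fire.svg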

I have spent a significant amount of time on this issue, and I now feel I should seek help from experienced developers here. Any guidance or suggestions would be greatly appreciated.

Please see if this is your issue. If yes, try the suggested solution…

Hi pgupta, thank you for your reply. My friend had previously made a similar suggestion, and I've already tried it. I changed the logging level for the udf context using the following command:

asinfo -v "log-set:id=0;udf=info"

The resulting logs are attached:

Based on the logs, the Lua state cache appears to be functioning adequately for our current needs. The cache hit rate is consistently high, and the miss count is not growing.
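
For anyone checking the same thing: as I understand it, the Lua state cache is controlled by the cache-enabled setting in the mod-lua context of aerospike.conf (it defaults to true), roughly:

mod-lua {
    cache-enabled true
}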

I appreciate your help on this. If you have any additional suggestions, I would be grateful to know.

It's possible to do a lot of weird things in a UDF. Would you mind sharing the Lua code for your function? Is it possible to reproduce the slow behavior in a test cluster under conditions you can share?

Sorry, I can't share the Lua code - my boss doesn't allow it. But I can say that the code mostly creates a lot of bins in the record and does some data processing and math calculations on them. Apart from the record data, there are no other dependencies. A rough stand-in for the pattern is sketched below.
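
To give an idea of the shape, here is a minimal stand-in, not our real code; the bin names and the arithmetic are placeholders:

-- Record UDF sketch: create a batch of bins and do some math on them.
-- The 'in_*'/'out_*' bin names and the formula are made up for illustration.
function crunch(rec, n)
    local total = 0
    for i = 1, n do
        local v = ((rec['in_' .. i] or 0) + i) * 1.5  -- placeholder math
        rec['out_' .. i] = v                          -- one new bin per input
        total = total + v
    end
    rec['total'] = total
    if aerospike:exists(rec) then
        aerospike:update(rec)  -- persist the modified record
    else
        aerospike:create(rec)
    end
    return total
end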

I can't reliably reproduce this problem in other environments. As the first monitoring graph shows, not all nodes are as bad as the problem node. This seems to be related to the data being processed; I'm guessing hot keys or large records, but my analysis so far hasn't been able to confirm that.

I've tried printing the PK and latency for every access on the client side. Then I used client.get_key_partition_id to figure out which partition each PK falls in, and matched that against the partition-to-node mapping from replicas-all to see which node serves each PK (a sketch of this step is below). I was hoping to use that to track the slow queries and hot keys across the nodes, but the results showed a uniform distribution across nodes, so I'm still stumped.
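
For reference, a minimal sketch of the mapping step in Python (namespace and set names are placeholders; instead of the client helper I mentioned, this computes the partition id directly from the key digest, which should give the same result):

# Sketch: derive the partition id for each logged key; the node owning each
# partition then comes from the replicas-all mapping. Names are placeholders.
import aerospike

NAMESPACE, SET = 'test', 'demo'

def partition_of(key):
    # Aerospike hashes (set, key) with RIPEMD-160; the partition id is the
    # first two digest bytes, little-endian, masked to 12 bits (4096 partitions).
    digest = aerospike.calc_digest(NAMESPACE, SET, key)
    return int.from_bytes(digest[:2], 'little') & 0x0FFF

for key in ('user1', 'user2', 'user3'):  # stand-ins for the logged slow PKs
    print(key, '->', partition_of(key))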

Lots of bins? I hope you are not exceeding the 512-bin maximum for UDFs.

No, we're not exceeding it; we only have around a few dozen. Thanks for your attention.