Debugging read performance slowdown during large write burst


#1

Hi there, I’m running Aerospike 3.7.4.1 and am having some trouble tuning my read and write workloads to play nicely with each other. The cluster consists of 6 machines on Google Compute Engine with 2x NVMe SSDs apiece.

Under normal circumstances the cluster services about 15K write TPS; I derived this by sampling stat_write_reqs over time. We also have a batch process that generates a high write volume for a short period. I believe the writes generated by this process touch more bins than our normal writes, but I can't confirm this. (Is there a histogram measuring the size of each write being serviced at any given point in time?)
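For context, the 15K figure comes from sampling the cumulative counter twice and dividing by the interval. A minimal sketch of that arithmetic, with made-up sample values:

```python
# Sketch: deriving average TPS from two samples of a cumulative
# counter such as stat_write_reqs. The readings below are invented
# for illustration, not real cluster data.

def tps(count_t0, count_t1, interval_seconds):
    """Average transactions per second over the sampling interval."""
    return (count_t1 - count_t0) / interval_seconds

# Two hypothetical readings of stat_write_reqs, taken 10 seconds apart.
print(tps(1_000_000, 1_150_000, 10))  # -> 15000.0, i.e. ~15K write TPS
```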

During the batch process, write TPS climbs to about 25K (see [1], top-left quadrant) and read requests suddenly begin to queue up (see [1], bottom-right quadrant, a graph of the batch_queue metric; we only use BatchGet() for reads). At the same time we observed a jump in read duration and a large, strongly correlated increase in the 'await' I/O stat (see [2]).

So the hypothesis is: during the write storm, write I/Os unfairly crowd out queued read operations, delaying any reads that arrive while the storm lasts. We would therefore like to favor reads over writes at all times, so that a write storm cannot affect read latency. How can we rate-limit total write I/O to that end?

As a first step, I noticed the 'transaction-queues' parameter, which is recommended to be set to the number of cores on the machine. Each of our 6 cluster members has 8 cores, so I set it to 8 and ran an A/B test across members. It didn't seem to make much of a difference, which is why I'm posting here: is there another well-defined way to do this?
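For reference, this is how I set it in aerospike.conf on the test members. The transaction-threads-per-queue line is only the companion knob (I believe 4 is the default); I'm showing it for context, not claiming we changed it:

```
service {
    transaction-queues 8              # one per core, per the recommendation
    transaction-threads-per-queue 4   # companion knob, default value
}
```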

PS. We also tried submitting write requests with MEDIUM priority and read requests with HIGH. This didn't seem to do much either; I suspect the bottleneck is not queueing so much as raw I/O volume on the SSDs. If we can defer writes to keep reads performant, that's exactly what we want: we don't care if writes take a long time to turn around.
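For completeness, this is roughly how we set those priorities (assuming the 3.x-era Java client, where policies still expose a priority field; as noted, priority only reorders server-side scheduling and does not throttle device I/O):

```java
import com.aerospike.client.policy.BatchPolicy;
import com.aerospike.client.policy.Priority;
import com.aerospike.client.policy.WritePolicy;

public class PolicySetup {
    // Policy used for our BatchGet() reads.
    static BatchPolicy readPolicy() {
        BatchPolicy p = new BatchPolicy();
        p.priority = Priority.HIGH;   // favor reads
        return p;
    }

    // Policy used for writes, including the batch process.
    static WritePolicy writePolicy() {
        WritePolicy p = new WritePolicy();
        p.priority = Priority.MEDIUM; // deprioritize writes
        return p;
    }
}
```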

Thanks in advance for any advice that can be provided here.

[1]

[2]


#2

Often the most expensive part of a write is the associated read of the existing record from disk, needed to merge it with the incoming bins. If your batch process is intended to overwrite records wholesale, be sure to use the REPLACE flag in your write policy, which skips that read.

http://www.aerospike.com/apidocs/java/com/aerospike/client/policy/RecordExistsAction.html#REPLACE
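A minimal sketch with the Java client (matching the linked API doc); the host, namespace, set, and key names here are invented for illustration:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.policy.RecordExistsAction;
import com.aerospike.client.policy.WritePolicy;

public class ReplaceWrite {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        WritePolicy policy = new WritePolicy();
        // REPLACE tells the server to write the record wholesale,
        // skipping the read of the existing record from disk.
        policy.recordExistsAction = RecordExistsAction.REPLACE;

        // Hypothetical namespace/set/key for illustration.
        Key key = new Key("test", "demo", "user-1");
        client.put(policy, key, new Bin("name", "value"));
        client.close();
    }
}
```

One caveat: REPLACE is only appropriate when each write carries the full record, since any bins not included in the put are dropped rather than preserved.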