It looks like your node is not able to keep up… increasing max-write-cache would just 'hide' the errors, which are themselves an indication that the storage subsystem is falling behind. Try capping the throughput until you get to a point where you can sustain the workload.
Reducing the write-block-size also effectively shrinks the post-write-queue, since that queue is sized in write blocks… so you can help the storage subsystem by increasing the post-write-queue and by reducing the throughput.
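For reference, this is roughly the storage-engine stanza those parameters live in (the device path and sizes are placeholders, not a recommendation for your hardware):

```
namespace test {
    memory-size 4G
    replication-factor 2

    storage-engine device {
        device /dev/sdb1          # placeholder raw SSD device
        write-block-size 128K
        post-write-queue 2048     # counted in write blocks: 2048 x 128K is ~256M cached per device
        # max-write-cache 64M     # default; raising it only masks queue-too-deep errors
    }
}
```

The post-write-queue keeps recently written blocks in memory, so reads of hot, recently written records are served from RAM instead of hitting the device.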
We are using version 3.12 and have already set post-write-queue to 2048 (the maximum), but it still did not stabilise. Also, we are using the async client for benchmarking; how can we reduce throughput in that case? [ AFAIK we can set concurrency parameters, but throughput would then be governed by the latency. ]
We also noticed a pattern: as soon as read disk ops come into the picture, we start seeing drops, and then that mountain-shaped pattern repeats.
UPDATE: The issue was caused by a wrong configuration recipe. We changed the above configuration to use device instead of file, and this seems to work well.
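For anyone hitting the same thing, the change was roughly the following (paths and sizes here are placeholders, not our exact values):

```
# Before: file-backed storage-engine
storage-engine device {
    file /opt/aerospike/data/test.dat
    filesize 16G
}

# After: raw device-backed storage-engine
storage-engine device {
    device /dev/sdb1
    write-block-size 128K
    post-write-queue 2048
}
```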
Thanks for the update. Here is the doc for the benchmark tool. Specifically, the -g option limits throughput and -z controls the number of threads.
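As an example, a throttled run could look something like this (assuming the Java client's run_benchmarks script; -g and -z are the flags mentioned above, while the host, namespace, key count and workload values are placeholders to adapt from the doc):

```
# Cap the benchmark at ~20k transactions/sec using 16 client threads.
./run_benchmarks -h 127.0.0.1 -p 3000 -n test \
    -k 10000000 -w RU,50 \
    -g 20000 -z 16
```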
Check the read-page-cache parameter as well. The file config would have given you some of that behaviour already, so I am surprised it is making such a difference for you… maybe you are a bit limited in RAM and the page-cache churn hurts at some point? In that case, having device rather than file would give more consistent/predictable performance.
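If your server version has it (it is not in every release, so check the configuration reference for your version first), it is a one-line addition to the device stanza, e.g.:

```
storage-engine device {
    device /dev/sdb1         # placeholder
    read-page-cache true     # let reads go through the OS page cache, similar to what file storage gave you
}
```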