Here is my alternative version of this topic (created independently).
I have thought about back pressure too, but finally decided that it is not as simple to implement as it seems, because:
- the info protocol, as @Albot suggested, returns the write-q value on a per-device basis
- a single server may contain multiple devices
- there are multiple servers in the cluster
So imagine that there is just a single slow disk in a cluster of multiple machines, and its write-q is growing. How do you decide whether to stop writes or continue if you don’t know which device a given record will land on?
Moreover, with the info protocol there is no guarantee that you won’t get a device overload error between two info requests.
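To make the per-device problem concrete, here is a small sketch. Everything in it is made up for illustration: the node names, device paths, write-q numbers and the threshold are all hypothetical, and it does not call any real client or info API. It only shows that one overloaded device out of six gives the client no clear global signal.

```python
# Hypothetical snapshot of write-q per device, per node (all numbers invented).
write_q = {
    "node-a": {"/dev/sdb": 3, "/dev/sdc": 5},
    "node-b": {"/dev/sdb": 2, "/dev/sdc": 480},  # the single slow disk
    "node-c": {"/dev/sdb": 4, "/dev/sdc": 1},
}

THRESHOLD = 256  # hypothetical limit above which a device counts as overloaded


def overloaded_devices(snapshot, threshold):
    """Return (node, device) pairs whose write-q is at or above the threshold."""
    return [
        (node, dev)
        for node, devices in snapshot.items()
        for dev, q in devices.items()
        if q >= threshold
    ]


def overloaded_fraction(snapshot, threshold):
    """Fraction of all devices in the snapshot that are overloaded."""
    total = sum(len(devices) for devices in snapshot.values())
    return len(overloaded_devices(snapshot, threshold)) / total
```

With this snapshot, a "stop if any device is overloaded" policy would pause all writes even though five of six devices are healthy, while a "stop if most devices are overloaded" policy would keep hitting the slow disk. And the snapshot may already be stale by the time the next write is sent.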
What I’d like to understand here is whether any data can be lost in the following scenario: a client writes a lot of data through the non-blocking API, fills up the write cache (the max-write-cache config option), gets a device overload error, sleeps for some time, and then retries all the requests that previously failed with that error.
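The retry pattern I have in mind looks roughly like the sketch below. It is not tied to any real client: `DeviceOverloadError`, `write_fn` and `write_all_with_retry` are hypothetical names, and I assume the client reports the overload error per request so that each failed record can be identified.

```python
import time


class DeviceOverloadError(Exception):
    """Hypothetical stand-in for the client's device overload error."""


def write_all_with_retry(records, write_fn, max_rounds=5, base_sleep=0.1,
                         sleep_fn=time.sleep):
    """Write every record, retrying only those rejected with device overload.

    write_fn is a hypothetical callable that writes one record and raises
    DeviceOverloadError when the server rejects it.
    Returns the records that were still failing after max_rounds.
    """
    pending = list(records)
    for round_no in range(max_rounds):
        failed = []
        for rec in pending:
            try:
                write_fn(rec)
            except DeviceOverloadError:
                failed.append(rec)  # keep it for the next round
        if not failed:
            return []
        pending = failed
        if round_no < max_rounds - 1:
            # Back off before the next round; the delay doubles each time.
            sleep_fn(base_sleep * (2 ** round_no))
    return pending
```

On the client side nothing is dropped here: a record leaves `pending` only when `write_fn` returns without an error. What this sketch cannot show is the server side, that is, whether a write that was accepted into the write cache can still be lost, which is the actual question.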