I am facing “AEROSPIKE_ERR_DEVICE_OVERLOAD” during writes (using pipeline writes) and wondering whether there is any C client API through which the client can periodically pull “write-q” (un-flushed write buffers) information from the server, to determine that device writes are not scaling. The goal is to implement a back-pressure mechanism in the client code to slow things down once the server buffers start approaching the limit. I am using server version 4.3.8.
I have thought about back pressure too, but finally decided that it is not as simple to implement as it seems, because:

- the info protocol, as @Albot suggested, returns the write-q value on a per-device basis
- a single server may contain multiple devices
- there are multiple servers in the cluster
So imagine that there is just a single slow disk in a cluster of multiple machines and its write-q increases. How do you decide whether to stop writes or to continue, if you don’t know which device a particular record will land on?
Moreover, with the info protocol there is no guarantee that you won’t get a device overload error between two info requests.
What I’d like to understand in my question here is whether there is any possibility of losses when a client writes a lot of data with the non-blocking API, fills up the write cache (the max-write-cache config option), gets an error, then sleeps for some period of time and retries all the requests that previously led to device overload errors.
Your client write is not written to the write-block buffer if the write-q is full. The first thing that is checked, once a transaction is determined to be a write, is whether the device it is destined for (which is deterministic) has a full write-q. You can validate this by inspecting the CE code.
@Albot Yes, retry with delay will be there in case the feedback mechanism is not working as expected. We are trying to build a proactive mechanism so that the client can slow down under pressure and recover once the server settles.
@szhem We will mostly be working with a homogeneous set of disks and machines, so we hope they will operate at the same performance level. But you are right, there are variations, such as a hot spot on a disk or the background defragmenter, which can consume a certain disk’s bandwidth disproportionately. We probably need to do more math to first identify any laggard disk and then calculate overall cluster throughput assuming that all other disks operate at the same (laggard) disk speed. With this assumption, the better disks will be under-utilized.