Rolling restarts without quiesce and "fail_key_busy" - Correlation or causation?

In a cluster of about 20 machines, we were having a fairly consistent rate of “fail_key_busy” errors.

We needed to do a rolling restart, and at the time our Ansible code did not quiesce each node before restarting it.

After the most recent rolling restart, the number of fail_key_busy alerts per day began to increase steadily.

Is there any causation there?

Well, fail_key_busy is the server pushing back because of contention on a key. If you have nodes out or ongoing migrations, the cluster is under increased stress and transactions take longer, so requests for the same key are more likely to pile up. It does make some sense that you would see an increase. You can read more about tuning here: Hot Key error code 14
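For what it's worth, here is a rough sketch of the restart ordering the question describes as missing: quiesce each node, let the cluster settle, then restart it. This is only an illustration of the sequence, not Aerospike tooling. The `quiesce`, `restart`, and `wait_until_stable` hooks are hypothetical placeholders you would wire to whatever your Ansible or shell tasks actually run.

```python
from typing import Callable, Iterable, List


def rolling_restart(
    nodes: Iterable[str],
    quiesce: Callable[[str], None],
    restart: Callable[[str], None],
    wait_until_stable: Callable[[], None],
) -> List[str]:
    """Restart nodes one at a time, quiescing each before it goes down.

    All three callables are hypothetical hooks supplied by the caller;
    nothing here talks to a real cluster.
    """
    done: List[str] = []
    for node in nodes:
        quiesce(node)         # ask the node to hand off its traffic first
        wait_until_stable()   # let the cluster settle before taking it down
        restart(node)
        wait_until_stable()   # let migrations finish before the next node
        done.append(node)
    return done


if __name__ == "__main__":
    # Dry run that only records the order of the steps.
    log = []
    rolling_restart(
        ["node-1", "node-2"],
        quiesce=lambda n: log.append(f"quiesce {n}"),
        restart=lambda n: log.append(f"restart {n}"),
        wait_until_stable=lambda: log.append("wait"),
    )
    for step in log:
        print(step)
```

The point of the second wait is that the next node is not touched while the previous restart is still causing migrations, which is the period where contention errors would be expected to climb.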