rw_err_ack_nomatch errors


#1

Hi

We are monitoring our Aerospike cluster through metrics (collectd plugin) and frequently see ‘rw_err_ack_nomatch’ errors.

The metrics docs describe it as “Number of prole write acknowledgments started but went amiss/have mismatched information”.

The meaning is not clear to me. I have a replication factor of 3. Does this mean that a write could not be successfully applied to the replicas?

Does this indicate write loss, only lost acks, or an inconsistency between master and replica nodes? Does it point to a flaky network or a problem in my config? Should I be worried about it? Also, all my err_write* metrics are zero.

Can someone please help clarify this metric, and whether I should be worried about it? Thanks!


#2

Any pointers on understanding this metric would be helpful :smile: I am regularly getting these errors. Thanks!


#3

The counter increases when a replica finishes its work and acknowledges the master, but the master finds that the transaction has already finished.

This happens when the master sends a retry message to the replicas: a replica then receives the write twice and acknowledges the master both times. On the first ack, the master completes the transaction. On the second, the ack no longer matches an in-flight transaction, so the counter increases and the ack is ignored.
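To make the sequence concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not Aerospike's actual code: the `ToyMaster` class, its methods, and the replica names are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Toy stand-in for a master node tracking in-flight replica writes."""
    # txn_id -> set of replicas that have not acked yet (hypothetical structure)
    pending: dict = field(default_factory=dict)
    # stands in for the rw_err_ack_nomatch counter
    ack_nomatch: int = 0

    def start_write(self, txn_id, replicas):
        self.pending[txn_id] = set(replicas)

    def on_ack(self, txn_id, replica):
        waiting = self.pending.get(txn_id)
        if waiting is None:
            # The transaction already finished: count the ack and ignore it.
            self.ack_nomatch += 1
            return
        waiting.discard(replica)
        if not waiting:
            # Every replica has acked, so the transaction completes.
            del self.pending[txn_id]


# Replication factor 3: one master plus two replicas.
master = ToyMaster()
master.start_write(1, ["replica-a", "replica-b"])

# The master retried, so each replica applies the write twice and acks twice.
for replica in ["replica-a", "replica-b", "replica-a", "replica-b"]:
    master.on_ack(1, replica)

print(master.ack_nomatch)  # 2: the two late acks arrived after completion
```

Note that the write itself succeeded on both replicas in this model; only the duplicate acks are counted.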

This is not a big concern. If it happens very often, you may want to look into why the master is retrying so often (the default retry interval is 1 second).
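One rough way to judge "very often" is to compare how much the counter grew with how many writes were served over the same interval. The sketch below assumes you already have two metric samples as dictionaries; the `write_reqs` key and the sample values are placeholders, so substitute whatever write counter your collectd plugin exports.

```python
def nomatch_ratio(prev, curr, writes_key="write_reqs"):
    """Unmatched acks per write between two samples, or None if not computable."""
    d_nomatch = curr["rw_err_ack_nomatch"] - prev["rw_err_ack_nomatch"]
    d_writes = curr[writes_key] - prev[writes_key]
    if d_nomatch < 0 or d_writes <= 0:
        # Counters went backwards (e.g. node restart) or there were no writes.
        return None
    return d_nomatch / d_writes


# Made-up sample values, for illustration only.
earlier = {"rw_err_ack_nomatch": 100, "write_reqs": 10_000}
later = {"rw_err_ack_nomatch": 150, "write_reqs": 20_000}

ratio = nomatch_ratio(earlier, later)
```

A small, steady ratio is consistent with occasional retries; a ratio that climbs over time suggests looking at why the master is retrying.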