[AER-4737] Potential dropped packets when coalescing fabric messages

This bugfix was included in 3.7.3.

  1. In which version was this bug introduced?

  2. What are the symptoms of this issue?

I am using a 3-node cluster and recently upgraded to 3.7.2. Since the upgrade, when I take one of the nodes down, there is a short period of 2-4 seconds during which I experience data loss on get() operations: the call returns an empty value, and only after 2-4 seconds is the record fetched successfully. This happens on only one of the nodes; the other available node is fine. I am using replication factor = 3 on this data.

The bug predates 3.0.0.

The likely manifestation would be transaction timeouts and cluster disruptions. For most users this issue will be rare to non-existent. The problem becomes more common when individual records grow beyond 128KB, at which point fabric allocates a new buffer for the message when replicating or proxying to another node within the cluster. Due to a logic error in the code that checks whether we had begun filling the 128KB preallocated buffer, up to 128KB worth of messages could be dropped.
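To make the failure mode above concrete, here is a minimal toy model of message coalescing in Python. It is not Aerospike's fabric code: the class `CoalescingSender`, its methods, and the `wire` list are all hypothetical names chosen for illustration, and only the 128KB buffer size comes from the description above. The sketch shows the step that has to happen when an oversized message arrives: anything already sitting in the shared buffer must be flushed before the large message goes out in its own buffer. A wrong "has the buffer started to fill?" check at that point is the kind of mistake that could lose up to one buffer's worth of messages.

```python
BUFFER_SIZE = 128 * 1024  # size of the preallocated coalescing buffer (from the text)


class CoalescingSender:
    """Toy model of message coalescing; hypothetical, not Aerospike's code."""

    def __init__(self, wire):
        self.wire = wire            # a list standing in for the socket
        self.pending = bytearray()  # contents of the preallocated buffer

    def flush(self):
        # Push whatever has been coalesced so far onto the wire.
        if self.pending:
            self.wire.append(bytes(self.pending))
            self.pending.clear()

    def send(self, msg: bytes):
        if len(msg) > BUFFER_SIZE:
            # Oversized message: it travels in its own buffer. The shared
            # buffer must be flushed first if it has started to fill;
            # skipping this step would drop the coalesced messages.
            self.flush()
            self.wire.append(msg)
            return
        if len(self.pending) + len(msg) > BUFFER_SIZE:
            # No room left: send the coalesced batch before adding more.
            self.flush()
        self.pending += msg
```

In this model, sending two small messages followed by one larger than `BUFFER_SIZE` puts the two small messages on the wire as a single coalesced write, followed by the large message, with nothing lost.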

  1. Are you starting the client when the node starts, and is that client seeding from the returning node?
  2. Is an error code returned?

Okay, I understand the reason for the bug now. It is related to the TTL of the records.

Since 3.6.x, when you run ‘service aerospike stop’ while a client is attempting a put() operation, the put() is stuck, without retries, until the socket times out. In our configuration the socket timeout is 5 seconds and the TTL of the record was 2 seconds; this is the reason for the gap of 2-4 seconds between “successful” get() calls that I described earlier.
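The timing can be sketched with a little arithmetic. This is only an illustration under an assumption not stated in the post: that the blocked put() was the write that would have refreshed the record, so nothing rewrites it until the socket timeout fires. The function name `unreadable_window` is hypothetical.

```python
def unreadable_window(socket_timeout_s: float, ttl_s: float) -> float:
    """Rough length of time a record is expired while a refreshing put() is stuck.

    Assumes the record was last written just before the put() blocked and
    that the next write succeeds right after the socket timeout.
    """
    return max(0.0, socket_timeout_s - ttl_s)


# Values from the post: 5 second socket timeout, 2 second TTL.
gap = unreadable_window(5.0, 2.0)
```

With those values the window comes out at about 3 seconds, which sits inside the observed 2-4 second range; the exact figure would vary with when the last successful write landed.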

What is the reason for this change? The server doesn’t close the sockets properly during ‘service stop’, which leaves the clients stuck on the operation until the socket timeout.