[AER-4737] Potential dropped packets when coalescing fabric messages

Guy_Sela · February 10, 2016, 1:03pm

This bugfix was included in 3.7.3.

In which version was this bug introduced?
What are the symptoms of this issue?

I am using a 3-node cluster and recently upgraded to 3.7.2. Since the upgrade, when I take one of the nodes down, there is a short period of 2-4 seconds during which a get() operation appears to lose data: it returns an empty value, and only after those 2-4 seconds is the record fetched successfully. This happens on only one of the nodes, while the other available node is fine. I am using replication factor = 3 on this data.

kporter · February 10, 2016, 6:01pm

The bug predates 3.0.0.

The likely manifestation would be transaction timeouts and cluster disruptions. For most users this issue will be rare to non-existent. The problem becomes more common when individual records grow beyond 128KB, at which point fabric allocates a new buffer for the message when replicating or proxying to another node within the cluster. Due to a logic error in the code that checks whether we had begun filling the 128KB pre-allocated buffer, up to 128KB worth of messages could be dropped.
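To make the failure mode concrete, here is a minimal sketch of message coalescing with an oversized-message path. It is not the Aerospike fabric code; the class `CoalescingSender`, its methods, and the `send` callback are hypothetical names used only for illustration. The point it shows is that when a message too large for the shared buffer gets its own buffer, anything already pending in the shared buffer has to be flushed first, otherwise those pending bytes can be lost.

```python
BUFFER_CAPACITY = 128 * 1024  # size of the pre-allocated coalescing buffer


class CoalescingSender:
    """Hypothetical sender that batches small messages into one buffer."""

    def __init__(self, send):
        self._send = send        # assumed callable that transmits bytes
        self._buf = bytearray()  # messages coalesced so far, not yet sent

    def flush(self):
        """Transmit whatever has been coalesced, if anything."""
        if self._buf:
            self._send(bytes(self._buf))
            self._buf.clear()

    def queue(self, msg: bytes):
        if len(msg) > BUFFER_CAPACITY:
            # Oversized message: it bypasses the shared buffer.
            # Pending data must be sent first; skipping this check
            # (or getting it wrong) is how queued messages get dropped.
            self.flush()
            self._send(msg)
            return
        if len(self._buf) + len(msg) > BUFFER_CAPACITY:
            # Not enough room left: send the current batch first.
            self.flush()
        self._buf += msg
```

In this sketch, queuing a few small messages followed by one larger than 128KB results in two sends: the pending batch, then the large message on its own.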

Are you starting the client when the node starts, and is that client seeding from the returning node?
Is an error code returned?

Guy_Sela · February 14, 2016, 2:11pm

Okay I understand the reason for the bug now. It is related to the TTL of the records.

Since 3.6.x, when you run ‘service aerospike stop’ while a client is attempting a put() operation, the put() hangs, without retries, until the socket times out. In our configuration the socket timeout is 5 seconds and the TTL of the record was 2 seconds; this is the reason for the gap of 2-4 seconds between “successful” get() calls that I described earlier.
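The arithmetic behind that gap can be sketched as follows. This is a simplified model, not a measurement: it assumes the record is refreshed only by the stalled put(), that the put() blocks for the full socket timeout, and that the next put() succeeds right after the timeout. The function name and parameters are hypothetical.

```python
def empty_read_window(ttl_s, stall_s, last_write_age_s=0.0):
    """Seconds during which get() finds no record, in this simplified model.

    ttl_s            -- record TTL in seconds
    stall_s          -- how long the put() blocks (the socket timeout)
    last_write_age_s -- age of the last successful write when the stall begins
    """
    # TTL still remaining on the record at the moment the put() gets stuck.
    ttl_left = max(0.0, ttl_s - last_write_age_s)
    # The record is missing from expiry until the stalled put() gives up.
    return max(0.0, stall_s - ttl_left)


print(empty_read_window(ttl_s=2, stall_s=5))                      # 3.0
print(empty_read_window(ttl_s=2, stall_s=5, last_write_age_s=1))  # 4.0
```

With a 2-second TTL and a 5-second socket timeout, the model gives windows of 3 to 4 seconds, which falls within the 2-4 second range observed.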

What is the reason for this change? The server doesn’t close the sockets properly on ‘service stop’, which causes clients to be stuck on the operation until the socket timeout.
