java.io.EOFException error when Aerospike cluster does replication

RockyTu · October 12, 2016, 3:39am

Hi all,

Version: 3.9.0, nodes: 4

I found that when several nodes of an Aerospike cluster restart and rejoin the cluster, the cluster replicates and migrates data. While that replication is still in progress, writing to Aerospike with the Java client fails with java.io.EOFException. Once the replication has finished, the same operation succeeds.

Is this expected behavior? How can I work around it? Maybe by changing the write policy?

ERROR NALIB - Corrupted! key is 11621_1474550400000000
ERROR NALIB - java.io.EOFException
com.aerospike.client.AerospikeException: java.io.EOFException
    at com.aerospike.client.command.SyncCommand.execute(SyncCommand.java:91) ~[na-zeromq-client-jar-with-dependencies.jar:?]
    at com.aerospike.client.AerospikeClient.append(AerospikeClient.java:362) ~[na-zeromq-client-jar-with-dependencies.jar:?]
    at com.briphant.na.lib.dao.aerospike.AerospikeDAO$1.append(AerospikeDAO.java:786) ~[na-zeromq-client-jar-with-dependencies.jar:?]
    at com.briphant.na.lib.dao.aerospike.InsertTask.run(InsertTask.java:50) [na-zeromq-client-jar-with-dependencies.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_101]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_101]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_101]

Brian · October 12, 2016, 6:31pm

EOFException is thrown by the client when it tries to use a socket that was closed by the server, or when the server has shut down. This is expected behavior because the client doesn’t know whether an open socket is still valid until it tries to use it.

There are retry policy settings that will help reads complete if a server node has shut down.

ClientPolicy clientPolicy = new ClientPolicy();
clientPolicy.requestProleReplicas = true;

Policy policy = new Policy();
policy.maxRetries = 1;
policy.sleepBetweenRetries = 0;
policy.retryOnTimeout = true;
policy.replica = Replica.SEQUENCE;

These settings tell the client to retry once immediately if the read has failed, with the retry occurring on the prole node (assuming server replication factor of 2).

Writes will also retry once immediately with these settings, but the retry would still occur on the downed master node because all writes must always be directed towards the master node. You may want to adjust sleepBetweenRetries for writes to give the servers a chance to reset the data partition map.

There is a lag of approximately 2 seconds between node failure and the client receiving the updated data partition map. Once the client’s partition map is reset, the retry will be sent to the newly assigned node.
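To illustrate why a non-zero sleepBetweenRetries matters for writes, here is a minimal, library-independent sketch of a retry loop with a pause between attempts. It is not the Aerospike client's implementation: the class WriteRetrySketch, the SocketOp interface and the runWithRetry method are hypothetical names, and the sketch assumes the failure surfaces as a plain java.io.EOFException (the real client wraps it in AerospikeException).

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of the Aerospike client.
final class WriteRetrySketch {

    // Hypothetical stand-in for one write attempt over a socket.
    @FunctionalInterface
    interface SocketOp<T> {
        T run() throws IOException;
    }

    // Runs op once, then up to maxRetries more times if it fails with EOFException,
    // sleeping sleepBetweenRetriesMs between attempts.
    static <T> T runWithRetry(SocketOp<T> op, int maxRetries, long sleepBetweenRetriesMs) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        EOFException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return op.run();
            } catch (EOFException e) {
                last = e; // socket was closed by the peer; eligible for retry
            } catch (IOException e) {
                throw new UncheckedIOException(e); // other I/O errors are not retried here
            }
            if (attempt < maxRetries && sleepBetweenRetriesMs > 0) {
                try {
                    // Pause so the partition map has a chance to refresh before the next attempt.
                    Thread.sleep(sleepBetweenRetriesMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
        throw new UncheckedIOException(last);
    }
}
```

With the roughly 2 second lag described above, a call such as runWithRetry(op, 1, 2000) would give the second attempt a better chance of reaching the newly assigned node. One caveat worth keeping in mind: an operation like append is not idempotent, so a retry after a dropped connection may apply the change twice if the first attempt did reach the server.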

RockyTu · October 13, 2016, 6:56am

Thanks Brian. I find it takes a long time for “Replica Objects” to catch up with “Master Objects” when one node shuts down or rejoins. So is the EOF issue expected during this long period? If I change the policy as you suggested, will that fully solve the issue?

RockyTu · October 13, 2016, 7:07am

The new node shut down and rejoined the cluster very early on, but after more than 10 hours the replication is still not complete.

The master object count is 21,773,707 and the replica object count is 21,773,670, yet the issue is still triggered. Is that expected?

kporter · October 13, 2016, 6:54pm

See below where I addressed a related inconsistency in another thread.

Brian · October 13, 2016, 7:17pm

The EOF issue should only happen when the server closes the socket or when the server shuts down. Migrations should not cause this.

RockyTu · October 14, 2016, 3:51am

Hi Brian, I did find a relationship between the EOF issue and the replica object count, but I don’t know the reason. If you have an environment available, you can test it with the following scenario on Community 3.9.0:

A new node finishes rejoining the cluster. Its Replica Objects count starts at 0 in AMC.
While Replica Objects is far lower than Master Objects, the migration is not complete.
Repeatedly read and update many records. The issue may be triggered.

But once Replica Objects catches up with Master Objects, the issue is never triggered.

Brian · October 14, 2016, 5:44pm

When migrations are occurring, server nodes are much more likely to receive requests on the wrong node, and they then need to proxy the request to the correct node. For this proxy, the server uses a timeout of 1 second when the original request has no timeout or a timeout greater than 1 second. Otherwise, the proxy timeout is the same as the original request timeout.
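The proxy timeout rule can be written as a small function. This is only a model of the behavior described here, not server code: the class and method names are hypothetical, and it assumes that a request timeout of 0 (or less) means "no timeout".

```java
// Hypothetical model of the proxy timeout rule; not Aerospike server code.
final class ProxyTimeoutSketch {

    // Cap applied to proxied requests: 1 second.
    static final int PROXY_CAP_MS = 1000;

    // Assumption: requestTimeoutMs <= 0 means the original request has no timeout.
    static int proxyTimeoutMs(int requestTimeoutMs) {
        if (requestTimeoutMs <= 0 || requestTimeoutMs > PROXY_CAP_MS) {
            return PROXY_CAP_MS; // no timeout, or longer than 1 second: use 1 second
        }
        return requestTimeoutMs; // otherwise keep the original request timeout
    }
}
```

Under this model, a request with no timeout or with a 5000 ms timeout is proxied with a 1000 ms limit, while a request with a 250 ms timeout keeps its 250 ms limit.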

If the proxy times out, the server just closes the original connection without notifying the client of a timeout. This causes an EOF exception on the client. I’m told that the server will be fixed to send a timeout response to the client before closing the connection on proxies.
