Java Async Client sending SYN within 3 seconds

We are using Java Async Client 4.4.1, running on Azure VMs. We have set timeoutDelay to 10 seconds, and I believe connection_timeout_millis is 30 seconds (30000 milliseconds) by default, which we haven't changed.

We have set totalTimeout to 10 ms. The ClientPolicy timeout is also set to 10 seconds.
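For reference, here is a minimal sketch of how the timeouts described above would be set with the Java client. The policy fields (socketTimeout, totalTimeout, timeoutDelay, ClientPolicy.timeout) are the client's; the wiring below is only illustrative, not our actual code:

```java
import com.aerospike.client.policy.ClientPolicy;
import com.aerospike.client.policy.Policy;

public class TimeoutSettings {
    public static void main(String[] args) {
        // Values in milliseconds, matching the settings described above.
        ClientPolicy clientPolicy = new ClientPolicy();
        clientPolicy.timeout = 10_000;        // ClientPolicy timeout: 10 s

        Policy readPolicy = clientPolicy.readPolicyDefault;
        readPolicy.socketTimeout = 30_000;    // 30 s (reported default, shown explicitly here)
        readPolicy.totalTimeout = 10;         // 10 ms, as configured
        readPolicy.timeoutDelay = 10_000;     // 10 s
    }
}
```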

Now the problem is that the server goes into a weird state where the SYN_RECV sockets reach the limit of 512 and the network on the server chokes. I know this happens when there are too many RST packets, which is what we see in the client tcpdump attached below: the client sends SYNs, and when the server eventually responds with SYN,ACK, the client sends an RST packet.

What I want to understand is why the client is not waiting 30 seconds for the [SYN,ACK] before re-firing SYNs, which it is currently firing every <3 seconds.

The client waits only 10 ms for the connection to complete because your totalTimeout (10 ms) overrides socketTimeout (30 sec) when totalTimeout is less than socketTimeout. timeoutDelay (10 sec) should help recover these connections on timeout, but is apparently not effective enough. Note that connection recovery will not work if the server node is down.
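To make that rule concrete, here is a trivial illustration (not the client's internal code) of the capping behavior described above:

```java
public class EffectiveWait {
    // Illustration only: when totalTimeout is non-zero and smaller than
    // socketTimeout, the effective wait is capped at totalTimeout.
    static int effectiveWaitMillis(int socketTimeout, int totalTimeout) {
        return (totalTimeout > 0 && totalTimeout < socketTimeout)
                ? totalTimeout
                : socketTimeout;
    }

    public static void main(String[] args) {
        // With socketTimeout = 30 s and totalTimeout = 10 ms, the client
        // effectively waits only 10 ms before giving up on the connection.
        System.out.println(effectiveWaitMillis(30_000, 10) + " ms");
    }
}
```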

I suggest the following:

  1. Test with the latest java client version.

  2. Uncomment all of the debug statements in ConnectionRecover.java to gain more insight into what’s happening during the recovery process.

  3. After initial tests, use a larger totalTimeout that doesn’t result in an excessive number of timeouts (see the sketch after this list).
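As a sketch of point 3, something along these lines; the numbers below are placeholders to be tuned against your latency profile, not recommended values:

```java
import com.aerospike.client.policy.Policy;

public class TunedTimeouts {
    public static void main(String[] args) {
        // Placeholder values only; tune them so that timeouts become rare
        // in your environment rather than the common case.
        Policy policy = new Policy();
        policy.socketTimeout = 1_000;  // per-attempt socket timeout (ms)
        policy.totalTimeout = 3_000;   // total transaction timeout (ms), larger than socketTimeout so it no longer caps it
        policy.timeoutDelay = 3_000;   // delay before recovering a timed-out connection (ms)
        policy.maxRetries = 2;         // bounded retries to limit connection churn
    }
}
```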

@Brian It seems like timeoutDelay is not used for recovering connection failures. The state is CONNECT when that happens.

That is correct. Connection recovery works for command timeouts on existing connections, but not for timeouts during connection creation itself.

In general, recovering connections during creation is less fruitful, because a timeout on connection creation usually means the node address is bad or the node is down, in which case the connection can’t be recovered anyway. Connection creations that exceed the command’s full timeout on active nodes are less likely.

@Brian This leads to the same problem that timeoutDelay solves, in the case of cloud VMs: the network cards choke and ping latencies go haywire.

I’m currently busy with other projects, but when I get some free time, I will investigate recovering from connect timeouts when using netty.

Note that timeoutDelay helps mitigate connection churn, but it doesn’t solve the real problem: the configured timeouts are too low for your environment. They should be adjusted so that timeouts are a rare occurrence.
