Java Async Client sending SYN within 3 seconds

We are using the Java async client 4.4.1, running on Azure VMs. We have set timeoutDelay to 10 seconds, and I believe connection_timeout_millis is 30 seconds (30000 milliseconds) by default, which we haven’t changed.

We have set totalTimeout to 10 ms. The client policy timeout is also set to 10 seconds.
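
For reference, here is roughly how those settings map onto the client policies (a minimal sketch assuming the 4.x com.aerospike.client.policy.ClientPolicy and Policy fields; adjust the names and values to your client version):

ClientPolicy clientPolicy = new ClientPolicy();
clientPolicy.timeout = 10000;                        // initial host connection timeout: 10 seconds
Policy readPolicy = clientPolicy.readPolicyDefault;
readPolicy.socketTimeout = 30000;                    // per-attempt socket timeout: 30 seconds (default)
readPolicy.totalTimeout = 10;                        // total transaction timeout: 10 ms, as configured above
readPolicy.timeoutDelay = 10000;                     // wait up to 10 seconds after a timeout so the socket can be recovered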

Now the problem is that the server goes into a bad state where the SYN_RECV sockets reach the limit of 512 and the network on the server chokes. I know this happens when there are too many RST packets, which is what we see in the client tcpdump attached below. The client sends SYNs, and when the server eventually responds with SYN,ACK, the client sends an RST packet.

What I want to understand is why the client is not waiting 30 seconds for the [SYN,ACK] before re-firing SYNs; instead it fires a new SYN every <3 seconds.

The client waits 10 ms for the connection to complete because your totalTimeout (10 ms) overrides socketTimeout (30 seconds) when totalTimeout is less than socketTimeout. timeoutDelay (10 seconds) should help recover these connections on timeout, but it is apparently not effective enough. Note that connection recovery will not work if the server node is down.
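
In other words, the effective per-attempt wait behaves roughly like this (a simplified illustration of the rule, not the client’s actual code):

// totalTimeout, when non-zero and smaller, caps the per-attempt socket timeout.
static int effectiveSocketTimeout(int socketTimeout, int totalTimeout) {
    if (totalTimeout > 0 && (socketTimeout == 0 || totalTimeout < socketTimeout)) {
        return totalTimeout;   // e.g. totalTimeout = 10 ms caps socketTimeout = 30000 ms
    }
    return socketTimeout;      // otherwise socketTimeout applies per attempt
}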

I suggest the following:

  1. Test with the latest Java client version.

  2. Uncomment all of the debug statements in ConnectionRecover.java to gain more insight into what’s happening during the recovery process (see the logging sketch after this list).

  3. After initial tests, use a larger totalTimeout that doesn’t result in an excessive number of timeouts.
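
To capture that debug output (and the client’s other log messages) during these tests, you can install the client’s log callback before creating the client; a minimal sketch, assuming the com.aerospike.client.Log API:

// Route client log messages, including debug output once enabled, to stdout.
Log.setLevel(Log.Level.DEBUG);
Log.setCallback((level, message) -> System.out.println(level + ": " + message));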

@Brian It seems like timeoutDelay is not used for recovering connection failures. The state is CONNECT when that happens.

That is correct. Connection recovery works for command timeouts on existing connections, but not for timeouts during connection creation itself.

In general, recovering connections on creation is less fruitful because a timeout during connection creation usually means the node address is bad or the node is down, in which case the connection can’t be recovered. Connection creations that exceed the command’s full timeout on active nodes are less likely.

@Brian This leads to the same problem that timeoutDelay solves in the case of cloud VMs. The network cards choke and the ping latencies go haywire.

I’m currently busy with other projects, but when I get some free time, I will investigate recovering from connect timeouts when using netty.

Note that timeoutDelay helps mitigate connection churn, but it doesn’t solve the real problem: the configured timeouts are too low for your environment. They should be adjusted so that timeouts are a rare occurrence.
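
As a starting point, more generous values can be set on a per-command policy while you tune; a sketch where client is an existing AerospikeClient and the timeout values, key, and bin are placeholders:

WritePolicy writePolicy = new WritePolicy();
writePolicy.socketTimeout = 1000;    // illustrative values only; raise until timeouts are rare
writePolicy.totalTimeout = 2000;
writePolicy.timeoutDelay = 3000;
client.put(writePolicy, new Key("test", "demo", "key1"), new Bin("bin1", "value1"));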

I added a 1 second sleep in the server function cf_socket_accept() that is triggered when a special info command is run from the command line. I also set the following policy on the client:

p.socketTimeout = 500;   // per-attempt socket timeout in milliseconds
p.totalTimeout = 500;    // total transaction timeout in milliseconds
p.timeoutDelay = 3000;   // milliseconds to wait after a timeout before closing the connection

This scenario triggers a client timeout, but the timeout happens after the connection channel is activated and the command bytes have been written to the socket. The state COMMAND_READ_HEADER is then handled and the connection is properly recovered.

I can’t find a way to induce an async netty client timeout during the connection state. If the server is available but slow, the timeout occurs in the client read state.

If the server is not available, there is no point in trying to recover the connection. On server restarts, connections made to the old server instance will never become valid in the new server instance.

This leads to my question: How were you able to induce client timeouts during the connection state?

Hey @Brian

We have an 8 ms totalTimeout and we are on Azure. This was happening for us automatically once the traffic hit a certain threshold.

There is a new Java client branch “recover” that expands the connection recovery scope to include async netty commands in the connect state. Try this branch and let us know if it improves your connection churn.

https://github.com/aerospike/aerospike-client-java/tree/recover