We use Aerospike extensively in our organization, and lately we have been hitting a recurring issue: the number of client connections to Aerospike suddenly spikes. It reaches 15K connections, at which point the Aerospike node stops working and we have to restart the Aerospike service. Most of the connections come from one of our Netty servers. We upgraded our client from version 3.3.4 to 4.1.8, but we still face the same issue.
ClientPolicy.maxConnsPerNode (default 300) defines the upper bound on the number of connections allowed to each node. It appears that maxConnsPerNode has been set to a much higher value in the Aerospike client on your Netty server.
We have not set any value for ClientPolicy.maxConnsPerNode, so it should be the default of 300 per node. 90% of the connections are in the CLOSE_WAIT state, and our cluster is going down because of the number of connections. We have to restart the service frequently.
Many sockets in the CLOSE_WAIT state usually means there are lots of timeouts occurring. A client timeout forces the socket to close. If the socket weren't closed on a client timeout, data from the previous transaction could be received in the next transaction when the socket is reused.
Some ways to solve this are:
Increase transaction timeout values (Policy.socketTimeout and Policy.totalTimeout), so that timeouts occur less frequently.
Add extra hardware resources (machines, network bandwidth, etc…) to increase capacity, thus alleviating bottlenecks/timeouts.
Another reason for large numbers of closed sockets is a ClientPolicy.maxSocketIdle (default 55 seconds) that is greater than the server’s proto-fd-idle-ms (default 60000 ms, i.e. 60 seconds). ClientPolicy.maxSocketIdle should always be less than proto-fd-idle-ms.
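As a minimal sketch of the two client-side settings discussed so far, the snippet below sets maxConnsPerNode and maxSocketIdle explicitly and checks the idle time against the server value. The server value is hard-coded on the assumption that proto-fd-idle-ms is still at its default, and the host and port are placeholders.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.policy.ClientPolicy;

public class ClientPolicyExample {
    public static void main(String[] args) {
        // Assumption: the server still runs with the default proto-fd-idle-ms.
        int serverProtoFdIdleMs = 60_000;

        ClientPolicy clientPolicy = new ClientPolicy();
        // Upper bound on connections to each node. The limit applies to each
        // AerospikeClient instance, so the instance should be shared.
        clientPolicy.maxConnsPerNode = 300;
        // Seconds a pooled socket may sit idle before the client discards it.
        clientPolicy.maxSocketIdle = 55;

        // The client must give up on idle sockets before the server reaps them.
        if (clientPolicy.maxSocketIdle * 1000 >= serverProtoFdIdleMs) {
            throw new IllegalStateException(
                "maxSocketIdle must be less than the server's proto-fd-idle-ms");
        }

        // Placeholder seed host and port.
        AerospikeClient client = new AerospikeClient(clientPolicy, "127.0.0.1", 3000);
        try {
            // ... issue reads and queries through this single shared client ...
        } finally {
            client.close();
        }
    }
}
```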
I checked the client logs. We are getting java.io.EOFException on the client side.
c.z.a.d.m.MomentsMarketLoaderDelegate - Exception while iterating from aerospike record set for moments marketing
com.aerospike.client.AerospikeException: Error Code -1: java.io.EOFException
at com.aerospike.client.command.SyncCommand.execute(SyncCommand.java:139) ~[prod-adtech-adserver.jar:na]
at com.aerospike.client.command.MultiCommand.execute(MultiCommand.java:75) ~[prod-adtech-adserver.jar:na]
at com.aerospike.client.query.QueryExecutor$QueryThread.run(QueryExecutor.java:146) ~[prod-adtech-adserver.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_151]
Caused by: java.io.EOFException: null
at com.aerospike.client.command.MultiCommand.readBytes(MultiCommand.java:223) ~[prod-adtech-adserver.jar:na]
at com.aerospike.client.command.MultiCommand.parseResult(MultiCommand.java:88) ~[prod-adtech-adserver.jar:na]
at com.aerospike.client.command.SyncCommand.execute(SyncCommand.java:84) ~[prod-adtech-adserver.jar:na]
... 5 common frames omitted
Can you explain why this is happening? It might be the root cause of the above issue.
EOFException means the server closed the connection. By default, the server will close any open connection that has been idle for 60 seconds. By default, the server will also close connections where the socket is idle for 10 seconds on a running query.
Your iteration of the record set may be backed up and no socket reads are occurring for 10 seconds. It’s important to iterate through the record set as fast as possible.
It’s also critical to catch all application errors during the query and call RecordSet.close() at the end. RecordSet.close() is required on both success and error paths.
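The two points above (drain the record set quickly, and always close it) can be sketched as follows. To keep the example runnable without a cluster, FakeRecordSet is a hypothetical stand-in that only mimics the next(), getRecord() and close() calls of the client's RecordSet. With the real client, the same try/finally shape would wrap the RecordSet returned by the query.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class DrainRecordSet {
    /** Hypothetical stand-in for the client's RecordSet (not the real class). */
    public static class FakeRecordSet implements Closeable {
        private final Iterator<String> rows;
        private String current;
        public boolean closed = false;

        public FakeRecordSet(Iterator<String> rows) {
            this.rows = rows;
        }

        /** Advances to the next row; returns false when the stream is exhausted. */
        public boolean next() {
            if (!rows.hasNext()) {
                return false;
            }
            current = rows.next();
            return true;
        }

        public String getRecord() {
            return current;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /**
     * Copies every row into memory without doing slow work inside the loop,
     * so socket reads never stall. The finally block closes the record set
     * on both the success path and the error path.
     */
    public static List<String> drain(FakeRecordSet rs) {
        List<String> out = new ArrayList<>();
        try {
            while (rs.next()) {
                out.add(rs.getRecord());
            }
            return out;
        } finally {
            rs.close();
        }
    }
}
```

Any expensive per-record processing would then run on the returned list, after the record set is already closed.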
We are also overriding the queryPolicy in another Aerospike query with the default query policy, as shown below:
` return new AerospikeSource(MomentsMarketCache.class,
Actually, we are doing 3 transactions: 1 read and 2 query operations. The socket and total timeouts for the read and one of the queries are 5 ms, and the other query uses the default query policy.
A 5ms timeout on a query is extremely aggressive. A single record get retrieves a single record from a single node. A query is issued to all nodes in the cluster and the client assembles the records into a single record set stream. There is at least an order of magnitude more work for queries than single record gets.
The server will close the connection if a query timeout occurs, so that is likely why you are getting the EOFExceptions. I suggest raising the query timeout to a value much higher than its current setting.
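One way to act on this is to give queries their own policy rather than reusing the 5 ms settings from the single-record read. The sketch below assumes the 4.x Java client, where Policy and QueryPolicy expose socketTimeout and totalTimeout in milliseconds. The query values are illustrative placeholders, not tuned recommendations.

```java
import com.aerospike.client.policy.Policy;
import com.aerospike.client.policy.QueryPolicy;

public class TimeoutPolicies {
    /** Policy for single-record gets, which touch one record on one node. */
    static Policy readPolicy() {
        Policy policy = new Policy();
        policy.socketTimeout = 5; // ms, as described in the thread
        policy.totalTimeout = 5;  // ms
        return policy;
    }

    /** Separate policy for queries, which fan out to every node in the cluster. */
    static QueryPolicy queryPolicy() {
        QueryPolicy policy = new QueryPolicy();
        policy.socketTimeout = 10_000; // ms, illustrative value only
        policy.totalTimeout = 30_000;  // ms, illustrative value only
        return policy;
    }
}
```

Passing queryPolicy() to each query call, and readPolicy() only to gets, keeps the aggressive timeout where it is appropriate.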