Number of connections ramps up dramatically

Hi,

We are using Aerospike extensively in our organization, and lately we are facing an issue frequently: the number of client connections to Aerospike suddenly ramps up. It reaches 15K connections and the Aerospike node stops working, so we have to restart the Aerospike service. Most of the connections come from one of our Netty servers. We upgraded our client version from 3.3.4 to 4.1.8, but we are still facing the same issue.


To mitigate the symptom, you could increase proto-fd-max to 100000.

ClientPolicy.maxConnsPerNode (default 300) defines the upper bound of connections allowed to each node. It appears that maxConnsPerNode has been set to a much higher value on your Netty server's Aerospike client.

Increasing proto-fd-max will not solve the problem. Most of the connections here are in the CLOSE_WAIT state, and they keep increasing. Anyway, I increased proto-fd-max from 15K to 30K.

I believe Brian likely identified the root cause.

We have not set any value for ClientPolicy.maxConnsPerNode, so it should be 300 per node by default. 90% of the connections are in CLOSE_WAIT, and our cluster is going down because of them. We have to restart the service frequently.

Server Version: 3.6.4, Client Version: 4.1.8

Many sockets in the CLOSE_WAIT state usually means a lot of timeouts are occurring. A client timeout forces the socket to close; if the socket weren't closed on a client timeout, data from the previous transaction could be received in the next transaction when the socket is reused.

Some ways to solve this are:

  1. Increase transaction timeout values (Policy.socketTimeout and Policy.totalTimeout), so that timeouts occur less frequently.

  2. Increase the total number of OS sockets available on your machines (see, for example, Wikitechy's tutorial on increasing the maximum number of TCP/IP connections in Linux).

  3. Add extra hardware resources (machines, network bandwidth, etc…) to increase capacity, thus alleviating bottlenecks/timeouts.

Another reason for large numbers of closed sockets is ClientPolicy.maxSocketIdle (default 55 seconds) being greater than the server's proto-fd-idle-ms (default 60 seconds). ClientPolicy.maxSocketIdle should always be less than proto-fd-idle-ms.
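For reference, here is a minimal sketch of the relevant client-side settings; the host name and the timeout values are illustrative assumptions, not recommendations for your workload:

    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.policy.ClientPolicy;

    // Illustrative values only; tune to your own latency budget and server config.
    ClientPolicy clientPolicy = new ClientPolicy();
    clientPolicy.maxConnsPerNode = 300;   // default per-node connection cap
    clientPolicy.maxSocketIdle = 55;      // seconds; keep below the server's proto-fd-idle-ms (default 60 s)
    clientPolicy.readPolicyDefault.socketTimeout = 50;   // ms; larger timeouts mean fewer forced socket closes
    clientPolicy.readPolicyDefault.totalTimeout = 200;   // ms
    AerospikeClient client = new AerospikeClient(clientPolicy, "aerospike-host", 3000);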

Thanks @Brian. Will make changes and update you.

I checked the logs from the clients. We are getting java.io.EOFException on the client side.

    c.z.a.d.m.MomentsMarketLoaderDelegate - Exception while iterating from aerospike record set for moments marketing
    com.aerospike.client.AerospikeException: Error Code -1: java.io.EOFException
        at com.aerospike.client.command.SyncCommand.execute(SyncCommand.java:139) ~[prod-adtech-adserver.jar:na]
        at com.aerospike.client.command.MultiCommand.execute(MultiCommand.java:75) ~[prod-adtech-adserver.jar:na]
        at com.aerospike.client.query.QueryExecutor$QueryThread.run(QueryExecutor.java:146) ~[prod-adtech-adserver.jar:na]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_151]
        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_151]
    Caused by: java.io.EOFException: null
        at com.aerospike.client.command.MultiCommand.readBytes(MultiCommand.java:223) ~[prod-adtech-adserver.jar:na]
        at com.aerospike.client.command.MultiCommand.parseResult(MultiCommand.java:88) ~[prod-adtech-adserver.jar:na]
        at com.aerospike.client.command.SyncCommand.execute(SyncCommand.java:84) ~[prod-adtech-adserver.jar:na]
        ... 5 common frames omitted

Can you explain why this is happening? It might be the root cause of the above issue.

EOFException means the server closed the connection. By default, the server will close any open connection that has been idle for 60 seconds. By default, the server will also close connections where the socket is idle for 10 seconds on a running query.

Your iteration of the record set may be backed up and no socket reads are occurring for 10 seconds. It’s important to iterate through the record set as fast as possible.

It’s also critical to catch all application errors during the query and call RecordSet.close() at the end. RecordSet.close() is required on both success and error paths.
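For example, here is a minimal sketch of that pattern (assuming client, queryPolicy, and statement are already defined in your code):

    import com.aerospike.client.Record;
    import com.aerospike.client.query.RecordSet;

    // Sketch only: 'client', 'queryPolicy' and 'statement' come from your own code.
    RecordSet rs = client.query(queryPolicy, statement);
    try {
        // Drain the stream as quickly as possible; a slow consumer can leave the
        // socket idle long enough for the server to close it mid-query.
        while (rs.next()) {
            Record record = rs.getRecord();
            // process the record ...
        }
    }
    finally {
        // Close on both success and error paths.
        rs.close();
    }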

Another question: what are the socketTimeout and totalTimeout values set on that query?

@Brian We are setting 5 ms for the socket and total timeouts:

    ClientPolicy clientPolicy = new ClientPolicy();
    clientPolicy.timeout = config.getInt("Aerospike.connectionTimeout");

    Policy policy = new Policy();
    policy.totalTimeout = config.getInt("Aerospike.totalTimeout",5);
    policy.socketTimeout = policy.totalTimeout;
    clientPolicy.readPolicyDefault = policy;

    QueryPolicy queryPolicy = new QueryPolicy();
    queryPolicy.totalTimeout = config.getInt("Aerospike.Freqcap.totalTimeout", 5);
    queryPolicy.socketTimeout = queryPolicy.totalTimeout;
    queryPolicy.recordQueueSize = config.getInt("Aerospike.Freqcap.RecordQueueSize", 65535);
    clientPolicy.queryPolicyDefault = queryPolicy;

    return clientPolicy;

and we are overriding the queryPolicy in another Aerospike query with the default query policy, like below:

    return new AerospikeSource(MomentsMarketCache.class,
            statement,
            config.getBoolean("Aerospike.MomentsMarket.isUpdatable"),
            momentsMarketLoaderDelegate,
            new QueryPolicy(),
            config.getInt("Aerospike.MomentsMarket.refreshTimeInterval"));

Actually, we are doing 3 transactions: 1 read and 2 query operations. The socket and total timeout for the read and one of the query operations is 5 ms, and the other query operation uses the default query policy.

A 5ms timeout on a query is extremely aggressive. A single record get retrieves a single record from a single node. A query is issued to all nodes in the cluster and the client assembles the records into a single record set stream. There is at least an order of magnitude more work for queries than single record gets.

The server will close the connection if a query timeout occurs, so that is likely why you are getting the EOFExceptions. I suggest raising the query timeout to a value much higher than it is currently set.
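As a rough sketch, reusing the clientPolicy from your snippet above (the numbers are illustrative assumptions, not tuned recommendations):

    import com.aerospike.client.policy.QueryPolicy;

    // Illustrative values only; size these to how long a full fan-out query
    // across all cluster nodes can reasonably take.
    QueryPolicy queryPolicy = new QueryPolicy();
    queryPolicy.socketTimeout = 1000;   // ms; currently effectively 5 ms
    queryPolicy.totalTimeout = 5000;    // ms; overall budget for the whole query
    clientPolicy.queryPolicyDefault = queryPolicy;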