Number of connections ramps up dramatically

Hi,

We are using Aerospike extensively in our organization, and lately we are facing an issue frequently: the number of client connections to Aerospike suddenly ramps up. It reaches 15K connections and the Aerospike node stops working, so we have to restart the Aerospike service. Most of the connections come from one of our Netty servers. We upgraded our client version from 3.3.4 to 4.1.8, but we are still facing the same issue.


To mitigate the symptom, you could increase proto-fd-max to 100000.

ClientPolicy.maxConnsPerNode (default 300) defines the upper bound of connections allowed to each node. It appears that maxConnsPerNode has been set to a much higher value on your Netty server's Aerospike client.

Increasing proto-fd-max will not solve the problem. Most of the connections here are in the CLOSE_WAIT state, and they keep increasing. Anyway, I increased proto-fd-max from 15K to 30K.

I believe Brian likely identified the root cause.

We have not set any value for ClientPolicy.maxConnsPerNode, so it should be 300 per node by default. 90% of the connections are in CLOSE_WAIT, and our cluster is going down because of them. We have to restart the service frequently.

Server Version: 3.6.4, Client Version: 4.1.8

Many sockets in the CLOSE_WAIT state usually means a lot of timeouts are occurring. A client timeout forces the socket to close; if the socket weren't closed on a client timeout, data from the previous transaction could be received in the next transaction when the socket is reused.

Some ways to solve this are:

  1. Increase transaction timeout values (Policy.socketTimeout and Policy.totalTimeout), so that timeouts occur less frequently.

  2. Increase the total number of OS sockets available on your machines (see, for example, Wikitechy's tutorial on increasing the maximum number of TCP/IP connections in Linux).

  3. Add extra hardware resources (machines, network bandwidth, etc…) to increase capacity, thus alleviating bottlenecks/timeouts.

Another reason for large numbers of closed sockets is ClientPolicy.maxSocketIdle (default 55 seconds) being greater than the server's proto-fd-idle-ms (default 60 seconds). ClientPolicy.maxSocketIdle should always be less than proto-fd-idle-ms.
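For reference, here is a minimal sketch of the relevant client-side settings; the host name and the timeout values are illustrative assumptions, not recommendations for your workload:

    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.policy.ClientPolicy;

    // Illustrative values only; tune to your own latency budget and server config.
    ClientPolicy clientPolicy = new ClientPolicy();
    clientPolicy.maxConnsPerNode = 300;   // default per-node connection cap
    clientPolicy.maxSocketIdle = 55;      // seconds; keep below the server's proto-fd-idle-ms (default 60 s)
    clientPolicy.readPolicyDefault.socketTimeout = 50;   // ms; larger timeouts mean fewer forced socket closes
    clientPolicy.readPolicyDefault.totalTimeout = 200;   // ms
    AerospikeClient client = new AerospikeClient(clientPolicy, "aerospike-host", 3000);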

Thanks @Brian. Will make changes and update you.

I checked the logs from the clients. We are getting java.io.EOFException on the client side.

    c.z.a.d.m.MomentsMarketLoaderDelegate - Exception while iterating from aerospike record set for moments marketing
    com.aerospike.client.AerospikeException: Error Code -1: java.io.EOFException
        at com.aerospike.client.command.SyncCommand.execute(SyncCommand.java:139) ~[prod-adtech-adserver.jar:na]
        at com.aerospike.client.command.MultiCommand.execute(MultiCommand.java:75) ~[prod-adtech-adserver.jar:na]
        at com.aerospike.client.query.QueryExecutor$QueryThread.run(QueryExecutor.java:146) ~[prod-adtech-adserver.jar:na]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_151]
        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_151]
    Caused by: java.io.EOFException: null
        at com.aerospike.client.command.MultiCommand.readBytes(MultiCommand.java:223) ~[prod-adtech-adserver.jar:na]
        at com.aerospike.client.command.MultiCommand.parseResult(MultiCommand.java:88) ~[prod-adtech-adserver.jar:na]
        at com.aerospike.client.command.SyncCommand.execute(SyncCommand.java:84) ~[prod-adtech-adserver.jar:na]
        ... 5 common frames omitted

Can you explain why this is happening? It might be the root cause of the above issue.

EOFException means the server closed the connection. By default, the server will close any open connection that has been idle for 60 seconds. By default, the server will also close connections where the socket is idle for 10 seconds on a running query.

Your iteration of the record set may be backed up and no socket reads are occurring for 10 seconds. It’s important to iterate through the record set as fast as possible.

It’s also critical to catch all application errors during the query and call RecordSet.close() at the end. RecordSet.close() is required on both success and error paths.
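For example, here is a minimal sketch of that pattern (assuming client, queryPolicy, and statement are already defined in your code):

    import com.aerospike.client.Record;
    import com.aerospike.client.query.RecordSet;

    // Sketch only: 'client', 'queryPolicy' and 'statement' come from your own code.
    RecordSet rs = client.query(queryPolicy, statement);
    try {
        // Drain the stream as quickly as possible; a slow consumer can leave the
        // socket idle long enough for the server to close it mid-query.
        while (rs.next()) {
            Record record = rs.getRecord();
            // process the record ...
        }
    }
    finally {
        // Close on both success and error paths.
        rs.close();
    }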

Another question: what are the socketTimeout and totalTimeout values set on that query?

@Brian We are setting 5 ms for the socket and total timeouts:

    ClientPolicy clientPolicy = new ClientPolicy();
    clientPolicy.timeout = config.getInt("Aerospike.connectionTimeout");

    Policy policy = new Policy();
    policy.totalTimeout = config.getInt("Aerospike.totalTimeout",5);
    policy.socketTimeout = policy.totalTimeout;
    clientPolicy.readPolicyDefault = policy;

    QueryPolicy queryPolicy = new QueryPolicy();
    queryPolicy.totalTimeout = config.getInt("Aerospike.Freqcap.totalTimeout", 5);
    queryPolicy.socketTimeout = queryPolicy.totalTimeout;
    queryPolicy.recordQueueSize = config.getInt("Aerospike.Freqcap.RecordQueueSize", 65535);
    clientPolicy.queryPolicyDefault = queryPolicy;

    return clientPolicy;

and we are overriding the queryPolicy in another Aerospike query with the default query policy, like below:

    return new AerospikeSource(MomentsMarketCache.class,
            statement,
            config.getBoolean("Aerospike.MomentsMarket.isUpdatable"),
            momentsMarketLoaderDelegate,
            new QueryPolicy(),
            config.getInt("Aerospike.MomentsMarket.refreshTimeInterval"));

Actually, we are doing 3 transactions: 1 read and 2 query operations. The socket and total timeout for the read and one of the query operations is 5 ms, and the other query operation uses the default query policy.

A 5ms timeout on a query is extremely aggressive. A single record get retrieves a single record from a single node. A query is issued to all nodes in the cluster and the client assembles the records into a single record set stream. There is at least an order of magnitude more work for queries than single record gets.

The server will close the connection if a query timeout occurs, so that is likely why you are getting the EOFExceptions. I suggest raising the query timeout to a value much higher than it is currently set.
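As a rough sketch, reusing the clientPolicy from your snippet above (the numbers are illustrative assumptions, not tuned recommendations):

    import com.aerospike.client.policy.QueryPolicy;

    // Illustrative values only; size these to how long a full fan-out query
    // across all cluster nodes can reasonably take.
    QueryPolicy queryPolicy = new QueryPolicy();
    queryPolicy.socketTimeout = 1000;   // ms; currently effectively 5 ms
    queryPolicy.totalTimeout = 5000;    // ms; overall budget for the whole query
    clientPolicy.queryPolicyDefault = queryPolicy;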