Connection problems (Error Code 11)

tx_user · April 6, 2015, 2:36pm

Hi,

We have the error: Client timeout: timeout=0 iterations=2 failedNodes=0 failedConns=2

DEBUG: Node *** ...:3000: Error Code 11: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

The error occurs when reading from a secondary index for a long time (> 5-10 minutes). During this time, the CPU load increases by about 15-20%.

Cluster has 13 namespaces, 12 indexes (2 text, 10 numeric).

Client: C#/.NET on Windows Server 2008 R2, 11 clients. Server: CentOS 6, AS 3.5.4, 3 nodes.

We do not see any low-level errors on the client or on the server (in messages, aerospike.log, other logs, limits…).

What can you advise? What could be the reason?

kporter · April 6, 2015, 5:15pm

One possible reason is that you have reached your proto-fd-max. You can see your current socket usage by grepping for trans_in_progress:

grep trans_in_progress /var/log/aerospike/aerospike.log | tail

Another possibility is that your network capacity has been exceeded. One way to check this is by running:

sar -n DEV

The output is in kibibytes per second and your network is probably limited to some number of gigabits per second, so you will need to do the conversion to see if you have exceeded your network's limit.
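The conversion mentioned above can be sketched in a few lines of shell. This is only an illustration: the function names are made up for this example, the 1 Gbit/s link capacity is an assumption (substitute your own), and it assumes sar reports kB as 1024 bytes. The sample reading of 2658.85 txkB/s is the figure reported later in this thread. If your locale prints decimal commas, replace them with dots before passing the value in.

```shell
# Convert a sar throughput reading (kB/s, assuming 1 kB = 1024 bytes) to Gbit/s.
kb_per_s_to_gbit() {
    LC_ALL=C awk -v kb="$1" 'BEGIN { printf "%.4f\n", kb * 1024 * 8 / 1e9 }'
}

# Express the same reading as a percentage of a link capacity given in Gbit/s.
link_utilization_pct() {
    LC_ALL=C awk -v kb="$1" -v cap="$2" \
        'BEGIN { printf "%.1f\n", kb * 1024 * 8 / (cap * 1e9) * 100 }'
}

kb_per_s_to_gbit 2658.85          # throughput in Gbit/s
link_utilization_pct 2658.85 1    # percent of an assumed 1 Gbit/s link
```

A reading in the low single-digit percent range would suggest that raw bandwidth is not the bottleneck, although short bursts can be hidden by sar's averaging interval.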

tx_user · April 6, 2015, 8:03pm

Socket usage:

trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (1552, 100982, 99430) : hb (0, 0, 0) : fab (29, 112, 83)

Some queues sometimes rise to 1.
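To compare the log line against proto-fd-max, a rough sketch like the following pulls out the first number of the proto triple. It assumes that number is the count of currently open client connections, which fits the line above (100982 - 99430 = 1552). The helper name and the limit of 15000 are assumptions for illustration; use the proto-fd-max value from your own aerospike.conf.

```shell
# Extract the first number of the "fds - proto (...)" triple from stdin.
# Assumption: that number is the count of currently open client connections.
proto_fds_open() {
    sed -n 's/.*fds - proto (\([0-9]*\),.*/\1/p'
}

line='trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (1552, 100982, 99430) : hb (0, 0, 0) : fab (29, 112, 83)'
open_fds=$(printf '%s\n' "$line" | proto_fds_open)
limit=15000   # assumed proto-fd-max; check the value in your aerospike.conf

if [ "$open_fds" -ge "$limit" ]; then
    echo "proto fds at or above limit: $open_fds / $limit"
else
    echo "proto fds below limit: $open_fds / $limit"
fi
```

In practice the same function can be fed from the grep shown earlier, for example by piping `grep trans_in_progress /var/log/aerospike/aerospike.log | tail -1` into it.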

Average network usage:

rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
10257,38   8817,81   1308,67   2658,85      0,00      0,00     42,74

Everything seems normal. Are there any other ideas?

tx_user · April 9, 2015, 7:32am

Tcpdump shows a lot of resets from the server running AS.

At DEBUG level we found messages like:

DEBUG (fabric): (fabric.c:fabric_worker_fn:1476) epoll : error, will close: fb 0x7f9237d8c008 fd 739 errno 115

tx_user · April 10, 2015, 2:48pm

Maybe you have a special recommendation on setting operating system limits? Especially if we have a lot of requests to the secondary indexes.

raj · April 13, 2015, 11:33pm

Hello,

I have a theory. Can you send me the client code enclosing the query call to Aerospike, with a few lines before and after, for validation?

– R

tx_user · April 14, 2015, 12:24pm

// Build a secondary-index query statement.
var statement = new Statement();
statement.SetNamespace(ns);
statement.SetSetName(set);
statement.SetIndexName(index);
statement.SetBinNames(binName);
statement.SetFilters(filter);

var result = new List<T>();
RecordSet rs = _client.Query(null, statement);
try
{
    // Drain the record set; each record is deserialized as it arrives.
    while (rs.Next())
    {
        var instance = CreateInstanceFromRecord(rs.Record); // simple deserialize
        if (instance != null)
        {
            result.Add(instance);
        }
    }
}
finally
{
    // Always release the query's resources, even if deserialization throws.
    rs.Close();
}

return result;

The query runs from start to end without problems. But simple get/set requests from other applications (whether on the same server or on other servers) give errors only during this time, even in other namespaces. Essentially, we did everything by the manual.

raj · April 14, 2015, 4:14pm

I think you are experiencing lock-ups when long-running queries are serving slow clients. Up until 3.5.8, we did network I/O of intermediate result buffers under the object lock, which would block all concurrent reads and writes if the client was not consuming results at a good rate.

Try out 3.5.8; it should solve the problem.

– R

raj · April 14, 2015, 4:14pm

You can download it from the Aerospike Downloads page.

tx_user · April 15, 2015, 6:31am

Thank you. We’ll try it and let you know about results.

tx_user · April 17, 2015, 6:33am

Hi all,

Problem solved with version 3.5.8. Performance is excellent.

Thank you very much.

bbulkow · August 10, 2018, 1:04am

With the advent of Aerospike consistency, this same error is returned if the roster is not set, or in some cases where a partition is unavailable. The client realizes it has no node to send to, and thus returns this error.

If you see this error with consistency enabled, make sure to check your roster and your unavailable partitions.
