Connection problems (Error Code 11)


#1

Hi,

We are getting the following error: Client timeout: timeout=0 iterations=2 failedNodes=0 failedConns=2

DEBUG: Node *** ...:3000: Error Code 11: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

The error occurs while a long-running read from a secondary index is in progress (more than 5-10 minutes). During that time, CPU load increases by about 15-20%.

The cluster has 13 namespaces and 12 indexes (2 text, 10 numeric).

Client: C#/.NET on Windows Server 2008 R2, 11 clients. Server: CentOS 6, AS 3.5.4, 3 nodes.

We do not see any low-level errors on the client or on the server (in messages, aerospike.log, other logs, limits…).

What can you advise? What could be the reason?


#2

One possible reason is that you have reached your proto-fd-max. You can see your current socket usage by grepping for trans_in_progress:

grep trans_in_progress /var/log/aerospike/aerospike.log | tail

Another possibility is that your network capacity has been exceeded. One way to check this is to run:

sar -n DEV

The output is in kibibytes per second, while your network is probably limited to some number of gigabits per second, so you will need to convert between the two to see whether you have exceeded your network's limit.


#3

Socket usage:

trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (1552, 100982, 99430) : hb (0, 0, 0) : fab (29, 112, 83)

Some queues occasionally rise to 1.
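
To compare the socket counts above against proto-fd-max, the line can be parsed as in the rough sketch below. It assumes each triple is (currently open, opened in total, closed in total), which is consistent with the proto numbers here (100982 - 99430 = 1552). The `parse_fds` helper and the `PROTO_FD_MAX` value of 15000 are illustrative assumptions, not values taken from this cluster; read the real limit from your aerospike.conf.

```python
import re

# Log line copied from the post above.
LINE = ("trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 "
        ": fds - proto (1552, 100982, 99430) : hb (0, 0, 0) : fab (29, 112, 83)")

# Assumed limit for illustration only; check proto-fd-max in your own config.
PROTO_FD_MAX = 15000


def parse_fds(line: str) -> dict:
    """Return {"proto": (a, b, c), "hb": (...), "fab": (...)} from a log line."""
    triples = re.findall(r"(proto|hb|fab) \((\d+), (\d+), (\d+)\)", line)
    return {name: (int(a), int(b), int(c)) for name, a, b, c in triples}


fds = parse_fds(LINE)
open_now, opened, closed = fds["proto"]

# Sanity check of the assumed field meaning: open = opened - closed.
print("consistent:", opened - closed == open_now)
print(f"client sockets in use: {open_now} of {PROTO_FD_MAX} "
      f"({100.0 * open_now / PROTO_FD_MAX:.1f}%)")
```

With the assumed limit, the 1552 open client sockets are far from the ceiling, so under that assumption proto-fd-max does not look like the cause.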

Average network usage:

rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
10257,38   8817,81   1308,67   2658,85      0,00      0,00     42,74
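
The conversion suggested in #2 can be sketched as follows, using the rxkB/s and txkB/s averages from the table above. It assumes sar's "kB" means 1024 bytes, as stated in #2, and a 1 Gbit/s link; the link speed is a hypothetical value, since the thread does not say what the NICs are. The function name is also just illustrative.

```python
def sar_kb_to_mbit(value: str) -> float:
    """Convert a sar kB/s figure (as printed) to megabits per second."""
    # The sar output above uses a decimal comma, so normalise it first.
    kb_per_s = float(value.replace(",", "."))
    bits_per_s = kb_per_s * 1024 * 8  # assumption: 1 kB = 1024 bytes
    return bits_per_s / 1_000_000


LINK_MBIT = 1000.0  # hypothetical 1 Gbit/s link

rx_mbit = sar_kb_to_mbit("1308,67")
tx_mbit = sar_kb_to_mbit("2658,85")

print(f"rx: {rx_mbit:.2f} Mbit/s ({100 * rx_mbit / LINK_MBIT:.1f}% of link)")
print(f"tx: {tx_mbit:.2f} Mbit/s ({100 * tx_mbit / LINK_MBIT:.1f}% of link)")
```

That works out to roughly 10.7 Mbit/s received and 21.8 Mbit/s transmitted, a small fraction of a gigabit link if the assumed link speed holds.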

Everything looks normal. Are there any other ideas?


#4

tcpdump shows a lot of resets coming from the server running AS.

At DEBUG level we found messages like:

DEBUG (fabric): (fabric.c:fabric_worker_fn:1476) epoll : error, will close: fb 0x7f9237d8c008 fd 739 errno 115

#5

Do you have any specific recommendations for setting operating system limits, especially when there are a lot of requests to secondary indexes?


#6

Hello,

I have a theory. Can you send me the client code surrounding the query call to Aerospike (a few lines before and after) so I can validate it?

– R


#7

var statement = new Statement();
statement.SetNamespace(ns);
statement.SetSetName(set);
statement.SetIndexName(index);
statement.SetBinNames(binName);
statement.SetFilters(filter);

var result = new List<T>();
RecordSet rs = _client.Query(null, statement); // null = default query policy
try
{
    while (rs.Next())
    {
        var instance = CreateInstanceFromRecord(rs.Record); // simple deserialize
        if (instance != null)
        {
            result.Add(instance);
        }
    }
}
finally
{
    rs.Close(); // release the query even if deserialization throws
}

return result;

The query runs from start to finish without problems. But simple get/set requests from other applications (whether on the same server or on other servers) fail with errors only while the query is running, even in other namespaces. Essentially, we did everything by the manual.


#8

I think you are seeing the lock-up that occurs when long-running queries are served to slow clients. Up until 3.5.8, we did network I/O of intermediate result buffers under the object lock, which would block all concurrent reads and writes if the client was not consuming results at a good rate.
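
The effect can be shown with a toy model. This is only a sketch of the general pattern, not Aerospike's server code: the lock, the 0.2 second "slow client" send, and the `reader_wait` function are all hypothetical. It measures how long a concurrent reader waits for a lock when a query thread sends results while holding it, compared with copying the results under the lock and sending after release.

```python
import threading
import time


def reader_wait(send_under_lock: bool, slow_send_s: float = 0.2) -> float:
    """Return how long a concurrent reader waits for the shared lock."""
    lock = threading.Lock()
    holding = threading.Event()

    def query_worker() -> None:
        if send_under_lock:
            with lock:
                holding.set()
                time.sleep(slow_send_s)  # slow client: send while still locked
        else:
            with lock:
                results = b"intermediate results"  # copy under the lock only
                holding.set()
            time.sleep(slow_send_s)  # send the copy after releasing the lock

    worker = threading.Thread(target=query_worker)
    worker.start()
    holding.wait()  # make sure the query thread took the lock first

    start = time.monotonic()
    with lock:  # stands in for a simple get/set on the same object
        waited = time.monotonic() - start
    worker.join()
    return waited


print(f"send under lock:    reader waited {reader_wait(True):.3f} s")
print(f"send after release: reader waited {reader_wait(False):.3f} s")
```

In the first case the reader is stalled for about as long as the slow send takes; in the second it gets the lock almost immediately, which matches the symptom of unrelated get/set calls timing out only while the query is running.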

Try out 3.5.8; it should solve the problem.

– R


#9

You can download it from http://www.aerospike.com/download/server/3.5.8/


#10

Thank you. We’ll try it and let you know the results.


#11

Hi all,

Problem solved with version 3.5.8. Performance is excellent.

Thank you very much.


#12

With the advent of Aerospike strong consistency, this same error is also returned if the roster is not set, or in some cases where a partition is unavailable. The client realizes it has no node to send the request to, and thus returns this error.

If you see this error with strong consistency enabled, make sure to check your roster and your unavailable partitions.