EAGAIN errors in Batch responses

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Background

After the server collects a response for a batch request, it must send that response to a waiting client.

In the non-batch path, Aerospike generally uses a thread per transaction response. In the batch subsystem, however, the server collects a number of different record responses into a response queue. Each response queue has its own batch-index-thread, which sends responses out as buffers fill up.

In the Unix sockets API, with TCP, a send() call writes into an underlying kernel send buffer. This buffer is commonly called “wmem” and is controlled with sysctl. Some of the key variables are net.core.wmem_max and net.core.wmem_default, along with net.ipv4.tcp_wmem (and the related rmem parameters for the receive side). See man tcp for more information.
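
As a quick check, the current limits can be inspected with sysctl (the parameter names are standard Linux; the values reported depend on kernel and distribution):

sysctl net.core.wmem_max net.core.wmem_default
sysctl net.core.rmem_max net.core.rmem_default
sysctl net.ipv4.tcp_wmem net.ipv4.tcp_rmem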

The buffer in question is not to be confused with an Ethernet driver’s transmit or receive queues. Those queues govern the number of packets outstanding between the network device and the kernel, and have their own tuning rules.

With TCP, the userspace program calls the kernel with bytes to send. The kernel has only so much buffer, so it copies as much of the data as fits into the kernel buffer. Space in that buffer is only freed when the remote side sends a TCP ACK indicating that the remote kernel (not the application) has received the data.

Default sizes for these send (“wmem”) and receive (“rmem”) buffers are typically between 32K and 64K per connection. Some guides propose increasing these values to 1M. Larger buffers mean more kernel memory per connection, which adds up on servers with many open TCP connections, so a happy medium must be considered.
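
For illustration only, such guides typically raise the maximums with sysctl along these lines (the right values depend on connection count and available memory, and any change you keep should also be made persistent in /etc/sysctl.conf):

sysctl -w net.core.wmem_max=1048576
sysctl -w net.core.rmem_max=1048576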

Description

With Aerospike, one slow client must not be allowed to impact other clients.

If the network response is small - smaller than the wmem buffer - the transmitting thread on the server side writes the whole response, filling that buffer, and lets the kernel do the repeated interrupt work of shipping the data. Only when that client, on that TCP connection, has received the data will the connection be used for another transaction.

If the network response is large - say, over 64K - then the send() call will either block (on a blocking socket), or, on a non-blocking socket, copy only as much of the data as fits and signal EAGAIN once the kernel buffer is full, indicating that the data was only partially consumed.

With single-record responses in Aerospike, the thread is responsible for filling the buffer, and retries until the whole response has been sent. For a large response, this may take multiple network round trips. This consumes a transaction thread, but only one. If you have a substantial number of large responses, and/or some slow clients, you may need to increase the number of these response threads (through transaction-threads-per-queue and transaction-queues). However, other than the consumed transaction processing threads, a slow client has no impact on other fast-moving clients.
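
For reference, these parameters live in the service context of aerospike.conf; the values below are arbitrary examples, and parameter availability and defaults depend on the server version:

service {
    transaction-queues 8
    transaction-threads-per-queue 4
}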

With multi-record responses (batch reads), the situation is more complicated. Batch responses are usually much larger than the wmem buffer. Aerospike has a number of batch response threads (batch-index-threads). A given batch request is assigned to a batch response thread and, when many keys are involved, the data is collected in parallel with sending the response, in order to reduce latency.
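
Similarly, the number of batch response threads is set in the service context (the value shown is only an example; the appropriate setting depends on the server version and CPU count):

service {
    batch-index-threads 8
}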

In versions of Aerospike prior to 4.1, the batch response code would repeatedly attempt to send a 64K block before it could move on to the next 64K block destined for another TCP connection. This allowed a slow client - one not sending ACKs in a timely fashion - to increase latency for all the other connections assigned to that batch response thread. Increasing the number of batch response threads helps, but does not solve the problem.

In versions 4.1 and later, the batch response thread attempts a non-blocking send and, on receiving EWOULDBLOCK (indicating that the underlying buffer is full and the send was only partially consumed), moves on to the next connection. This removes the cases where a lack of threads, combined with a slow client, increases latency for other batch responses.
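
As an illustration of the pattern only (not the server's actual C implementation), the following hypothetical Java NIO sketch shows a single response thread that writes as much as each non-blocking socket will accept and, when a connection's kernel buffer is full, moves on to the next writable connection instead of stalling:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public final class NonBlockingResponseSender {
    // Each registered SocketChannel carries its pending response bytes as the
    // SelectionKey attachment. One thread services many connections.
    public static void pump(Selector selector) throws IOException {
        while (selector.select() > 0) {
            var keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (!key.isWritable()) {
                    continue;
                }
                SocketChannel channel = (SocketChannel) key.channel();
                ByteBuffer pending = (ByteBuffer) key.attachment();
                // On a non-blocking channel, write() copies what fits into the
                // kernel send buffer and returns; at the C level this is where
                // send() would report EAGAIN/EWOULDBLOCK.
                channel.write(pending);
                if (pending.hasRemaining()) {
                    // Kernel buffer full for this (possibly slow) client: leave
                    // OP_WRITE registered and move on to the next connection.
                    continue;
                }
                // Whole response sent: stop watching this connection for writes.
                key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE);
            }
        }
    }
}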

Solution

Two numbers bear monitoring: the “abandoned” log messages, and the EAGAIN statistic (batch_index_delay).

An abandoned log line will look similar to this one:

abandoned batch from 11.22.33.44 with 23 transactions after 30000 ms.

Refer to the server log messages reference manual for further details.

As the different connections are retried repeatedly (using epoll()), the EAGAIN statistic (batch_index_delay) only shows that a particular client’s response is taking a long time to send.
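
These statistics can be read from the server, for example with asinfo (the exact statistic names and output format depend on the server version):

asinfo -v 'statistics' -l | grep batch_index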

The abandoned log message (which also increments batch_index_error) shows that a given batch send has timed out. Each batch request has a timeout, and when that time is reached (to be precise, twice that time, or 30 seconds if no timeout is set), the server will not waste further time collecting responses and sending them back; instead, it terminates the connection and releases the internal memory structures for the responses.

Such abandonment can have multiple causes. Network issues may prevent the many packets from being sent and the ACKs from being received. The remote machine (the client, in this case) may not have the processing power to receive the response, since for long responses the data must actually be delivered to the application: the receiver’s “rmem” buffer fills, the application is notified and reads the data, which frees memory in the rmem buffer and allows the TCP window to re-open.

Diagnosing this kind of condition can be done with several tools.

A network packet analyser can show whether the TCP window reaches a fully closed state, and how long a closed window persists. If the period is very long, the client likely has some kind of failure (Java garbage collection pauses or an application-level bug) preventing the server from sending. There may also be excessive retransmissions slowing network responses, which are only observed when the Aerospike server sends to the client, since Aerospike may send a large number of back-to-back Ethernet packets, flooding driver packet buffers or switch buffers.
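
For example, with a packet capture taken on either side, tshark can flag both conditions (the display filter names assume a reasonably recent Wireshark release, and the capture file name is just a placeholder):

tshark -r batch_capture.pcap -Y "tcp.analysis.zero_window or tcp.analysis.retransmission"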

A quick diagnosis can be done with netstat to check whether any Send-Q or Recv-Q backlog is building up for a particular IP address. If the client is overloaded (i.e. high CPU), you would see a high Send-Q on the server for that client’s IP, and a high Recv-Q on the client machine. This indicates that the client host is running low on resources and there is a bottleneck processing the network receive queue on the client. The next step would be to check CPU usage on the client.

Example netstat output:

Server:

netstat -pant|egrep 'ESTA|Send'
(No info could be read for "-p": geteuid()=1000 but you should be root.)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0 305016 1.1.1.102:3000          1.1.1.101:49028         ESTABLISHED -                   
 
Client:

netstat -pant|egrep 'ESTA|Send'
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp6  263700      0 1.1.1.101:49028         1.1.1.102:3000          ESTABLISHED 5402/java           

A simple solution is to increase the BatchPolicy timeout.
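
For example, with the Java client, a sketch might look like the following (the timeout values are arbitrary, and the field names assume a reasonably recent 4.x or later client):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.BatchPolicy;

public final class BatchTimeoutExample {
    // Hypothetical values; tune them to your network, record sizes and SLAs.
    public static Record[] readBatch(AerospikeClient client, Key[] keys) {
        BatchPolicy policy = new BatchPolicy();
        policy.totalTimeout = 10000;  // overall budget for the batch, in milliseconds
        policy.socketTimeout = 3000;  // per-attempt socket idle timeout, in milliseconds
        return client.get(policy, keys);
    }
}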

Keywords

EAGAIN BATCH TCP wmem rmem wmem_max rmem_max

Timestamp

October 2020