AsyncClient suddenly throws a lot of "Error Code -6: Command rejected" exceptions


#1

A bit of history and stats: I’ve been using Aerospike in my production environment for several months and haven’t had any failures related to the client or server (only IT issues). Naturally, my runtime scale increased over time, but nothing major that affected performance.

I’m averaging 30k batch reads per second, and 6k writes per second. I’m running a cluster of 4 EC2 nodes (r3.4xlarge instances - enhanced networking enabled).

There are 2 RT environments, “Env A” is responsible for 80% of the traffic, “Env B” is responsible for the other 20%.

“Env A” contains 10 EC2 instances (with enhanced networking enabled). “Env B” contains 4 EC2 instances (with enhanced networking enabled).

Each instance runs one application, connected via a single async Java client. Each client has a limit of 65536 (2^16) asyncMaxCommands, with MaxCommandAction.REJECT and a single selector thread. This configuration hasn’t changed since day 1.

Things that changed since day 1:

  • 80% fewer EC2 instances (e.g. from 50 to 10)
  • started working with enhanced networking
  • slight traffic increase

I started experiencing issues in the past three days. It started with the “max-fd” limit of 15000, so I increased it to 50k. After that I started getting tons of “Error Code -6: Command rejected” exceptions. I tried increasing asyncMaxCommands to 2^17, which aggravated the issue. I tried decreasing asyncMaxCommands to 5000, with the same result. I tried doing fewer reads/writes, with no change. I tried increasing the cluster node count, also with no change.

The only thing that “helped” was increasing the EC2 instance count. This is absurd, since the instances are barely utilized (5% CPU). There are no errors on the Aerospike nodes, and they are pretty “laid back” at 7% CPU.

At the moment my production instance count is at 30 (A) and 15 (B). This seems to “solve” the problem, but it is far from being a reasonable solution.

Any thoughts or ideas? Any tests I can run to find the issue?


#2

Error code -6? According to http://www.aerospike.com/docs/dev_reference/error_codes.html, this means that a write request which specifies 'BIN_CREATE_ONLY' failed because one of the bins already exists. Are you running into that in your application code?


#3

asyncMaxCommands is way too high. asyncMaxCommands is the number of concurrent commands allowed at any point in time. The value should match the number of commands that your system can handle just before reaching network bandwidth or client CPU usage limits. Each command takes resources (1 exclusive socket and 1 exclusive direct memory ByteBuffer). You want to minimize resources while maximizing throughput. If 100 asyncMaxCommands reaches network bandwidth or client CPU usage limits, why define 65000+ asyncMaxCommands that consume resources, but do not improve throughput?

“Error Code -6: Command rejected” exceptions occur when you issue a new command while asyncMaxCommands commands are already in progress.

To stabilize your system, I recommend the following client settings:

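import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import com.aerospike.client.async.AsyncClientPolicy;
import com.aerospike.client.async.MaxCommandAction;

AsyncClientPolicy policy = new AsyncClientPolicy();
policy.asyncMaxCommands = 100;                          // max concurrent async commands
policy.asyncSelectorThreads = 1;
policy.asyncSelectorTimeout = 10;
policy.asyncMaxCommandAction = MaxCommandAction.BLOCK;  // wait for a free slot instead of rejecting
// Daemon threads, so the pool does not keep the JVM alive on shutdown.
policy.asyncTaskThreadPool = Executors.newCachedThreadPool(new ThreadFactory() {
	public final Thread newThread(Runnable runnable) {
		Thread thread = new Thread(runnable);
		thread.setDaemon(true);
		return thread;
	}
});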

@pgupta The negative return codes are client specific return codes and do not correspond to those server error codes.


#4

@Brian thanks for the response.
First off, I can’t BLOCK my connections; I’m running 100% async. My asyncMaxCommandAction configuration at the moment is REJECT.

Since then I did some further testing. Here are the updates:

  1. I found out that I had a nested async call within my main communication “flow” with Aerospike. When I refactored it, I saw a small decrease in the error rate.
  2. Increasing the selector threads to 16 pretty much solved the issue (or minimized it to the point where I’m not experiencing issues at the current scale).
  3. Defining an asyncTaskThreadPool massively increased my CPU consumption (probably because my application is designed in an “actor” pattern with workers, and additional dynamic threads cause excessive context switching with the worker threads).

Let’s say each EC2 instance is handling ~65k RPS (requests per second). Each request must go through Aerospike. I can’t afford latency and timeouts. Should my asyncMaxCommands be the same as the RPS, or am I missing something here?


#5

You can use REJECT, but you must ensure that the current number of async commands in the queue is less than or equal to asyncMaxCommands. Otherwise, your command will be rejected. If you want a non-blocking, variable-sized queue, then use ACCEPT.

asyncMaxCommands is not RPS. asyncMaxCommands is the maximum number of async commands allowed in the async queue. asyncMaxCommands is only referenced for BLOCK and REJECT. asyncMaxCommands is not used when in ACCEPT mode.
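
One rough way to see why the limit is not the same number as RPS is Little's Law: the average number of commands in progress is the request rate multiplied by the average round-trip time. The sketch below is only an illustration. The class and method names are hypothetical, and the 2 ms round trip is an assumed figure, not a measurement from this thread.

```java
// Hypothetical helper for sizing estimates; not part of the Aerospike client API.
final class InFlightEstimate {

    // Little's Law: average commands in flight = arrival rate x average round-trip time.
    static long commandsInFlight(long requestsPerSecond, double roundTripMillis) {
        return (long) Math.ceil(requestsPerSecond * roundTripMillis / 1000.0);
    }

    public static void main(String[] args) {
        // Assumed latencies; measure your own round-trip time before sizing anything.
        System.out.println(commandsInFlight(65_000, 2.0));    // 130
        System.out.println(commandsInFlight(65_000, 1000.0)); // 65000
    }
}
```

Under these assumptions, 65k RPS only needs a limit near 65k if every command takes a full second. With a round trip of a few milliseconds, the average number in progress is in the low hundreds, plus whatever headroom you want for bursts.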


#6

@Brian can you explain the relation between asyncMaxCommands, the queue and selector threads?


#7

Multiple selector threads allow multiple CPUs to be utilized. This increases the processing power available to the client.

The async queue is a virtual queue that contains all async commands that are currently in progress. In reality, each selector thread has its own async queue. The client distributes these commands to different queues in round-robin fashion.

asyncMaxCommands is the max number of commands allowed in the async queue for BLOCK and REJECT modes. In BLOCK mode, the command will be blocked until a slot is available. In REJECT mode, the command will be rejected if the async queues are full. In ACCEPT mode, the async queues are unbounded, so you run the risk of running out of memory.
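
The three modes described above can be modeled with a counting semaphore. This is a minimal sketch of the behavior as described in this post, not the client's actual implementation. `CommandGate`, `tryStart` and `complete` are hypothetical names chosen for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the three actions; not the Aerospike client's implementation.
final class CommandGate {
    enum Action { ACCEPT, BLOCK, REJECT }

    private final Action action;
    private final Semaphore slots;                              // one permit per allowed command
    private final AtomicInteger inFlight = new AtomicInteger(); // commands currently in progress

    CommandGate(Action action, int maxCommands) {
        this.action = action;
        this.slots = new Semaphore(maxCommands);
    }

    // Returns true if the command may start. Only REJECT mode can return false.
    boolean tryStart() {
        switch (action) {
            case BLOCK:
                slots.acquireUninterruptibly(); // caller waits until a slot frees up
                break;
            case REJECT:
                if (!slots.tryAcquire()) {
                    return false;               // limit reached: "Command rejected"
                }
                break;
            default:
                break;                          // ACCEPT: the limit is ignored
        }
        inFlight.incrementAndGet();
        return true;
    }

    // Must be called when the command finishes, successfully or not, to free its slot.
    void complete() {
        inFlight.decrementAndGet();
        if (action != Action.ACCEPT) {
            slots.release();
        }
    }

    int inFlight() {
        return inFlight.get();
    }
}
```

In this model a slot is only returned when `complete()` runs, so anything that delays completion keeps slots occupied and makes rejections more likely at the same request rate.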


#8

In theory, let’s say that my (client call) -> (server response) -> (client callback) round trip takes exactly 1 second, I have exactly 10000 RPS, asyncMaxCommands=10001 and the selector thread count is 1.
Is it correct to assume that I will always have 1 async command to spare?

If I add another selector thread, will I have 10000 commands to spare (5000 from each selector)?


#9

Yes to the first question, no to the second.

Adding another selector thread will not increase total async queue capacity because asyncMaxCommands is the total max commands for all selector threads. You must increase asyncMaxCommands to add more capacity.

Remember that increasing asyncMaxCommands will not increase RPS if you are already running into network bandwidth limits.
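
To make the answer to the two questions concrete, here is a small single-threaded model of a shared limit with round-robin distribution across selector queues. It is a sketch under the assumptions stated in this thread (one total limit, round-robin dispatch). `SelectorPool` and its methods are hypothetical names, not client classes.

```java
// Hypothetical single-threaded model; not the Aerospike client's implementation.
final class SelectorPool {
    private final int[] perSelector; // commands in progress on each selector's queue
    private final int maxCommands;   // one limit shared by all selectors
    private int total;               // commands in progress across all selectors
    private int next;                // index of the selector that gets the next command

    SelectorPool(int selectorThreads, int maxCommands) {
        this.perSelector = new int[selectorThreads];
        this.maxCommands = maxCommands;
    }

    // REJECT behavior: refuse the command once the shared limit is reached.
    boolean submit() {
        if (total >= maxCommands) {
            return false;
        }
        perSelector[next]++;
        next = (next + 1) % perSelector.length; // round-robin to the next selector
        total++;
        return true;
    }

    int inProgress(int selector) {
        return perSelector[selector];
    }

    // Number of extra commands accepted on top of a steady in-flight load.
    static int spareSlots(int selectorThreads, int maxCommands, int inFlight) {
        SelectorPool pool = new SelectorPool(selectorThreads, maxCommands);
        for (int i = 0; i < inFlight; i++) {
            pool.submit();
        }
        int spare = 0;
        while (pool.submit()) {
            spare++;
        }
        return spare;
    }

    public static void main(String[] args) {
        System.out.println(spareSlots(1, 10_001, 10_000)); // 1
        System.out.println(spareSlots(2, 10_001, 10_000)); // 1
    }
}
```

With 10000 commands in flight and a limit of 10001, the model leaves one spare slot whether there is one selector or two. The second selector only changes how the commands are spread across queues.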