A bit of history and stats: I’ve been using Aerospike in my production environment for several months and haven’t had any failures related to the client or server (only IT issues). Naturally, my runtime scale increased over time, but nothing major that affected performance.
I’m averaging 30k batch reads per second, and 6k writes per second. I’m running a cluster of 4 EC2 nodes (r3.4xlarge instances - enhanced networking enabled).
There are 2 RT environments, “Env A” is responsible for 80% of the traffic, “Env B” is responsible for the other 20%.
“Env A” contains 10 EC2 instances (with enhanced networking enabled). “Env B” contains 4 EC2 instances (with enhanced networking enabled).
Each instance runs one application, connected via a single async Java client. Each client has an asyncMaxCommands limit of 65536 (2^16), with MaxCommandAction.REJECT and a single selector thread. This configuration hasn’t changed since day 1.
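To make the REJECT behaviour concrete, here is a minimal, self-contained sketch of how a cap on in-flight commands with a reject policy can behave. It is a simplified model for illustration only, not Aerospike client code: the class `RejectingLimiter` and its methods are hypothetical names, and the real client's internals may differ.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of an in-flight command cap with a "reject" policy (not Aerospike code). */
public class RejectingLimiter {
    private final int maxCommands;
    private final AtomicInteger inFlight = new AtomicInteger();

    public RejectingLimiter(int maxCommands) {
        this.maxCommands = maxCommands;
    }

    /** Reserves a slot for a new command; returns false (rejected) when the cap is reached. */
    public boolean tryBegin() {
        while (true) {
            int current = inFlight.get();
            if (current >= maxCommands) {
                return false; // analogous to a "Command rejected" error
            }
            if (inFlight.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Releases a slot once a command has completed (or failed). */
    public void complete() {
        inFlight.decrementAndGet();
    }

    public int inFlight() {
        return inFlight.get();
    }

    public static void main(String[] args) {
        RejectingLimiter limiter = new RejectingLimiter(1 << 16); // 65536, as in my setup
        int rejected = 0;
        // Issue more commands than the cap without completing any of them.
        for (int i = 0; i < (1 << 16) + 10; i++) {
            if (!limiter.tryBegin()) {
                rejected++;
            }
        }
        System.out.println("in flight=" + limiter.inFlight() + ", rejected=" + rejected);
    }
}
```

In this model, rejections only appear when commands are issued faster than they complete for long enough to fill every slot, so slots that are never released would produce the same symptom as genuine overload.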
Things that changed since day 1:
- 80% fewer EC2 instances (e.g. from 50 to 10)
- started working with enhanced networking
- slight traffic increase
I started experiencing issues in the past three days. It began with hitting the “max-fd” limit of 15000, so I increased it to 50k. After that I started getting tons of “Error Code -6: Command rejected” exceptions. I tried increasing asyncMaxCommands to 2^17 - that aggravated the issue. I tried decreasing asyncMaxCommands to 5000 - same result. I tried doing fewer reads/writes - no change. I tried increasing the cluster node count - no change.
The only thing that “helped” was increasing the EC2 instance count. This is absurd, since the instances are barely utilized (5% CPU). There are no errors on the Aerospike nodes, and they are pretty “laid back” - 7% CPU.
At the moment my production instance count is at 30 (A) and 15 (B). This seems to “solve” the problem, but it is far from being a reasonable solution.
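As a rough sanity check on these numbers, the sketch below applies Little's law (commands in flight = arrival rate × mean latency) to the figures above. It assumes traffic is spread evenly across the instances of each environment and that each batch read or write counts as a single command against asyncMaxCommands. Both are simplifying assumptions; if a batch fans out into one command per cluster node, the latencies shown would shrink accordingly. The class and method names are made up for illustration.

```java
public class InFlightEstimate {
    /** Commands per second issued by one instance, assuming an even spread within the environment. */
    static double perInstanceRate(double totalOpsPerSec, double trafficShare, int instances) {
        return totalOpsPerSec * trafficShare / instances;
    }

    /** Little's law: mean latency (seconds) needed to keep `inFlight` commands outstanding. */
    static double latencyToSaturate(int inFlight, double ratePerSec) {
        return inFlight / ratePerSec;
    }

    public static void main(String[] args) {
        double totalOps = 30_000 + 6_000; // batch reads + writes per second
        int limit = 1 << 16;              // asyncMaxCommands = 65536

        double envA = perInstanceRate(totalOps, 0.80, 10); // "Env A": 80% of traffic, 10 instances
        double envB = perInstanceRate(totalOps, 0.20, 4);  // "Env B": 20% of traffic, 4 instances

        System.out.printf("Env A: %.0f cmds/s per instance, %.1f s latency to fill the limit%n",
                envA, latencyToSaturate(limit, envA));
        System.out.printf("Env B: %.0f cmds/s per instance, %.1f s latency to fill the limit%n",
                envB, latencyToSaturate(limit, envB));
    }
}
```

Under these assumptions, an “Env A” instance issues roughly 2,880 commands per second, so keeping 65536 commands in flight would require a mean latency of about 23 seconds. That is far beyond anything the nodes' 7% CPU would suggest, which makes me suspect that commands are not completing (or not being counted as complete) rather than the cluster being too slow.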
Any thoughts? Ideas? Tests I can run in order to find the issue?