I have a single Aerospike client, created with the following configuration:
ClientPolicy rwPolicy = new ClientPolicy();
rwPolicy.maxConnsPerNode = 300;
rwPolicy.maxSocketIdle = 0;
rwPolicy.minConnsPerNode = 15;
I use 8 threads in the event loops.
I have a scenario where I need to do a lot of “put” operations (50k-500k) in short bursts. I’m simply invoking client.put in a for loop. Some of these writes can happen in parallel too.
After some writes (10k-50k), my program crashes with an out-of-memory error. Note that I don’t receive a Java OutOfMemoryError. Instead, Kubernetes kills my container with an OOM error (exit code 137). So this likely means native memory usage is exceeding the container limit. I’ve left a couple of GBs for native memory, but still run into this issue.
Have others run into this issue too? Do we need to throttle the writes?
We have an Aerospike Enterprise and Managed Service license if that matters.
Any help will be appreciated. Thanks.
Yes, the puts should be throttled. By default, async put operations are immediately executed by an event loop. Executing 500k put transactions immediately is likely to exceed maxConnsPerNode or cause the async delay queue to grow too large, depending on your event loop configuration.
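For illustration only, here is a minimal sketch of what throttling means at the application level, using nothing but the JDK. It is not the client's own mechanism (that is EventPolicy), and BoundedSubmitter is a hypothetical helper name: it blocks the submitting loop once a fixed number of async operations are in flight, so a 500k-iteration for loop cannot start everything at once.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper: caps how many async operations may be in flight at once.
final class BoundedSubmitter {
    private final Semaphore permits;
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    BoundedSubmitter(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Blocks the caller until a permit is free, then starts the operation.
    // The permit is returned when the operation's future completes,
    // whether it succeeded or failed.
    <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> op) {
        permits.acquireUninterruptibly();
        int now = inFlight.incrementAndGet();
        peak.accumulateAndGet(now, Math::max);
        CompletableFuture<T> started;
        try {
            started = op.get();
        } catch (RuntimeException e) {
            // The operation never started, so give the permit back right away.
            inFlight.decrementAndGet();
            permits.release();
            throw e;
        }
        return started.whenComplete((result, error) -> {
            inFlight.decrementAndGet();
            permits.release();
        });
    }

    // Highest number of operations observed in flight at the same time.
    int peakInFlight() {
        return peak.get();
    }
}
```

In a setup like the one in the question, the supplier would presumably wrap an async client.put whose listener completes the future; that wiring is an assumption and is left out here. Blocking the producer is a deliberate choice: it applies back-pressure instead of buffering pending writes in memory.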
EventPolicy should be set in your case. maxCommandsInProcess should be set to the maximum number of transactions that can consistently be run concurrently on your machine. maxCommandsInQueue should be set to a value that does not cause out-of-memory errors when the async delay queue is full.
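A rough sketch of that configuration, reusing the ClientPolicy values from the question, might look like the following. The numbers 30 and 10_000 are placeholders rather than recommendations, ThrottledClientFactory is a hypothetical name, and the host and port are whatever your cluster uses. As far as I understand the client documentation, both limits apply per event loop, and maxCommandsInProcess multiplied by the number of event loops should stay at or below maxConnsPerNode.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Host;
import com.aerospike.client.async.EventLoops;
import com.aerospike.client.async.EventPolicy;
import com.aerospike.client.async.NioEventLoops;
import com.aerospike.client.policy.ClientPolicy;

// Hypothetical factory; requires the Aerospike Java client on the classpath.
final class ThrottledClientFactory {
    static AerospikeClient create(String host, int port) {
        EventPolicy eventPolicy = new EventPolicy();
        // Placeholder value: 8 loops * 30 = 240 commands, below maxConnsPerNode (300).
        eventPolicy.maxCommandsInProcess = 30;
        // Placeholder value: bounds the delay queue so it cannot grow without limit.
        eventPolicy.maxCommandsInQueue = 10_000;

        // 8 event loop threads, as in the question.
        EventLoops eventLoops = new NioEventLoops(eventPolicy, 8);

        ClientPolicy rwPolicy = new ClientPolicy();
        rwPolicy.eventLoops = eventLoops;
        rwPolicy.maxConnsPerNode = 300;
        rwPolicy.maxSocketIdle = 0;
        rwPolicy.minConnsPerNode = 15;

        return new AerospikeClient(rwPolicy, new Host(host, port));
    }
}
```

With a bounded queue, a command submitted while the queue is full is rejected (the client reports this as AerospikeException.AsyncQueueFull), so the calling code needs to retry or back off in that case.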
Just to add to Brian’s suggestions: Have you considered batch operations? Assuming batching is possible, they can provide better resource utilization and better throughput.
Thanks Brian and neelp.
I played around with the EventPolicy like Brian suggested. I also upgraded the Aerospike version. These seem to have helped with the situation.
We’re currently doing asynchronous single requests, though due to the pattern of client requests, this sometimes hits 50k-500k requests in parallel. Reading up on batch operations, it does seem like they’ll help. When we implemented our solution, batch ops weren’t supported.
I’ll keep an eye on the jobs, and if the memory usage goes crazy again, I’ll migrate the code to use batch requests.