Unpredictable excessive memory usage during migration on Aerospike cluster


Problem Description

A properly sized Aerospike cluster starts migrating. Aerospike statistics report memory use as expected, but operating system monitoring shows memory usage far higher than expected, and the system starts swapping.

Explanation

During migration, Aerospike currently loads an entire partition of data into memory for each thread configured with the ‘migrate-threads’ parameter.

http://www.aerospike.com/docs/reference/configuration/#migrate-threads

Once a partition is loaded into memory, its records are transmitted one at a time. When all the records have been shipped, the partition is dropped from memory. This memory usage is not shown in the Aerospike stats, so it is not included in calculations such as breaching high water marks or entering stop_writes = true.
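
As a rough illustration of how large this untracked memory can become, the following Python sketch multiplies an average partition size by the number of migrate threads. It assumes that data is spread evenly over the 4096 partitions of a namespace and that a loaded partition occupies about as much memory as its stored data. Both are simplifying assumptions, and the function name and its parameters are hypothetical, not part of any Aerospike tool.

```python
# Rough estimate only: assumes evenly sized partitions and that a loaded
# partition takes about as much memory as its stored data.
PARTITIONS_PER_NAMESPACE = 4096  # Aerospike splits each namespace into 4096 partitions


def estimate_untracked_migration_memory(namespace_bytes_unique, migrate_threads):
    """Return an approximate number of bytes held by outgoing migrations.

    namespace_bytes_unique: size in bytes of ONE copy of the namespace's
        data across the whole cluster (hypothetical input, measured by you).
    migrate_threads: value of the migrate-threads parameter on the node.
    """
    if namespace_bytes_unique < 0 or migrate_threads < 0:
        raise ValueError("inputs must be non-negative")
    avg_partition_bytes = namespace_bytes_unique / PARTITIONS_PER_NAMESPACE
    # One partition is held in memory per migrate thread.
    return avg_partition_bytes * migrate_threads
```

For example, with 400 GiB of unique data and migrate-threads set to 4, the estimate is 4 partitions of about 100 MiB each, or roughly 400 MiB per namespace that the Aerospike stats would not report.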

When a node receives migrations, it does so record by record. This may happen before it has started shipping out partitions itself, so the increase in memory can be compounded. The memory used to receive migrate writes is tracked in Aerospike stats and so does have a bearing on high water marks and stop_writes.

Solution

There are multiple configuration parameters that can be used to tune migrations; these are given here.

http://www.aerospike.com/docs/operations/tune/migration/

When these parameters are changed, they should be increased incrementally, observing the effect on the Linux system after each step, rather than in one large jump. In particular, take care to increase the migrate-threads parameter in a controlled manner.
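
One possible way to follow this incremental approach is sketched below in Python. It parses the text of /proc/meminfo to see whether swap is in use, and only suggests raising migrate-threads by one step when it is not. The helper names, the step size and the "no swap in use" rule are illustrative assumptions. The asinfo command string is likewise an assumption about the dynamic configuration syntax and should be checked against the documentation for the version in use.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo):
    """Swap currently in use, in kB, from a parsed meminfo dict."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


def next_migrate_threads(current, target, swap_kb, step=1):
    """Suggest the next migrate-threads value (hypothetical policy).

    Hold the current value if any swap is in use, otherwise move one
    step towards the target without overshooting it.
    """
    if swap_kb > 0 or current >= target:
        return current
    return min(current + step, target)


def set_config_command(threads):
    """Build the command line for a dynamic change (assumed syntax)."""
    return 'asinfo -v "set-config:context=service;migrate-threads=%d"' % threads
```

On a Linux node, the text passed to parse_meminfo would come from reading /proc/meminfo. The command returned by set_config_command is meant to be reviewed and run by hand after each step, so that the effect on the system can be observed before increasing further.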

Notes

More complete tracking of memory usage from within Aerospike is on the roadmap for future releases.