We are trying to upgrade our Aerospike cluster to the enterprise version.
We have a cluster of 6 nodes running aerospike-community 3.4.0. Last Thursday we upgraded 1 node to 3.5.14 enterprise. It started on Friday, and since then migrations have been running and our service, which reads heavily from Aerospike, has had poor performance.
On the Zabbix memory usage graph we can see massive cached memory usage right after the restart.
We use the Java client library, version 3.0.35.
All our Java services that read from Aerospike (20K reads per second) have had performance trouble since the migrations started.
Updating the client to 3.1.3 doesn't improve things.
Why do migrations take so long?
Why do they affect the performance of the whole cluster?
We need to upgrade 5 more servers, and every upgrade means about a week of bad performance.
Migrations are a background process that moves partition data between nodes; it is a longer-running task that is throttled down by default. Because migration traffic takes a slice of your normal read/write throughput, you will see normal read/write transaction throughput a notch lower while migrations run. Migrations can be tuned to run slower or faster depending on what you are trying to achieve, as in the sketch below.
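A rough sketch only (parameter names and defaults vary by server version, so please check them against the documentation for your 3.5.x build): migration speed is controlled from the service stanza of aerospike.conf, for example:

    service {
        # Threads used to transmit migration data. Keeping this at the
        # default of 1 minimizes impact on client traffic; raising it
        # speeds migrations up at the cost of read/write latency.
        migrate-threads 1

        # Limit on how many partitions this node accepts concurrently
        # (defaults differ between versions).
        migrate-max-num-incoming 4
    }

If your version supports it, these can also be changed at runtime with asinfo (e.g. asinfo -v 'set-config:context=service;migrate-threads=1'), so you don't have to restart a node that is already migrating.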
I think the real question is: why are all these migrations happening when the node being restarted essentially already has the data? Or doesn't that work when moving from community to enterprise?
Well, a bit of good news here: as of 3.5.8, “Migration performance improved by modifying initial partition balancing scheme for nodes joining a cluster.” This will drastically reduce the amount of time migrations take to complete.
This is a light load; can you describe your server hardware?
Number of nodes in the cluster?
Cloud provided hosts or bare metal hosts?
Number of CPUs per CPU Socket and number of CPU sockets?
Type of storage (SSD, HDD)?
If bare metal, is there a RAID controller?
If bare metal and SSD, make/model of SSDs?
How much RAM?
Other details that may be useful?
Can you provide your aerospike.conf?
What is your SLA?
Migrations will increase the number of IOPS on your storage; they will also disturb any temporal locality provided by the post-write queue (cache) — see the sketch below.
When the node returns, we need to resync its partitions with the other nodes, which may now have newer versions of the records.
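If the cache-hit rate matters for your read-heavy workload, the post-write queue size is configurable per namespace. A hedged sketch, assuming a device-backed storage engine; the namespace name, device path, and the 1024 value are placeholders, and you should verify the parameter against your server version's docs:

    namespace test {
        storage-engine device {
            device /dev/sdb
            # Number of write blocks kept cached after being flushed;
            # a larger value preserves more temporal locality (default 256).
            post-write-queue 1024
        }
    }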
The recommended setting for service-threads and transaction-queues is the same as the number of CPU cores. Can you please try this modification and let us know how your throughput has changed?
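A minimal sketch of that change in the service stanza of aerospike.conf (16 is just a placeholder; substitute your actual core count):

    service {
        # Set both to the number of CPU cores on the host.
        service-threads 16
        transaction-queues 16
    }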