We have an Aerospike cluster of 8 nodes (v 3.9.0.3). Some of its key stats are as follows:
Total No of Records - 1.2B
Total Disk Usage - 740GB
Total Disk Capacity - 2TB
I’m trying to take a backup of all the data using the asbackup tool from a separate backup node.
After initiating the backup, the Aerospike server nodes are maxing out their disks: iostat shows 100% util, with a read throughput of ~200 MB/s on each node. Since it’s a complete backup, I would expect each Aerospike node to transfer a similar amount of data to the backup node. However, the nodes are each sending only ~6 MB/s to the backup node, so I’m getting only around 50 MB/s of total backup throughput. Is such a huge difference between disk and network throughput expected? If so, how can it be explained?
I found the behaviour to be similar during scan jobs: over the course of a scan, each Aerospike node ends up reading far more data from disk (read throughput × total scan time) than its disk usage, or even its total disk capacity.
Can you provide the asbackup command you are using? Also, for backups we usually take them locally on each node, e.g. asbackup -l 127.0.0.1:3000 -o - | pigz -2 > backup.file, which helps with network concerns. If the disk is being maxed out, you may want to adjust your cluster’s scan-threads, or specify -N in your asbackup command to throttle things. Scans can be very resource intensive, so it really should start there. What parameters are you passing? What does your normal IO look like?
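For example, something along these lines run on one of the server nodes would keep the backup local and throttled. This is only a sketch: the exact -N/--nice argument format and whether scan-threads can be changed dynamically vary between versions, so check asbackup --help and the configuration reference for 3.9 first.

# throttled local backup, compressed on the fly; -N limits asbackup's bandwidth
asbackup -n {namespace} -l 127.0.0.1:3000 -N 50 -o - | pigz -2 > backup.file

# reduce server-side scan threads (service context config)
asinfo -v 'set-config:context=service;scan-threads=2'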
However, this behaviour is not specific to asbackup but applies to scan operations in general.
I’m okay with the disk maxing out and understand that it can be throttled if required. But I’m not able to figure out why the actual backup, i.e. the network activity, is so slow compared to the disk read throughput. Normal IO (without the scan) is minimal, with disk util. around 5-10%.
I would have thought that this might be because you’re specifying a set name, which would require inspecting a bunch of records and then not transferring them. One reason I can think of for the network traffic being larger is that Aerospike stores data on disk very efficiently, but when it has to export to a backup file it has to add extra parameters, definitions, and formatting to make the backup restorable. I wouldn’t expect that to make such a large difference though… What is your main concern here: understanding the disparity between the disk IO and network data rates, or getting a proper backup? Why are you running such an old version of Aerospike? What version of the backup utility are you running? Do you see the same behavior if you run a local backup like asbackup -n {namespace} -l 127.0.0.1:3000 -d {dir} ?
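For instance, with a hypothetical namespace and output directory substituted in:

# local backup of one namespace to a directory on the node itself
asbackup -n test -l 127.0.0.1:3000 -d /var/aerospike_backup

If a local backup to a directory shows the same disk-vs-output disparity, that would rule out the network path.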
Actually, it’s the reverse: the network activity is much lower than the disk activity. The servers are reading approx. 200 MB/s from the disk (100% util), whereas they are sending only around 6 MB/s each to the backup VM over the network. I’m not able to figure out what the reason for this could be. There is no filtering of records, as I’m taking a backup of the full namespace. Overall, it’s slowing down the backup too.
I’m using the same version of asbackup, i.e. 3.9.0.3. It’s in our pipeline to upgrade to a more recent version soon.
That version is so old I can’t even look up the CE changelog notes for it in the usual server release-notes place. I think an upgrade would probably be the best first bet. Are you sure there is only one scan going on? Did you try a local backup? Can you provide a screenshot or paste of how you’re measuring IO throughput? Is it steady, or does it spike to 200 MB/s?
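For example, something like this run alongside the backup would show whether the read rate is steady or bursty (assuming a sysstat iostat that supports the -x and -m flags):

# extended per-device stats in MB/s, refreshed every second; watch %util and the read-throughput column
iostat -xm 1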
I agree with the version thing. Hopefully, we will upgrade it soon.
Regarding the issue, we are no longer blocked. Though the backup was slow, it eventually completed. I just wanted to check whether it’s expected to have such a big discrepancy between network and disk activity during a scan or backup. Our Aerospike servers are on Google Cloud, and I was using the GCP console for monitoring the disk and network. For the disk, I double-checked using iostat and it reported the same throughput. There was only one scan running, and the disk throughput was steady at 200 MB/s.