Asbackup - Why is asbackup performing slowly



Problem description

When running asbackup, you may find it slower than expected. This article looks at the most common bottlenecks and how to address them.

Explanation

When checking the performance of asbackup, we have to take into account possible bottlenecks. This usually comes down to:

  1. Network being over-utilized – this can be easily confirmed by checking network utilization against the maximum link/connection speed between the cluster nodes and the server where the backup is taking place. Note that the over-utilization could be on either the Aerospike server or the backup server side (or at any number of hops in between).
  2. Aerospike disks being over-utilized – a backup is basically a scan, and for namespaces that do not have data in memory this increases the load on the storage subsystem. In this case you will see disk utilization (iostat / monitoring) hit 100% and await/w_await spike during the backup. Pay particular attention to await/w_await. If disks reach 95-100% utilization but await stays very low (well below 1), the disks are usually still coping with the load. The iostat await values report the average time, in milliseconds, that I/O requests take to be served, including time spent waiting in the queue. This should normally be below 1 for most operations and disks.
  3. The backup server disks being over-utilized – this usually happens if too many nodes are being backed up at the same time and the backup server disks cannot keep up.
  4. The curious case of gzip.

Network overutilized

As highlighted above, the easiest check is to look at network utilization and packet drops. Alternatively, test network capacity (or remaining capacity) using iperf. More can be read in this article.
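The utilization check described above can be sketched in shell. This is only a rough illustration, assuming a Linux host that exposes byte counters under /sys/class/net; the function names, the interface name eth0 and the link speed are placeholders, not values from this article.

```shell
#!/bin/sh
# link_util_pct BYTES_BEFORE BYTES_AFTER SECONDS LINK_MBPS
# Prints the share of the link speed used between two counter samples,
# as a percentage with one decimal place.
link_util_pct() {
    awk -v before="$1" -v after="$2" -v secs="$3" -v mbps="$4" \
        'BEGIN { printf "%.1f\n", ((after - before) * 8 / secs) / (mbps * 1000000) * 100 }'
}

# sample_rx_util IFACE SECONDS LINK_MBPS
# Reads the receive counter for IFACE twice, SECONDS apart, and reports
# utilization. Swap rx_bytes for tx_bytes to measure the sending side.
sample_rx_util() {
    counter="/sys/class/net/$1/statistics/rx_bytes"
    before=$(cat "$counter")
    sleep "$2"
    after=$(cat "$counter")
    link_util_pct "$before" "$after" "$2" "$3"
}

# Example (eth0 and a 10000 Mbit/s link are assumptions):
# sample_rx_util eth0 5 10000
```

Run it on both the Aerospike node and the backup server while the backup is in progress. A value that stays close to 100 suggests the link is the bottleneck.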

To fix this, you can either:

  1. Add more network capacity.
  2. Utilize the asbackup --nice feature, to slow down reads. Refer to the asbackup output options.
  3. Run the backup locally on each node (i.e. run asbackup on each node with --node-list 127.0.0.1:3000), then copy the backup over the network to a central location, either limiting the copy speed or compressing the local backup with gzip before transferring it. Refer to asbackup connection options.

Regarding point 3, see also the section The curious case of gzip below.
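Option 3 could look roughly like the sketch below. It is an illustration, not a tested procedure: the destination host, the output file name and the use of rsync with --bwlimit for the rate-limited copy are assumptions. By default it only prints the commands it would run.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) prints commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# local_backup_and_ship NAMESPACE DEST KBPS [FILE]
# Backs up the local node only, then copies the file to DEST with the
# transfer rate capped by rsync. DEST (e.g. backuphost:/backups/) and
# the default FILE name are placeholders.
local_backup_and_ship() {
    ns="$1"; dest="$2"; kbps="$3"
    file="${4:-backup_local.asc}"
    run asbackup -n "$ns" --node-list 127.0.0.1:3000 -o "$file" &&
        run rsync --bwlimit="$kbps" "$file" "$dest"
}

# Example: local_backup_and_ship test backuphost:/backups/ 20000
```

Set DRY_RUN=0 to execute the commands once the printed output looks right for your environment.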

Disks are being overutilized

The best way to check this is to run iostat during the backup and watch disk utilization. This applies to points 2 and 3 in the summary, that is, to both the Aerospike server and the backup server. The result shows whether the disks have reached their limits, causing backups to slow down. More on iostat can be found in this article.
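As a rough aid for reading iostat output, the filter below prints only devices that are both highly utilized and showing elevated await. It is a sketch under assumptions: the function name and thresholds are illustrative. Because column positions differ between sysstat versions, the columns are located from the header line (using w_await when present, otherwise await).

```shell
#!/bin/sh
# busy_disks UTIL_PCT AWAIT_MS
# Reads `iostat -x` output on stdin and prints devices whose %util and
# await (or w_await) are both at or above the given thresholds.
busy_disks() {
    awk -v util_min="$1" -v await_min="$2" '
        $1 ~ /^Device/ {
            # Header line: remember where the interesting columns are.
            ucol = 0; wcol = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "%util") ucol = i
                if ($i == "await" || $i == "w_await") wcol = i
            }
            next
        }
        ucol && wcol && NF >= ucol && $ucol + 0 >= util_min && $wcol + 0 >= await_min {
            print $1, "util=" $ucol, "await=" $wcol
        }'
}

# Example: three 5-second samples, flag disks at >=95% util and >=1 ms await.
# iostat -x 5 3 | busy_disks 95 1
```

A disk that appears here repeatedly during the backup is a likely bottleneck. A disk at high utilization that never appears is probably still coping.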

If Aerospike server disks are found to be the cause of slow backups, you will need to add more capacity, for example by adding more nodes.

If backup server disks are found to be the cause of slow backups, you will need to either:

  1. Split backups so that some nodes are backed up to server1 and some to server2.
  2. Redesign the backup server to use faster disks or RAID0/5/6/10 to add more available write speed.
  3. If the scan triggered by the backup is too aggressive, refer to Managing Scan, specifically the scan-threads configuration option, to throttle the scan.
  4. Use gzip while performing backups. Note the case of gzip below as well.

The curious case of gzip

Many backup operations are performed by piping the output through gzip, for example:

asbackup -n test -o - | gzip > backup.asc.gz

While testing backup speed with and without gzip, you may find that backing up without gzip is faster. This is due to a CPU bottleneck. Gzip is, by design, single-threaded. With enough data coming in, it easily reaches 100% CPU utilization on the one core it uses, and that becomes the bottleneck. This can be easily checked using top. If the CPU utilization for gzip is at, or close to, 100%, it is fully using that one core and the backup cannot go any faster.
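For a non-interactive version of the top check, a small filter over ps output can flag gzip processes that are close to saturating a core. This is a sketch only: the function name and threshold are illustrative. Note also that ps reports CPU averaged over the lifetime of the process, so it is most meaningful for a gzip that has been running for a while.

```shell
#!/bin/sh
# hot_gzip MIN_PCT
# Reads "pcpu command" pairs on stdin and prints gzip processes whose
# CPU percentage is at or above MIN_PCT.
hot_gzip() {
    awk -v min="$1" '$2 == "gzip" && $1 + 0 >= min { print $2, $1 }'
}

# Example: list gzip processes using 95% of a core or more.
# ps -eo pcpu=,comm= | hot_gzip 95
```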

In this case you can do one of the following:

  1. Use gzip -1 (or -2/-3) to compress less. At the expense of larger backups, this uses less CPU and makes gzip faster.
  2. Back up only one node at a time, to its own local disk (run asbackup locally with --node-list 127.0.0.1:3000), with or without gzip, and transfer the backup file to a central location once that is done.
  3. If using a central backup server with cores to spare: run multiple asbackup processes, one per node, each piping through gzip. This allows multiple gzip operations to run at the same time. E.g.:
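# sh -c places the whole pipeline under nohup; the log redirect keeps log output out of the compressed stream
nohup sh -c 'asbackup -n test --node-list 192.168.0.10:3000 -o - | gzip > backup_10.asc.gz' > backup_10.log 2>&1 &
nohup sh -c 'asbackup -n test --node-list 192.168.0.11:3000 -o - | gzip > backup_11.asc.gz' > backup_11.log 2>&1 &
nohup sh -c 'asbackup -n test --node-list 192.168.0.12:3000 -o - | gzip > backup_12.asc.gz' > backup_12.log 2>&1 &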
...
  4. Use an alternative compression tool that supports multithreading. While we cannot recommend tools for their speed/reliability, multiple solutions exist, such as: pigz, pgzip, lbzip2, pbzip2, pxz. WARNING: replace gzip at your own risk.
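The options above can be combined in a small wrapper that builds one pipeline per node with a configurable compressor. This is a minimal sketch, not a recommended tool: the function names and the output file naming (which uses the full host address) are assumptions. Any replacement for gzip, such as pigz, should be tested by you before relying on it.

```shell
#!/bin/sh
# COMPRESSOR is the compression command used by every pipeline.
# "gzip -1" trades backup size for speed; a multithreaded tool such as
# "pigz" can be substituted at your own risk.
COMPRESSOR="${COMPRESSOR:-gzip -1}"

# backup_pipeline NAMESPACE HOST:PORT
# Prints one asbackup pipeline; the output file is named after the host.
backup_pipeline() {
    host="${2%%:*}"
    printf 'asbackup -n %s --node-list %s -o - | %s > backup_%s.asc.gz\n' \
        "$1" "$2" "$COMPRESSOR" "$host"
}

# backup_all NAMESPACE NODE...
# Starts one pipeline per node in the background and waits for all of them.
# The pipeline string is passed to sh -c, so only use trusted node names.
backup_all() {
    ns="$1"; shift
    for node in "$@"; do
        sh -c "$(backup_pipeline "$ns" "$node")" &
    done
    wait
}

# Example: backup_all test 192.168.0.10:3000 192.168.0.11:3000 192.168.0.12:3000
```

Calling backup_pipeline on its own prints the command without running it, which is a convenient way to review what backup_all would start.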

Another possible issue is gzip buffering. Gzip uses relatively small buffers. If they do not fill fast enough, gzip compresses smaller amounts before flushing out the result. This results in more seek operations and higher disk utilization. Fortunately, there is a Linux tool called buffer. It lets buffers fill before flushing to gzip, allowing more optimal operation. See https://linux.die.net/man/1/buffer for more information.

Notes

Be especially careful about running a full cluster backup on one of the nodes themselves. Either back up each node to itself, or use a separate server for backups. If you do run a full cluster backup from one of the nodes, consider using --nice or sizing the network on that node appropriately, because the network bandwidth required for the backup is very significant. Consider, for example, a 6-node cluster: running a full cluster backup from node 1 results in that node backing up its own data plus receiving all data over the network from the remaining 5 nodes. This is amplified if the backup is saved, for example, to a remote NFS share. In this case, node 1 receives data from 5 nodes and sends 6 nodes' worth of data (5 nodes plus itself) to the NFS share. That can easily result in network overload.
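The arithmetic in the 6-node example can be worked through with a small helper. It is illustrative only and assumes data is spread evenly across nodes; the function name and the 100 GB per-node figure are hypothetical.

```shell
#!/bin/sh
# coordinator_traffic NODES GB_PER_NODE
# Prints the inbound and outbound network volume for the node that runs a
# full cluster backup and writes the result to a remote share.
coordinator_traffic() {
    nodes="$1"; per_node="$2"
    inbound=$(( (nodes - 1) * per_node ))   # data pulled from the other nodes
    outbound=$(( nodes * per_node ))        # everything, sent on to the share
    echo "in=${inbound}GB out=${outbound}GB"
}

# Example with 6 nodes holding a hypothetical 100 GB each:
# coordinator_traffic 6 100
```

For 6 nodes at 100 GB each this gives 500 GB inbound and 600 GB outbound on the one node, in addition to its normal cluster traffic.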

Also consider that using gzip when running the backup on a node itself may result in one CPU core being used at 100% while the backup runs, if data arrives fast enough (from local disk, or from a multi-node backup over a fast network).

Keywords

ASBACKUP GZIP SLOW

Timestamp

10/01/2018