ASBackup not working C-4.5.0.9

migration
#1

Hi,

I am trying to back up Aerospike data using the asbackup tool, but I am getting an error. I am using Aerospike server C-4.5.0.9 with 7 nodes, RAM storage, and HDD persistence.

$ cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)

$ /usr/bin/asbackup -V
Aerospike Backup Utility
Version 3.2.10
C Client Version 4.5.0
Copyright 2015-2017 Aerospike. All rights reserved.

$ /usr/bin/asbackup -r --host 10.15.20.55,10.15.20.49,10.15.20.51,10.15.20.52,10.15.20.53,10.15.20.54 --namespace platform --directory /data/backup

Error from asbackup:

Error while running node scan for BB90EC9B96B1FAC - code -10: Socket read error: 104, 10.15.20.53:3000, 34490 at src/main/aerospike/as_socket.c:248

Error on node aerospike.log:

WARNING (scan): (scan.c:383) error sending to 10.15.20.65:37246 - fd 539 sz 1048697 Connection timed out

I have used the same command on Aerospike C-3.12 and it was working. Any idea why it’s not working?

#2

any update on this?

#3

Hi there,

I’m sorry to hear that you’ve run into issues while running an Aerospike backup.

Could you provide the following?

  • The complete output of asbackup.

  • Excerpts of your aerospike.log files from all cluster nodes that cover the time from starting the backup until the error sending to ... message.

  • Run iostat -x 5 in parallel to asbackup. I’m curious to see at what rate asbackup writes to your backup medium. Oh, and I need to know which device /data/backup resides on, so that I can match that directory to a device in the iostat output.
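In case it helps, here is a rough sketch of how the asbackup output and the iostat samples could be captured side by side. The helper names (backing_device, run_with_sampler) and the log file names are made up for illustration, and the commented-out invocation simply reuses the command from the first post with --verbose added; adjust it to your setup.

```shell
#!/bin/sh
# Sketch only: helper names and log file names are illustrative assumptions.

# Print the device that holds a directory (first column of POSIX df output),
# so it can be matched against the device names in the iostat output.
backing_device() {
    df -P "$1" | awk 'NR == 2 { print $1 }'
}

# Run a command while a sampler (e.g. "iostat -x 5") runs in the background,
# then stop the sampler and return the command's exit status.
# $1 = sampler command line, $2 = sampler log file, remaining args = command
run_with_sampler() {
    sampler=$1
    sampler_log=$2
    shift 2
    sh -c "exec $sampler" > "$sampler_log" 2>&1 &
    sampler_pid=$!
    status=0
    "$@" || status=$?
    kill "$sampler_pid" 2>/dev/null || true
    wait "$sampler_pid" 2>/dev/null || true
    return "$status"
}

# Example invocation, commented out so that sourcing this file does nothing:
# backing_device /data/backup
# run_with_sampler "iostat -x 5" iostat.log \
#     asbackup -r --verbose \
#     --host 10.15.20.55,10.15.20.49,10.15.20.51,10.15.20.52,10.15.20.53,10.15.20.54 \
#     --namespace platform --directory /data/backup > asbackup.log 2>&1
```

The resulting iostat.log and asbackup.log cover the same time window, which makes it easier to line up the write rate on the backup device with the moment the error appears.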

Hypothesis: It seems like asbackup doesn’t keep up with the speed at which the server sends records.

If so, then asbackup pushes back and slows down the scan on the server (which collects the records for asbackup). When asbackup pushes back too long and too hard, the server will time out. That’s what seems to be happening ("Connection timed out"). On timeout, the server closes the connection, which then makes asbackup abort with socket read error 104 (TCP connection reset).

This doesn’t explain why things work with Aerospike 3.12. The scan mechanism hasn’t changed significantly since 3.12. But let’s please start by collecting the above data and take it from there.

#4

Oh, and please run asbackup in verbose mode ("--verbose") when capturing the output of asbackup.

Should the above be true, i.e., should asbackup really be a bottleneck that causes the server to time out, then you might want to try reducing parallelism during the backup using the --parallel option.

By default, asbackup scans up to 10 nodes in parallel. Should it turn out that asbackup cannot keep up with 10 server nodes, try running the backup with --parallel 1 as an experiment. If that makes the problem go away, increase parallelism to 2, etc.
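The experiment could be scripted roughly as follows. This is only a sketch: try_backup and highest_working_parallel are hypothetical helper names, and the hosts, namespace, and directory are copied from the first post. Because the original command passes -r, each attempt starts from an empty backup directory.

```shell
#!/bin/sh
# Sketch only: try_backup and highest_working_parallel are hypothetical helpers.

# Run one backup with the given number of nodes scanned in parallel.
try_backup() {
    asbackup -r \
        --host 10.15.20.55,10.15.20.49,10.15.20.51,10.15.20.52,10.15.20.53,10.15.20.54 \
        --namespace platform --directory /data/backup --parallel "$1"
}

# Try parallelism 1, 2, ... up to a limit and stop at the first failure.
# Prints the highest value that succeeded (0 if even 1 failed).
# $1 = upper bound, $2 = command that takes the parallelism as its only argument
highest_working_parallel() {
    max=$1
    attempt=$2
    best=0
    n=1
    while [ "$n" -le "$max" ]; do
        # Send the attempt's own output to stderr so stdout carries only the result.
        if "$attempt" "$n" >&2; then
            best=$n
        else
            break
        fi
        n=$((n + 1))
    done
    echo "$best"
}

# Example, commented out so that sourcing this file does nothing:
# highest_working_parallel 10 try_backup
```

A single successful run at a given setting is not conclusive if the failure is intermittent, so it may be worth repeating each step a few times before settling on a value.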

#5

You are right. It is an I/O issue: I see 10% IOWait during the backup. I tried with -w 1 and it works most of the time now.