Hi there,
Thanks for reporting this. Interesting. This looks like the cluster is sending records too fast for asbackup to keep up. Here’s what seems to be happening in detail:
- The “Bad file descriptor” error message means that the cluster node has closed the connection to the client (= asbackup).
- asbackup issues a scan to get all records from all cluster nodes. Each cluster node then starts sending records to asbackup. asbackup receives these records and writes them to the backup file(s).
- If the client (= asbackup) isn’t fast enough to read records from the client connection, then records start piling up on the cluster node’s end of the connection. Once this backlog reaches 10 seconds’ worth of records, the cluster node outputs an error and closes the client connection. This seems to be what we’re seeing here.
Can you tell us a little more about your cluster node machines and the backup machine, e.g., the number of CPUs and the storage you’re using for your namespaces?
It would also be interesting to understand whether the backup machine has too little CPU power or too little storage bandwidth to write the backup files fast enough. What is the CPU usage on the backup machine while the backup is running? Can you run top real quick to check?
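If it helps, here’s a minimal way to grab a few CPU snapshots while the backup runs, assuming a Linux box with the usual procps top (the exact header line can differ between top versions):
# Take three CPU-summary snapshots, 5 seconds apart, while asbackup is running.
# High %us/%sy suggests a CPU bottleneck; high %wa points at storage.
top -b -n 3 -d 5 | grep -E '^(%Cpu|Cpu)'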
Also, let’s try the following experiment, which would help us figure out whether the storage bandwidth on the backup machine is the problem. Instead of backing up to a real backup file, let’s back up to /dev/null. Something like this would do the trick:
asbackup --namespace foobar --output - >/dev/null
Passing - as the file name to the --output option tells asbackup to back up to stdout, which we then redirect to /dev/null with >/dev/null.
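If you’d like a rough throughput number out of that experiment, you could also time the run; this is just an optional extra, not required for the test:
# Time the /dev/null run. Dividing the namespace's data size by the elapsed
# time gives a rough upper bound on how fast asbackup can pull records once
# storage writes are taken out of the picture.
time asbackup --namespace foobar --output - >/dev/null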
If backing up to /dev/null succeeds, then we know that the storage bandwidth of your backup machine is likely the issue here. If that turns out to be true and your CPU load isn’t too high, then compression might be an option. Compression trades CPU power for storage bandwidth: it takes CPU cycles to compress data, but the resulting data that’s being written to storage is much smaller. You can compress backup files on the fly like this:
asbackup --namespace foobar --output - | gzip -1 >backup-file.asb.gz
Again we back up to stdout by saying --output -. This time, however, instead of sending the backup to /dev/null, we pipe it to gzip and redirect the output of gzip to a compressed backup file, backup-file.asb.gz.
I’m using gzip with the -1 option to save CPU cycles. However, if your backup machine has a powerful CPU, then you might get away with higher compression. Otherwise, you would run into the same problem again: asbackup wouldn’t be able to keep up with the cluster node, because compression takes too much time.
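If you want to gauge what a higher level would cost before committing to it, one way is a quick offline comparison on an already-taken, uncompressed backup file; sample.asb is just a placeholder name here:
# Compare gzip levels on an existing uncompressed backup file. The elapsed
# times and the byte counts show the CPU-vs-size trade-off without touching
# the cluster at all.
time gzip -1 -c sample.asb | wc -c
time gzip -6 -c sample.asb | wc -c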
Obviously, you might just as well compress using lz4 or any other command line tool that can read from stdin and write to stdout.
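For instance, assuming lz4 is installed on the backup machine, the equivalent pipeline would look roughly like this:
# Same idea as the gzip pipeline, but with lz4, which compresses a bit less
# yet uses considerably fewer CPU cycles.
asbackup --namespace foobar --output - | lz4 >backup-file.asb.lz4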
Let me know how things go.
Thomas