Asbackup failing


#1

Hello,

We have tried several times to create backups with asbackup (the Aerospike server and asbackup are both on the same, latest version).

This is what we got:

2017-02-09 20:16:17 GMT [INF] [20867] ~5h4m59s remaining
2017-02-09 20:16:28 GMT [INF] [20867] 4% complete (~2669 KiB/s, ~1174 rec/s, ~2328 B/rec)
2017-02-09 20:16:28 GMT [INF] [20867] ~2d6h14m53s remaining
2017-02-09 20:16:43 GMT [ERR] [20870] Error while running node scan for 1389179C8D0E1B90 - code -10: Bad file descriptor at src/main/aerospike/as_socket.c:495
2017-02-09 20:16:43 GMT [INF] [20869] Node scan for 1389AF97A10E1B90 aborted
2017-02-09 20:16:43 GMT [INF] [20871] Node scan for 1389EBE5B60E1B90 aborted
2017-02-09 20:16:43 GMT [INF] [20868] Node scan for 1389FB38920E1B90 aborted
2017-02-09 20:16:43 GMT [INF] [20867] Backed up 10549160 record(s), 0 secondary index(es), 0 UDF file(s) from 4 node(s), 26712246629 byte(s) in total (~2532 B/rec)

Any ideas how to solve this? We need a fast fix for this; currently we are living without backups…


#2

Hi there,

Thanks for reporting this. Interesting. This looks like the cluster is sending records too fast for asbackup to keep up. Here’s what seems to be happening in detail:

  • The “Bad file descriptor” error message means that the cluster node has closed the connection to the client (= asbackup).
  • asbackup issues a scan to get all records from all cluster nodes. Each cluster node then starts sending records to asbackup. asbackup receives these records and writes them to the backup file(s).
  • If a client (= asbackup) isn’t fast enough to read records from its connection, then records start piling up on the cluster node’s end of that connection. Once this backlog reaches 10 seconds worth of records, the cluster node outputs an error and closes the connection. This seems to be what we’re seeing here.
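To make the backlog mechanism above more concrete, here is a toy model in shell. It rests on assumptions that are not confirmed anywhere in this thread: that "10 seconds worth of records" means ten times the node's per-second send rate, and that both rates are constant. The function name and the rates are hypothetical; this is not how the server actually accounts for the backlog.

```shell
#!/bin/sh
# Toy model only (an assumption, not Aerospike's real accounting):
# the node sends at send_rate KiB/s, asbackup drains at drain_rate KiB/s,
# and the node closes the connection once the backlog holds 10 seconds'
# worth of sent data.
seconds_until_disconnect() {
    send_rate=$1     # KiB/s produced by the cluster node (hypothetical)
    drain_rate=$2    # KiB/s consumed by asbackup (hypothetical)
    limit=$((send_rate * 10))   # backlog equal to 10 seconds of data
    backlog=0
    elapsed=0
    if [ "$drain_rate" -ge "$send_rate" ]; then
        echo "never"            # the client keeps up, no backlog builds
        return 0
    fi
    # Advance one second at a time until the backlog reaches the limit.
    while [ "$backlog" -lt "$limit" ]; do
        backlog=$((backlog + send_rate - drain_rate))
        elapsed=$((elapsed + 1))
    done
    echo "$elapsed"
}

seconds_until_disconnect 4000 2669   # node faster than client
seconds_until_disconnect 2000 2669   # client faster than node
```

Under this model, any sustained gap between the two rates eventually trips the limit; a larger gap only makes it happen sooner.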

Can you say a little more (e.g., number of CPUs and the storage you’re using for your namespaces) about your cluster node machines and the backup machine?

It would also be interesting to understand whether the backup machine has too little CPU power or too little storage bandwidth to write the backup files fast enough. What is the CPU usage on the backup machine while the backup is running? Can you run top real quick to check?

Also, let’s try the following experiment, which would help us figure out whether the storage bandwidth on the backup machine is the problem. Instead of backing up to a real backup file, let’s back up to /dev/null. Something like this would do the trick:

asbackup --namespace foobar --output - >/dev/null

When you pass - as the file name to the --output option, asbackup backs up to stdout, which we redirect to /dev/null by saying >/dev/null.

If backing up to /dev/null succeeds, then we know that the storage bandwidth of your backup machine is likely the issue here. If that turns out to be true and your CPU load isn’t too high, then compression might be an option. Compression trades CPU power for storage bandwidth: it takes CPU cycles to compress data, but the resulting data that’s being written to storage is much smaller. You can compress backup files on the fly like this:

asbackup --namespace foobar --output - | gzip -1 >backup-file.asb.gz

Again, we back up to stdout by saying --output -. This time, however, instead of sending the backup to /dev/null, we pipe it to gzip and redirect the output of gzip to a compressed backup file, backup-file.asb.gz.
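If you want to try the shape of this pipeline before running it against the cluster, a sketch like the following may help. It uses placeholder data in place of asbackup (so it says nothing about real backup contents), writes to a temporary directory, and then checks the result with gzip -t and a byte count. The file names and the 1 MiB size are arbitrary choices for illustration.

```shell
#!/bin/sh
# Stand-in for "asbackup --namespace foobar --output -": any command that
# writes to stdout can be used to exercise the pipeline.
workdir=$(mktemp -d)
sample="$workdir/sample.asb"
compressed="$workdir/backup-file.asb.gz"

# Generate 1 MiB of placeholder data (NOT a real backup).
head -c 1048576 /dev/zero > "$sample"

# Same pipeline shape as above: producer | gzip -1 > file
cat "$sample" | gzip -1 > "$compressed"

# gzip -t tests the integrity of the compressed file without writing
# the decompressed data anywhere.
if gzip -t "$compressed"; then
    echo "archive ok"
fi

# Decompress to stdout and count the bytes; this should match the input size.
restored_bytes=$(gzip -dc "$compressed" | wc -c | tr -d ' ')
echo "$restored_bytes"

# Clean up the temporary files.
rm -rf "$workdir"
```

Running gzip -t on the real backup-file.asb.gz after a backup is a cheap way to catch a truncated file, for example one left behind by an aborted scan.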

I’m using gzip with the -1 option to save CPU cycles. If your backup machine has a powerful CPU, then you might get away with higher compression. With a weaker CPU, however, higher compression would bring back the same problem: asbackup wouldn’t be able to keep up with the cluster node, because compression takes too much time.

Obviously, you might just as well compress using lz4 or any other command line tool that can read from stdin and write to stdout.

Let me know how things go.

Thomas


#3

@tlo Would the --parallel option (how many nodes are in the cluster?) and --nice 1 (traffic is running at 2.6 MB/s) help throttle down the backup?


#4

@pgupta Oh! Yes! Good point! I had forgotten that we have --parallel! Yes. You are right. Using, for example, --parallel 1 would reduce the load on the backup machine by roughly 75% (as we have a 4-node cluster and only one node would be scanned at a time). This could very well be enough to keep the cluster from overwhelming the backup machine.
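The 75% figure is simple arithmetic, sketched below for other cluster sizes. It is a rough estimate that assumes every node sends at the same rate and that --parallel N limits asbackup to N node scans at a time; the function name is made up for this example.

```shell
#!/bin/sh
# Rough estimate only: assumes all nodes send at the same rate and that
# --parallel N means at most N node scans run concurrently.
load_reduction_percent() {
    nodes=$1       # number of nodes in the cluster
    parallel=$2    # value passed to --parallel
    if [ "$parallel" -ge "$nodes" ]; then
        echo 0     # all nodes are scanned at once, so nothing is saved
        return 0
    fi
    echo $(( (nodes - parallel) * 100 / nodes ))
}

load_reduction_percent 4 1   # 4-node cluster, one node scan at a time
load_reduction_percent 4 2   # 4-node cluster, two node scans at a time

# The corresponding backup command would look something like this
# (namespace and file name are placeholders):
#   asbackup --namespace foobar --parallel 1 --output backup-file.asb
```

The trade-off is backup duration: scanning one node at a time means the whole backup takes correspondingly longer.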

The --nice option wouldn’t be helpful in this case. It works by artificially slowing down the backup machine, so that it pushes back against the cluster sending too fast. Essentially, --nice X simulates a storage bandwidth of X MiB/s on the backup machine.

I’m now realizing that the --nice option doesn’t really work at the moment! If the client (asbackup) pushes back against a cluster node and records thus start piling up on the cluster node’s end of the connection, this will trigger the behavior that @blonkel is running into: as soon as 10 seconds worth of records have piled up on the cluster node’s end of the connection, the cluster node will tear down the connection. So, --nice will also trigger the exact behavior that’s causing problems here.

We didn’t always have this 10-second rule. Previously, the cluster node would simply pause and wait for the client to catch up. This was the behavior when we implemented the --nice option. Back then, it would make the client (asbackup) push back and, thus, slow down the cluster node. Now, with the 10-second rule, we’ve effectively broken the --nice option. Instead of slowing down, the cluster node closes the connection.

We’re already looking into a dynamic configuration option to disable this 10-second rule - or at least to set the 10-second time limit to something else. Once the Aerospike server supports this, asbackup should disable the 10-second rule for its scan, so that it can push back as much as it wants without a cluster node closing the connection.

Thomas