Aerospike backup skips a few keys

I was trying to back up all sets of a namespace. First I ran a backup with 100,000 records; the backup contained all the records, and I was able to truncate and restore them all successfully.
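For reference, a minimal sketch of the restore and truncate steps in this workflow (same host and backup directory as in the log below; the truncate info-command syntax varies by server version, and the set-level form is truncate:namespace=...;set=...):

asrestore --host localhost:3000 --namespace test --directory /home/chandan/aero-bk/BACKUP-2021-12-02-11-26-08-FULL
asinfo -v 'truncate-namespace:namespace=test'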

But when I tried backups with 1,000,000 and 2,000,000 records, a few hundred keys were missing from the backup file, so truncating and restoring results in key loss. Below are the logs from a run where the set contained 1,999,859 records. Any help will be appreciated.

Namespace contains 1999859 record(s)

Backed up 1999145 record(s),

('backup CMD: ', 'asbackup --host localhost:3000 --namespace test --directory /home/chandan/aero-bk/BACKUP-2021-12-02-11-26-08-FULL')
2021-12-02 05:56:08 GMT [INF] [21442] Starting 100% backup of localhost:3000 (namespace: test, set: [all], bins: [all], after: [none], before: [none]) to /home/chandan/aero-bk/BACKUP-2021-12-02-11-26-08-FULL
2021-12-02 05:56:08 GMT [INF] [21442] [src/main/aerospike/as_cluster.c:124][as_cluster_add_nodes_copy] Add node BB94C5090F13902 127.0.0.1:3000
2021-12-02 05:56:08 GMT [INF] [21442] [src/main/aerospike/as_cluster.c:124][as_cluster_add_nodes_copy] Add node BB952887EF31F02 127.0.0.1:3010
2021-12-02 05:56:08 GMT [INF] [21442] [src/main/aerospike/as_cluster.c:124][as_cluster_add_nodes_copy] Add node BB9E4AFF77ED502 127.0.0.1:3011
2021-12-02 05:56:08 GMT [INF] [21442] Processing 3 node(s)
2021-12-02 05:56:08 GMT [INF] [21442] Node ID             Objects        Replication    
2021-12-02 05:56:08 GMT [INF] [21442] BB94C5090F13902     1322691        2              
2021-12-02 05:56:08 GMT [INF] [21442] BB952887EF31F02     1334496        2              
2021-12-02 05:56:08 GMT [INF] [21442] BB9E4AFF77ED502     1342531        2              
2021-12-02 05:56:08 GMT [INF] [21442] Namespace contains 1999859 record(s)
2021-12-02 05:56:08 GMT [INF] [21442] Directory /home/chandan/aero-bk/BACKUP-2021-12-02-11-26-08-FULL does not exist, creating
2021-12-02 05:56:08 GMT [INF] [21442] Directory /home/chandan/aero-bk/BACKUP-2021-12-02-11-26-08-FULL prepared for backup
2021-12-02 05:56:08 GMT [INF] [21461] Starting backup for node BB94C5090F13902
2021-12-02 05:56:08 GMT [INF] [21462] Starting backup for node BB952887EF31F02
2021-12-02 05:56:08 GMT [INF] [21463] Starting backup for node BB9E4AFF77ED502
2021-12-02 05:56:08 GMT [INF] [21461] Created new backup file /home/chandan/aero-bk/BACKUP-2021-12-02-11-26-08-FULL/BB94C5090F13902_00000.asb
2021-12-02 05:56:08 GMT [INF] [21462] Created new backup file /home/chandan/aero-bk/BACKUP-2021-12-02-11-26-08-FULL/BB952887EF31F02_00000.asb
2021-12-02 05:56:08 GMT [INF] [21463] Created new backup file /home/chandan/aero-bk/BACKUP-2021-12-02-11-26-08-FULL/BB9E4AFF77ED502_00000.asb
2021-12-02 05:56:08 GMT [INF] [21461] No secondary indexes
2021-12-02 05:56:08 GMT [INF] [21461] Backing up 0 UDF file(s)
2021-12-02 05:56:09 GMT [INF] [21460] 0% complete (~2982 KiB/s, ~17834 rec/s, ~171 B/rec)
2021-12-02 05:58:59 GMT [INF] [21460] ~31s remaining
2021-12-02 05:59:09 GMT [INF] [21460] 90% complete (~1854 KiB/s, ~11088 rec/s, ~171 B/rec)
2021-12-02 05:59:09 GMT [INF] [21460] ~17s remaining
2021-12-02 05:59:19 GMT [INF] [21460] 95% complete (~1737 KiB/s, ~10389 rec/s, ~171 B/rec)
2021-12-02 05:59:19 GMT [INF] [21460] ~8s remaining
2021-12-02 05:59:24 GMT [INF] [21461] Completed backup for node BB94C5090F13902, records: 654011, size: 112018581 (~171 B/rec)
2021-12-02 05:59:27 GMT [INF] [21462] Completed backup for node BB952887EF31F02, records: 665738, size: 114025559 (~171 B/rec)
2021-12-02 05:59:29 GMT [INF] [21460] 99% complete (~414 KiB/s, ~2477 rec/s, ~171 B/rec)
2021-12-02 05:59:29 GMT [INF] [21460] ~4s remaining
2021-12-02 05:59:32 GMT [INF] [21463] Completed backup for node BB9E4AFF77ED502, records: 679396, size: 116365166 (~171 B/rec)
2021-12-02 05:59:32 GMT [INF] [21460] Backed up 1999145 record(s), 0 secondary index(es), 0 UDF file(s) from 3 node(s), 342409306 byte(s) in total (~171 B/rec)

@meher @Albot Can you help in any way?

@pgupta @kporter Can you help in any way?

Maybe some records have expired? The backup will not pick those up, but the namespace record count will show them until nsup has run. That’s the only thing off the top of my head. Digging deeper would require full logs and a collectinfo dump from asadm.
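One way to check is to compare the expiration/eviction counters in the namespace statistics before and after the backup, on each node (namespace name assumed to be test; stat names can vary slightly between server versions):

asinfo -v 'namespace/test' -l | grep -E 'expired_objects|evicted_objects'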

All the records had a TTL of 100 days. Can you share how we can take the collectinfo dump?

The command to capture a collectinfo snapshot is:

sudo asadm -e collectinfo

Having said that, this is not a good platform for such a deep investigation. Ideally this would go through Enterprise Support, if available.

The only other thing I can think of would be running on an older version and somehow having a cluster change during the backup (maybe even a node leaving and re-joining the cluster for a very short time), causing ownership of some partitions to be temporarily handed over to another node.

With 2M records spread across the 4096 partitions, each partition would hold around 500 records, which is close to the number of records that seem to be missing. So, if running on an older version and a node briefly left the cluster, a partition could be missed in such a situation. The latest versions make sure that all partitions are backed up even if there was a brief cluster change.
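For the numbers in the log above:

2,000,000 records / 4096 partitions ≈ 488 records per partition
1,999,859 − 1,999,145 = 714 records missing, i.e. roughly one to two partitions’ worth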

Do you care about that data? If you do, don’t use asbackup alone. Use XDR for a hot backup, or rack-awareness for a hot failover, if you’re that sensitive. Meher is probably right in that it’s likely evicted or expired data. If you have monitoring, you can verify this by checking whether the evicted/expired counters increase while the backup is running, or by digging through the logs for that.
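For example, a quick way to look for expirations or evictions in the server log around the backup window (log path assumed to be the default; the exact ticker wording varies by server version):

grep -iE 'expired|evicted' /var/log/aerospike/aerospike.log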


Thanks for the reply @meher @Albot!

I ran the backup again and monitored the aerospike.log file. I found that the scan was not able to get the keys of a few records; the following is the log I observed:

Dec 05 2021 06:19:25 GMT: INFO (scan): (scan.c:618) starting basic scan job 9937657453902658616 {temp_ns:SET1} priority 2, sample-pct 100
Dec 05 2021 06:19:25 GMT: WARNING (drv_ssd): (drv_ssd.c:1194) read: bad block magic offset 18749440
Dec 05 2021 06:19:25 GMT: WARNING (drv_ssd): (drv_ssd.c:1231) load_n_bins: failed ssd_read_record()
Dec 05 2021 06:19:25 GMT: WARNING (drv_ssd): (drv_ssd.c:1194) read: bad block magic offset 18749440
Dec 05 2021 06:19:25 GMT: WARNING (drv_ssd): (drv_ssd.c:1249) load_bins: failed ssd_read_record()
Dec 05 2021 06:19:25 GMT: WARNING (drv_ssd): (drv_ssd.c:1194) read: bad block magic offset 18749440
Dec 05 2021 06:19:25 GMT: WARNING (drv_ssd): (drv_ssd.c:1280) get_key: failed ssd_read_record()
Dec 05 2021 06:19:25 GMT: WARNING (proto): (proto.c:329) can't get key - skipping record
Dec 05 2021 06:19:25 GMT: WARNING (drv_ssd): (drv_ssd.c:1194) read: bad block magic offset 18114560
Dec 05 2021 06:19:25 GMT: WARNING (drv_ssd): (drv_ssd.c:1231) load_n_bins: failed ssd_read_record()
Dec 05 2021 06:19:25 GMT: WARNING (drv_ssd): (drv_ssd.c:1194) read: bad block magic offset 18114560
Dec 05 2021 06:19:25 GMT: WARNING (drv_ssd): (drv_ssd.c:1249) load_bins: failed ssd_read_record()
Dec 05 2021 06:19:25 GMT: WARNING (drv_ssd): (drv_ssd.c:1194) read: bad block magic offset 18114560
Dec 05 2021 06:19:25 GMT: WARNING (drv_ssd): (drv_ssd.c:1280) get_key: failed ssd_read_record()
Dec 05 2021 06:19:25 GMT: WARNING (proto): (proto.c:329) can't get key - skipping record
Dec 05 2021 06:19:25 GMT: WARNING (drv_ssd): (drv_ssd.c:1194) read: bad block magic offset 18167808
Dec 05 2021 06:19:25 GMT: WARNING (drv_ssd): (drv_ssd.c:1231) load_n_bins: failed ssd_read_record()
Dec 05 2021 06:19:25 GMT: WARNING (drv_ssd): (drv_ssd.c:1194) read: bad block magic offset 18167808
Dec 05 2021 06:19:25 GMT: WARNING (drv_ssd): (drv_ssd.c:1249) load_bins: failed ssd_read_record()
Dec 05 2021 06:19:25 GMT: WARNING (drv_ssd): (drv_ssd.c:1194) read: bad block magic offset 18167808
Dec 05 2021 06:19:25 GMT: WARNING (drv_ssd): (drv_ssd.c:1280) get_key: failed ssd_read_record()
Dec 05 2021 06:19:25 GMT: WARNING (proto): (proto.c:329) can't get key - skipping record

Is this the cause of the missing records, and is there a fix?

In my experience, that’s a bad drive. Sometimes, if you pull the offending node and zero out (not trim) the drive, it may buy some time. Check the drive health: endurance, SMART, etc. If it happens again after that, RMA/replace the drive.
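A minimal sketch of those steps, assuming a SATA/SAS device at /dev/sdb (the device path is illustrative; for NVMe drives, nvme smart-log gives the equivalent health data):

sudo smartctl -a /dev/sdb
sudo dd if=/dev/zero of=/dev/sdb bs=1M

Only run the dd zeroing after the Aerospike service on that node has been stopped and the drive taken out of service, since it wipes the device.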

Yeah, I concur.
