One server goes down during restore

Hi,

We have an environment of three servers running Aerospike version 3.6.1. The external server performing the restore has tools version 3.6.1 (aerospike-tools-3.6.1).

During a restore from a directory with the standard command options (no extra options for threads, etc.), one of the servers shut down with the error shown in the log excerpt below.
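For reference, the restore was invoked in its basic form, roughly like this (the host and directory shown are placeholders, not the actual values used):

asrestore -h <seed-node> -d /path/to/backup-directory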

Nov 11 2015 09:35:03 GMT: INFO (info): (thr_info.c::4873) namespace frequency_capping: disk inuse: 0 memory inuse: 0 (bytes) sindex memory inuse: 0 (bytes) avail pct 99
Nov 11 2015 09:35:03 GMT: INFO (info): (thr_info.c::4898) namespace users: disk inuse: 131418747392 memory inuse: 15450719936 (bytes) sindex memory inuse: 0 (bytes) avail pct 95 cache-read pct 0.00
Nov 11 2015 09:35:03 GMT: INFO (info): (thr_info.c::4918)    partitions: actual 4080 sync 4089 desync 0 zombie 0 wait 0 absent 4119
Nov 11 2015 09:35:03 GMT: INFO (info): (hist.c::137) histogram dump: reads (2 total) msec
Nov 11 2015 09:35:03 GMT: INFO (info): (hist.c::163)  (00: 0000000002)
Nov 11 2015 09:35:03 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (128586715 total) msec
Nov 11 2015 09:35:03 GMT: INFO (info): (hist.c::154)  (00: 0128575446) (01: 0000010748) (02: 0000000293) (03: 0000000139)
Nov 11 2015 09:35:03 GMT: INFO (info): (hist.c::163)  (04: 0000000089)
Nov 11 2015 09:35:03 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Nov 11 2015 09:35:03 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Nov 11 2015 09:35:03 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Nov 11 2015 09:35:03 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Nov 11 2015 09:35:05 GMT: WARNING (as): (signal.c::161) SIGSEGV received, aborting Aerospike Community Edition build 3.6.1 os el6
Nov 11 2015 09:35:05 GMT: WARNING (as): (signal.c::163) stacktrace: found 11 frames
Nov 11 2015 09:35:05 GMT: WARNING (as): (signal.c::163) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x62) [0x47461f]
Nov 11 2015 09:35:05 GMT: WARNING (as): (signal.c::163) stacktrace: frame 1: /lib64/libc.so.6() [0x3624e326a0]
Nov 11 2015 09:35:05 GMT: WARNING (as): (signal.c::163) stacktrace: frame 2: /usr/bin/asd(write_local_dim_single_bin+0xaa) [0x4b5385]
Nov 11 2015 09:35:05 GMT: WARNING (as): (signal.c::163) stacktrace: frame 3: /usr/bin/asd(write_local+0x27d) [0x4b7496]
Nov 11 2015 09:35:05 GMT: WARNING (as): (signal.c::163) stacktrace: frame 4: /usr/bin/asd() [0x4b88e4]
Nov 11 2015 09:35:05 GMT: WARNING (as): (signal.c::163) stacktrace: frame 5: /usr/bin/asd(as_rw_start+0x2af) [0x4ba980]
Nov 11 2015 09:35:05 GMT: WARNING (as): (signal.c::163) stacktrace: frame 6: /usr/bin/asd(process_transaction+0x4eb) [0x4be6d2]
Nov 11 2015 09:35:05 GMT: WARNING (as): (signal.c::163) stacktrace: frame 7: /usr/bin/asd(thr_tsvc_process_or_enqueue+0x3e) [0x4bf345]
Nov 11 2015 09:35:05 GMT: WARNING (as): (signal.c::163) stacktrace: frame 8: /usr/bin/asd(thr_demarshal+0x7b1) [0x488a4c]
Nov 11 2015 09:35:05 GMT: WARNING (as): (signal.c::163) stacktrace: frame 9: /lib64/libpthread.so.0() [0x3625207a51]
Nov 11 2015 09:35:05 GMT: WARNING (as): (signal.c::163) stacktrace: frame 10: /lib64/libc.so.6(clone+0x6d) [0x3624ee89ad]

This triggered migrations on the two remaining servers while the restore was still running. Questions:

1- Will there be any integrity issues with the restored data, given that one of the servers crashed during the restore?

2- From the errors, do you have an idea what the issue could have been?

Could you confirm that you are restoring into a namespace that is configured with data-in-memory and single-bin?

Any info you can share on the data being restored? Was it also from a single bin namespace? Can you describe the namespaces / data you had in the backup?

Hi Meher,

My apologies, I didn't reply to this a year ago. We recently had to do another restore and faced the exact same issue with the same namespace. The answers to your questions:

Could you confirm that you are restoring into a namespace that is configured with data-in-memory and single-bin? Yes

Was it also from a single bin namespace? Yes

Can you describe the namespaces / data you had in the backup?

This is the namespace configuration:

namespace NameSpaceName {
        replication-factor 2
        high-water-memory-pct 75
        high-water-disk-pct 75
        memory-size 22G
        single-bin true
        data-in-index true
        default-ttl 0
 
        storage-engine device {
                file /opt/aerospike/NameSpaceName/NameSpaceName.data
                filesize 300G
                data-in-memory true
 
        }
}

I think it is worth noting that this namespace contains some deleted sets. Also note that the other namespaces were restored with no errors. We are still using version 3.6.1.

We have identified this as an issue (internally tracked under Jira AER-4578). It should be addressed in our next release (3.10.1). It is caused, in rare situations, by the last single bin in the shared memory block being read 3 bytes beyond the valid allocated memory.

Thanks Meher,

We updated the target Aerospike servers to 3.10.1, and the restore went fine for the namespace which had the problem.

However, we are facing another issue with a different namespace. Its backup was taken from a server running version 3.6.1.

During the restore we received the following error:

[36422] Error while storing record - code 4: AEROSPIKE_ERR_REQUEST_INVALID at src/main/aerospike/as_command.c:608

and within the log, we found the following errors:

Nov 25 2016 10:37:08 GMT: WARNING (rw): (write.c:795) write_master: invalid ttl 4077201076
Nov 25 2016 10:37:08 GMT: WARNING (rw): (write.c:795) write_master: invalid ttl 2097754724

The namespace was defined as follows (on both the source and target servers):

namespace NAmespace_name {
        replication-factor 2
        high-water-memory-pct 75
        high-water-disk-pct 75
        memory-size 2G
        single-bin true
        data-in-index true
        default-ttl 30

        storage-engine device {
                file /opt/aerospike/NAmespace_name/NAmespace_name.data
                filesize 64G
                data-in-memory true
        }
}

Can you advise?

Extremely long TTLs are often the result of a logical error. The TTLs in those warnings range from roughly 66 to 130 years; does your application actually have this requirement? Large TTLs cause the eviction algorithm to have poor resolution when determining which time slices to evict from. TTLs are now capped at 10 years (315360000 seconds), ensuring the minimum eviction resolution is about 37 days.
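For reference, here is the arithmetic behind those figures as a minimal Python sketch; the 100-bucket eviction histogram used in the last line is an assumption inferred from the ~37-day figure, not something stated above:

SECONDS_PER_YEAR = 365 * 24 * 3600        # 31,536,000
MAX_TTL = 315360000                       # the 10-year cap

for ttl in (4077201076, 2097754724):      # values from the warnings above
    print("ttl %d s ~= %.1f years" % (ttl, ttl / float(SECONDS_PER_YEAR)))
# -> roughly 129.3 and 66.5 years

BUCKETS = 100                             # assumed eviction-histogram resolution
print("min eviction step ~= %.1f days" % (MAX_TTL / 86400.0 / BUCKETS))
# -> ~36.5 days, i.e. the ~37 days mentioned above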

To restore these records, the backup files will need to be modified, changing any TTL greater than 315360000 to 315360000.
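As a rough sketch of how that edit could be scripted in Python: this assumes the TTL appears on "+ t <number>" metadata lines in the text backup format, which you should verify against your own backup files (and keep the originals) before running anything like it.

import fileinput

MAX_TTL = 315360000  # the 10-year cap

# Read backup text from stdin (or from file arguments) and write the
# capped version to stdout; only "+ t <number>" lines are touched.
for line in fileinput.input():
    if line.startswith("+ t "):
        ttl = int(line.split()[2])
        if ttl > MAX_TTL:
            line = "+ t %d\n" % MAX_TTL
    print(line, end="")

For example: python cap_ttl.py < backup.asb > backup_capped.asb (the script and file names here are placeholders).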