FAQ - Why does a misconfiguration cause a cold start?

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Detail

When certain parameters are misconfigured in aerospike.conf, the startup fails, as expected. However, the subsequent start is a cold start rather than a fast restart (warm start). Why does this happen?

Answer

The answer to this question lies in the criteria Aerospike uses to determine whether a warm start is possible. A warm start or fast restart occurs when the server is able to find the primary index in shared memory. If it cannot find the primary index, Aerospike determines that it must be rebuilt from disk and so cold starts. The blocks that Aerospike looks for in shared memory are base and treex. Once these are located during startup they are discarded. On a clean shutdown, one of the shutdown tasks is to re-create those blocks so that subsequent startups can locate them.

If the misconfigured value is malformed (for example, address set to something that is neither a valid IP address nor the string any), the startup fails during basic validation, before Aerospike attaches to shared memory. In that case, base and treex remain in place because they were never discarded.
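The format-level check described above can be illustrated with a short Python sketch. This is not Aerospike's validation code: the function names are hypothetical, and the accepted forms (an IP address or the string any) follow only what this article states, so the real server may accept other forms. The point is that a value such as 1.2.3.4 passes a format check, and only an actual bind attempt reveals whether the host can use it.

```python
import ipaddress
import socket


def is_well_formed_address(value: str) -> bool:
    """Format-level check only (hypothetical helper): accept the
    literal 'any' or anything that parses as an IPv4/IPv6 address."""
    if value == "any":
        return True
    try:
        ipaddress.ip_address(value)
    except ValueError:
        return False
    return True


def can_bind(value: str, port: int) -> bool:
    """Late check (hypothetical helper): try to bind a TCP socket.
    A well-formed address that is not assigned to this host fails here."""
    host = "0.0.0.0" if value == "any" else value
    family = socket.AF_INET6 if ":" in host else socket.AF_INET
    with socket.socket(family, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True
```

A value like "not-an-address" is rejected by is_well_formed_address straight away, whereas "1.2.3.4" is accepted by it and would only be rejected by can_bind on a host that does not own that address.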

If the value is in the right general format but is still incorrect (for example, address set to 1.2.3.4, a well-formed IP address that the server cannot bind to), the startup passes basic validation and may have attached to shared memory by the time the error becomes apparent. In that scenario, base and treex have already been found and discarded. Aerospike then receives a SIGUSR1 and aborts the startup, which leaves no opportunity for base and treex to be re-created. An example is shown below:

Jun 26 2020 10:40:44 GMT: WARNING (socket): (socket.c:773) Error while binding to 1.2.3.4:4333: 99 (Cannot assign requested address)
Jun 26 2020 10:40:44 GMT: CRITICAL (service): (service.c:186) couldn't initialize service socket
Jun 26 2020 10:40:44 GMT: WARNING (as): (signal.c:213) SIGUSR1 received, aborting Aerospike Enterprise Edition build 5.0.0.8 os ubuntu18.04
Jun 26 2020 10:40:44 GMT: WARNING (as): (log.c:604) stacktrace: registers: rax 0000000000000000 rbx 000000000000000a rcx 00007f093077a727 rdx 0000000000000000 rsi 00007ffe4bab3ff0 rdi 0000000000000002 rbp 0000000000000001 rsp 00007ffe4bab3ff0 r8 0000000000000000 r9 00007ffe4bab3ff0 r10 0000000000000008 r11 0000000000000246 r12 00000000000010ed r13 000000000000000e r14 000055da8c393c30 r15 000055da8c393c30 rip 00007f093077a727
Jun 26 2020 10:40:44 GMT: WARNING (as): (log.c:617) stacktrace: found 9 frames: 0x2ad27b 0xf3698 0x7f093077a890 0x7f093077a727 0x2acbe7 0x104451 0x6b217 0x7f092f430b97 0x6baca offset 0x55da8bf9b000

The log excerpt above shows that the parameter was specified in the right format, and so passed basic checks, but was ultimately incorrect and caused a failure. By the time the failure was encountered, the index had been found and base and treex were gone. An abort of any kind is, by definition, unclean, so base and treex were not re-created. The subsequent startup then reports:

Jun 26 2020 10:41:51 GMT: INFO (namespace): (namespace_ee.c:351) {bar} found no valid persistent memory blocks, will cold start
Jun 26 2020 10:41:51 GMT: INFO (namespace): (namespace_ee.c:383) {bar} beginning cold start

The valid persistent memory blocks in question are base and treex.
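The sequence described in this answer can be summarised as a toy model in Python. It is only a sketch of the behaviour as this article describes it, not the server's implementation: the class, function names, and return strings are hypothetical, and the real startup has many more stages in a more complex order.

```python
REQUIRED_BLOCKS = {"base", "treex"}


class SharedMemory:
    """Toy stand-in for the shared memory segments on a host."""

    def __init__(self):
        self.blocks = set()


def clean_shutdown(shm: SharedMemory) -> None:
    # A clean shutdown re-creates the blocks for the next startup.
    shm.blocks |= REQUIRED_BLOCKS


def start(shm: SharedMemory, well_formed: bool, usable: bool) -> str:
    # Stage 1: basic validation. Failing here leaves shared memory untouched.
    if not well_formed:
        return "failed-early"
    # Stage 2: look for the blocks and discard them once located.
    if REQUIRED_BLOCKS <= shm.blocks:
        shm.blocks -= REQUIRED_BLOCKS
        mode = "warm"
    else:
        mode = "cold"
    # Stage 3: a late-discovered error aborts without re-creating the blocks.
    if not usable:
        return "aborted"
    return mode


if __name__ == "__main__":
    shm = SharedMemory()
    clean_shutdown(shm)
    print(start(shm, well_formed=True, usable=False))  # late failure
    print(start(shm, well_formed=True, usable=True))   # next start
```

In this model, an early validation failure leaves both blocks in place, so the next start is warm. A late failure discards the blocks first, so the start that follows the abort is cold.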

Notes

  • Various consequences of cold start are discussed in the following article.
  • Versions prior to 4.2 handle those shared memory blocks differently and the error message at startup would be different.
  • A clean shutdown followed by a whole instance reboot would also remove the shared memory blocks and force a cold start, with the same "found no valid persistent memory blocks" message. To distinguish a clean shutdown followed by an instance reboot from a non-graceful shutdown, look for a "prior shutdown not clean" log message, which indicates the latter:
    Jun 26 2020 10:41:51 GMT: INFO (drv_ssd): (drv_ssd.c:2920) {test} device /opt/aerospike/data/test.dat prior shutdown not clean
  • Future Aerospike versions may be more forgiving for such ‘late discovered’ misconfiguration and still allow for fast restart.
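
The log check described in the notes above can be automated with a small script. The following Python sketch relies only on the two message texts quoted in this article. The function name is hypothetical, and matching the two messages by namespace is an assumption, as is feeding it the log lines from a single startup. Message formats may differ between server versions.

```python
import re

# Message texts taken from the log excerpts in this article.
COLD_START = re.compile(
    r"\{(?P<ns>[^}]+)\} found no valid persistent memory blocks, will cold start"
)
UNCLEAN = re.compile(
    r"\{(?P<ns>[^}]+)\} device .* prior shutdown not clean"
)


def classify_cold_starts(log_lines):
    """Return {namespace: True/False} for each namespace that cold started.

    True means a 'prior shutdown not clean' message was seen for that
    namespace (non-graceful shutdown). False means no such message was
    seen, which is consistent with a clean shutdown followed by a reboot.
    """
    cold = set()
    unclean = set()
    for line in log_lines:
        match = COLD_START.search(line)
        if match:
            cold.add(match.group("ns"))
        match = UNCLEAN.search(line)
        if match:
            unclean.add(match.group("ns"))
    return {ns: ns in unclean for ns in cold}
```

It could be used as classify_cold_starts(open(path)), where path points to a log file holding the output of one startup.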

Keywords

WRONG PARAMETER COLD START WARM FAST START ABORT

Timestamp

June 2020