FAQ - What is the expected behaviour when an Aerospike node experiences an SSD hardware failure?


Detail

When an SSD experiences a catastrophic hardware failure, what is the behaviour of the Aerospike node?

Answer

Aerospike's behavior depends on which type of operation fails as a result of the hardware failure.

- Regular client read

A failed regular read does not cause the server to abort. The server reports the error AS_PROTO_RESULT_FAIL_UNKNOWN, which is mapped to AEROSPIKE_ERR_SERVER on the client side.

- Read as part of a write transaction

In Aerospike, a write transaction reads the previous value of the record before updating it, unless it is a brand new record or you specifically write with the ‘replace’ flag. If that read fails, the write fails, but the server doesn’t SIGABRT because data integrity is preserved; the error is reported back to the client. The client can then decide what to do, for example read the data from a replica and issue a replace.

The flow of logs would be similar to this:

Jan 18 2017 00:19:40 GMT: WARNING (drv_ssd): (drv_ssd.c:1185) /dev/sdj1: read failed (-1): size 37376: errno 5 (Input/output error)
Jan 18 2017 00:19:40 GMT: WARNING (drv_ssd): (drv_ssd.c:1239) load_n_bins: failed ssd_read_record()
Jan 18 2017 00:19:40 GMT: WARNING (rw): (write.c:1412) {Entities} write_master: failed as_storage_rd_load_n_bins()<Digest>:0x0fa89634728ef4b27ecfighace630822e826e7b51d

If this read fails, the only side effect is that the specific write block is skipped and not defragmented. The server reports the error AS_PROTO_RESULT_FAIL_UNKNOWN, which is mapped to AEROSPIKE_ERR_SERVER on the client side.

If the client retries the same key, the server may return the same error (if the error is not recoverable at the device interface level). The node should therefore be monitored and, if the errors persist, potentially taken out of the cluster gracefully.

For server versions prior to 3.9.1, the above failure also results in an Aerospike process shutdown through a SIGABRT.
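
The client-side recovery mentioned above (read from a replica, then issue a replace) could be sketched roughly as follows. This is a minimal illustration, not the Aerospike client API: ServerError stands in for whatever exception your client raises for AEROSPIKE_ERR_SERVER, and update, read_replica and replace are hypothetical callables that you would wire to your client library. Note that the read-merge-replace sequence is not atomic, so concurrent writers need separate handling.

```python
class ServerError(Exception):
    """Hypothetical stand-in for the client's AEROSPIKE_ERR_SERVER exception."""


def write_with_replica_fallback(update, read_replica, replace, key, changes):
    """Try a normal update; on a server error, rebuild the record from a replica.

    update, read_replica and replace are hypothetical callables supplied by the
    caller; they are not functions of any real client library.
    """
    try:
        update(key, changes)
        return "updated"
    except ServerError:
        # The master could not read the previous copy of the record.
        current = read_replica(key)       # fetch the surviving copy from a replica
        merged = {**current, **changes}   # apply the intended changes on top of it
        replace(key, merged)              # a replace does not read the old record
        return "replaced"
```

Returning a status string lets the caller count how often the fallback path is taken, which is itself a useful signal that a device is failing.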

- Write

Aerospike is designed to abort (SIGABRT, to be precise) when the integrity of the data is potentially compromised. For example, if a large-block write fails (these are the asynchronous writes issued when the streaming write buffers are full and flushed), the server shuts down through a SIGABRT.

Jan 03 2016 13:59:07 GMT: CRITICAL (drv_ssd): (drv_ssd.c:ssd_flush_swb:1329) /dev/sdb1: DEVICE FAILED write: errno 5 (Input/output error)
Jan 03 2016 13:59:07 GMT: CRITICAL (drv_ssd): (drv_ssd.c:ssd_flush_swb:1329) /dev/sdb1: DEVICE FAILED write: errno 5 (Input/output error)
Jan 03 2016 13:59:07 GMT: WARNING (as): (signal.c::94) SIGABRT received, aborting Aerospike Enterprise Edition build 3.6.3 os el6

With the node shut down, the cluster rebalances data and continues running in a manner transparent to the application, rather than continuing to throw I/O errors. This prioritizes data integrity without affecting cluster uptime.
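
To monitor for the device errors described in this article, the server log can be scanned for the two message shapes shown in the samples. The sketch below is one possible approach; the regular expressions are derived only from the sample log lines in this article and may need adjusting for other server versions, and summarize_device_errors is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns based on the sample log lines in this article (an assumption:
# other server versions may format these messages differently).
READ_FAILED = re.compile(r"WARNING \(drv_ssd\): .*? (/dev/\S+): read failed")
WRITE_FAILED = re.compile(r"CRITICAL \(drv_ssd\): .*? (/dev/\S+): DEVICE FAILED write")


def summarize_device_errors(lines):
    """Count read and write failures per device path found in the log lines."""
    reads, writes = Counter(), Counter()
    for line in lines:
        match = READ_FAILED.search(line)
        if match:
            reads[match.group(1)] += 1
            continue
        match = WRITE_FAILED.search(line)
        if match:
            writes[match.group(1)] += 1
    return reads, writes
```

Repeated read failures on one device suggest that the node should be taken out gracefully. A write failure entry means the node has most likely already aborted.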

- Other operations, e.g. writing to the server log or the XDR digest log

If writes to the server log fail due to a hardware failure, the failure is ignored and the Aerospike process continues to run. If a digest log read or write fails, Aerospike retries.

- System Metadata read/write

If a system metadata operation, such as a UDF update or a security permission update, fails due to a hardware issue, a warning is logged. A failed read is treated as if the metadata were NULL. A failed write is ignored.

Notes

  • If a new (same size) SSD is installed and is configured at the same place (same order) in the configuration file as the previously failed one, it is possible to warm start the node. The missing data on the fresh device will be populated through migrations. (Aerospike uses a hash function to assign records to different devices, therefore the order of the SSD devices in the configuration is important.)
  • The scenario described above must be considered distinct from a situation where an SSD is experiencing overloads or difficulties but has not catastrophically failed. In those situations the issue may manifest as latency or other errors. If Aerospike can read and write to the disk, however slowly, it will attempt to keep running.
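
The first note says that device order matters because records are assigned to devices by a hash. The toy model below shows why. It is an assumption-laden illustration only: the SHA-1 modulo rule is a stand-in and not Aerospike's actual placement function, device_for is a hypothetical helper, and /dev/sdc1 is an invented second device.

```python
import hashlib


def device_for(key, devices):
    """Pick a device by hashing the key and indexing into the configured list.

    Hypothetical placement rule for illustration; not Aerospike's real hash.
    """
    digest = hashlib.sha1(key.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(devices)
    return devices[index]


original = ["/dev/sdb1", "/dev/sdc1"]   # order as first configured
reordered = ["/dev/sdc1", "/dev/sdb1"]  # same devices, swapped order

for key in ("user:1", "user:2", "user:3"):
    # The hash index is unchanged, but it now points at a different device.
    print(key, device_for(key, original), device_for(key, reordered))
```

In this model every key resolves to a different device once the list is reordered, which is why the replacement SSD must be configured in the same position as the failed one.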

Keywords

DISK FAIL SSD CRASH NODE

Timestamp

2/1/17