What is the "drive set with unmatched headers" error?

FAQ - Why do I see an assertion with “drive set with unmatched headers”?

Details

The Aerospike service fails to start and logs a “drive set with unmatched headers” message:

Jun 04 2019 18:35:46.778 GMT: FAILED ASSERTION (drv_ssd): (drv_ssd.c:2984) {test} drive set with unmatched headers - devices /dev/disk/by-id/google-local-ssd-0-part1 & /dev/disk/by-id/google-local-ssd-1-part1 have different signatures

Or when using data files:

Jun 07 2019 20:20:04 GMT: FAILED ASSERTION (drv_ssd): (drv_ssd.c:2984) {test} drive set with unmatched headers - devices /opt/aerospike/test1.dat & /opt/aerospike/test2.dat have different signatures

Older versions may look like this:

Jul 18 2016 16:05:08 GMT: CRITICAL (drv_ssd): (drv_ssd.c::3353) namespace test: drive set with unmatched headers - devices /data/test01 & /data/test02 have different signatures
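When triaging many log files, it can help to pull the namespace and the two device paths out of these lines automatically. The following is a minimal sketch that assumes the message layouts match the three samples above; the function name parse_unmatched_headers and the regular expression are illustrative and not part of any Aerospike tooling.

```python
import re

# Covers both the newer "{test}" layout and the older "namespace test:" layout
# shown in the log samples. Other server versions may format the line differently.
UNMATCHED_HEADERS = re.compile(
    r"(?:\{(?P<ns_new>[^}]+)\}|namespace (?P<ns_old>[^:]+):)\s+"
    r"drive set with unmatched headers - devices "
    r"(?P<dev_a>\S+) & (?P<dev_b>\S+) have different signatures"
)


def parse_unmatched_headers(line):
    """Return (namespace, device_a, device_b), or None if the line does not match."""
    match = UNMATCHED_HEADERS.search(line)
    if match is None:
        return None
    namespace = match.group("ns_new") or match.group("ns_old")
    return namespace, match.group("dev_a"), match.group("dev_b")
```

Applied to the data-file sample above, this would return the namespace test together with /opt/aerospike/test1.dat and /opt/aerospike/test2.dat, which are the two devices to inspect first.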

Answer

This message indicates that the server found something unexpected while processing the headers of the devices configured for the namespace: the devices report different signatures. The most likely cause is corrupted disk data, but there are other possible causes as well:

1. Misconfiguration of disk partitions

For example, partitions may overlap, i.e. the start of the second partition falls inside the previous partition. As data is written to the first partition, it overwrites the header at the start of the second. Use the “-l” option of either sfdisk or parted to verify that the partitions are configured correctly. Here is an example of a healthy partition configuration:

sfdisk -l /dev/sdc

Disk /dev/sdc: 36481 cylinders, 255 heads, 63 sectors/track
Units: cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

  Device Boot Start     End   #cyls    #blocks   Id  System
/dev/sdc1        248+  13373-  13125- 105421824   83  Linux
/dev/sdc2      13373+  26746-  13374- 107421696   83  Linux
/dev/sdc3      26746+  36481-   9735-  78192640   83  Linux
/dev/sdc4          0+    248-    249-   1998848   83  Linux
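The overlap check described above can also be done programmatically once the start and end values have been copied out of the listing. This is a minimal sketch, not an Aerospike or sfdisk utility: find_overlaps is a hypothetical helper, and it assumes each partition is given as (name, start, end) in one consistent unit with the end treated as exclusive. The "+" and "-" rounding markers from the sfdisk output are dropped, so the cylinder values below are approximate boundaries.

```python
def find_overlaps(partitions):
    """Return (name_a, name_b) pairs whose [start, end) ranges overlap.

    partitions: iterable of (name, start, end) tuples in one consistent unit.
    """
    ordered = sorted(partitions, key=lambda p: p[1])
    overlaps = []
    for i, (name_a, _start_a, end_a) in enumerate(ordered):
        for name_b, start_b, _end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start, so no later partition can overlap name_a
            overlaps.append((name_a, name_b))
    return overlaps


# Cylinder values taken from the healthy listing above (rounding markers dropped).
healthy = [
    ("/dev/sdc1", 248, 13373),
    ("/dev/sdc2", 13373, 26746),
    ("/dev/sdc3", 26746, 36481),
    ("/dev/sdc4", 0, 248),
]
print(find_overlaps(healthy))
```

For the healthy layout the result is an empty list. If /dev/sdc2 started inside /dev/sdc1 instead, the pair would be reported, which is the situation where writes to the first partition can damage the header of the second.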

2. Other types of disk corruption

These include corrupted shadow devices, an ungraceful power-down, a software malfunction, etc.

3. Misconfiguration of aerospike.conf (applicable for server versions prior to 4.2)

Refer to the Changing Device Order in a Namespace knowledge base article for this situation. The disk order can also be swapped on a machine reboot, e.g. a disk can become /dev/sdb instead of /dev/sdc. Refer to the Using WWID Device Reference article for details.

Note: For server versions 4.2 and above, Aerospike can handle a reordering of devices in the configuration across restarts. If a device previously belonged to a different namespace, though, the server will fail to start with the following message:

Jun 07 2019 22:36:45 GMT: FAILED ASSERTION (drv_ssd): (drv_ssd.c:2246) /dev/vdc: previous namespace nsSSD now test - check config or erase device

With these newer versions, there should no longer be any need to use WWID device references.

Solution

If the issue affects only one node (or at most replication-factor - 1 nodes), it might be mitigated by restarting the affected node(s) empty, as follows:

  1. Erase the data on all Aerospike data disks (see Zeroize multiple SSDs simultaneously). If using files as storage, simply delete the files.
  2. Fix your configuration or re-partition the drive(s) correctly.
  3. Start the Aerospike server.
  4. Wait for migrations to finish.
  5. If the node still does not recover, erase the shadow devices as well and repeat the steps above.

This cold starts Aerospike on the node without any data, and migrations then re-populate the data. Since the node starts empty (as when a new node is added), migrations might take a while.
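Waiting for migrations to finish is usually a matter of polling the server statistics until the remaining-partition counters reach zero. The sketch below only parses a statistics string that has already been collected; it assumes the semicolon-separated key=value format of info output, and the two counter names in REMAINING_KEYS are assumptions that should be verified against the metrics reference for the server version in use. Both function names are hypothetical.

```python
def parse_info(text):
    """Split a 'key=value;key=value' info string into a dict of strings."""
    stats = {}
    for pair in text.strip().split(";"):
        key, sep, value = pair.partition("=")
        if sep:  # skip fragments that carry no '='
            stats[key] = value
    return stats


# Assumed statistic names; they differ between server versions.
REMAINING_KEYS = (
    "migrate_tx_partitions_remaining",
    "migrate_rx_partitions_remaining",
)


def migrations_finished(stats, keys=REMAINING_KEYS):
    """True only if every listed counter is present and equal to zero."""
    return all(stats.get(key) == "0" for key in keys)
```

Treating a missing counter as "not finished" is a deliberately conservative choice: if the statistic names do not match the server version, the check fails rather than reporting that migrations are complete.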

In the unlikely event that the issue occurs on multiple nodes (more than replication-factor - 1), the mitigation steps become more complicated, and restarting multiple nodes empty may result in data loss. In such situations, we recommend restoring from an XDR cluster, if one is available, or from a backup file.
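The threshold used here follows from the replication factor: with replication-factor copies of each partition, losing up to replication-factor - 1 nodes should still leave at least one copy elsewhere in the cluster. A minimal sketch of that rule, assuming the cluster was otherwise healthy and fully replicated before the failure (can_restart_empty is a hypothetical helper, not an Aerospike API):

```python
def can_restart_empty(affected_nodes, replication_factor):
    """Return True if restarting the affected nodes empty should not lose data.

    Assumes every partition had replication_factor copies on distinct nodes
    before the failure, so at most replication_factor - 1 copies are lost.
    """
    if affected_nodes < 0 or replication_factor < 1:
        raise ValueError("affected_nodes must be >= 0 and replication_factor >= 1")
    return affected_nodes <= replication_factor - 1
```

With a replication factor of 2, one affected node passes the check and two do not; with a replication factor of 1 there is no replica to migrate from, so even a single affected node fails the check and a restore from XDR or backup is the safer path.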

The above solution might not work if the device corruption cannot be repaired simply by erasing the device. If the issue persists after trying the steps above, we recommend replacing the drives or replacing the node in the cluster, and contacting the drive vendor for further diagnostics.

Keywords

DRIVE UNMATCHED HEADERS SIGNATURES STORAGE DEVICE ORDER

Timestamp

June 7 2019