How To perform basic monitoring and tests on disk health

Context

When using databases, or other long-term storage solutions, it is important to perform periodic disk checks. Disks (SSD and HDD alike) degrade over time, and unused sectors may develop faults that go unnoticed until they are needed. This article explains basic methods for testing disks for damage.

Method 1 - SMART tests

Most disks support SMART capabilities. In order to utilise SMART, the smartmontools package must be installed. On Ubuntu, for example, the installation would be performed using:

sudo apt-get install smartmontools

Run the following command to test for SMART capabilities:

$ sudo smartctl -i /dev/sdc
...
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
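
The capability check can also be scripted. The following is a minimal sketch only: smart_support_state is a hypothetical helper (not part of smartmontools) that reads smartctl -i output on standard input and assumes the ATA-style "SMART support is:" lines shown above. Other device types may report their status differently.

```shell
# Hypothetical helper: classify the output of `smartctl -i` read from stdin.
# Exit status: 0 = SMART available and enabled,
#              1 = available but not enabled,
#              2 = not available, or the expected lines were not found.
smart_support_state() {
    awk '
        /^SMART support is:[ \t]*Available/ { available = 1 }
        /^SMART support is:[ \t]*Enabled/   { enabled = 1 }
        END {
            if (available && enabled) exit 0
            if (available) exit 1
            exit 2
        }'
}
```

For example, sudo smartctl -i /dev/sdc | smart_support_state returns 1 when SMART is available but disabled, in which case sudo smartctl -s on /dev/sdc enables it.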

SMART offers two basic self-tests, ‘short’ and ‘long’. A short test checks the electronics and mechanics (if any) and performs a quick read test on a small portion of the disk. A long test is more thorough and will test all sectors of the disk for readability and (if supported) parity errors.

Note that the tests can be taxing and may impact the server. It is therefore advisable, before performing SMART tests, to quiesce the node and set migrate-fill-delay so that the node does not receive traffic for the duration of the tests. Once the tests have completed, the node can be brought back to take transactions using the quiesce-undo command.
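
The quiesce sequence might look like the sketch below. This is an illustration under assumptions, not a definitive procedure: the function names, the DRY_RUN switch and the 3600 second default are hypothetical, and the asinfo info commands shown (set-config, quiesce:, recluster:, quiesce-undo:) should be checked against the documentation for the server version in use. In particular, migrate-fill-delay normally needs to be set on every node, and recluster is only acted upon by the principal node, so in practice these are often issued cluster-wide through asadm.

```shell
# Print commands instead of running them when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Take the local node out of service before disk tests.
# $1: migrate-fill-delay in seconds (assumed default; size it to the test duration).
quiesce_node() {
    delay="${1:-3600}"
    run asinfo -v "set-config:context=service;migrate-fill-delay=${delay}"
    run asinfo -v "quiesce:"
    # Assumption: the recluster request must reach the principal node to take effect.
    run asinfo -v "recluster:"
}

# Bring the local node back into service after the tests.
unquiesce_node() {
    run asinfo -v "quiesce-undo:"
    run asinfo -v "recluster:"
}
```

Running DRY_RUN=1 quiesce_node 7200 prints the three commands without contacting the server, which is a convenient way to review them first.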

To perform a short test: sudo smartctl -t short /dev/sdc

To perform a long test: sudo smartctl -t long /dev/sdc

The tests run in the background on the drive. Once a test has completed, view the test results and overall disk health with:

$ sudo smartctl -a /dev/sdc
[...]
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
[...]
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      2089         -
# 2  Extended offline    Completed without error       00%      2087         -
# 3  Short offline       Completed without error       00%      2084         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
[...]
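
When checking many disks, the report can be filtered for anything that needs attention. The sketch below is a hypothetical helper, smart_report_problems, which assumes the ATA-style report layout shown above. It flags an overall health result other than PASSED, and any self-test log entry whose status is neither "Completed without error" nor still in progress. Interrupted or aborted tests are therefore listed as well, since they warrant a re-run.

```shell
# Hypothetical helper: scan `smartctl -a` output read from stdin.
# Prints every line that needs attention; exits 1 if any was found, else 0.
smart_report_problems() {
    awk '
        # Overall health line that does not say PASSED.
        /self-assessment test result:/ && !/PASSED/ { print; bad = 1 }
        # Self-test log entries ("# 1  Short offline ...") with another status.
        /^# *[0-9]+ / && !/Completed without error/ && !/in progress/ { print; bad = 1 }
        END { exit (bad ? 1 : 0) }'
}
```

For example: sudo smartctl -a /dev/sdc | smart_report_problems || echo "check /dev/sdc"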

Method 2 - dd read test

It is possible to perform a dd test to ensure all sectors of the disk can be read. This does not detect data corruption, but it does detect sectors that cannot be read. The following runs a dd read of the whole device in the idle I/O scheduling class to minimise impact:

$ sudo ionice -c 3 dd if=/dev/sdc bs=1048576 of=/dev/null

Note that with ionice scheduling class 3 (idle), dd only gets disk time when no other process is using the disk, so this may take a very long time to complete. See man ionice for more information.
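
Because dd exits with a non-zero status when a read fails, the result of the read test can be checked in a script. The following is a minimal sketch; read_test is a hypothetical wrapper name, and the dd and ionice options are the same as in the command above.

```shell
# Hypothetical wrapper: read a device (or file) end to end at idle I/O priority.
# $1: path to read, for example /dev/sdc.
# Returns 0 if every block could be read, 1 otherwise.
read_test() {
    if ionice -c 3 dd if="$1" of=/dev/null bs=1048576; then
        echo "read test passed: $1"
    else
        echo "read test FAILED: $1" >&2
        return 1
    fi
}
```

For example: sudo sh -c '. ./read_test.sh; read_test /dev/sdc', where read_test.sh is an assumed file name holding the function.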

Notes

Some issues may not be discovered using these tests. These include:

  • controller firmware issues (for example a controller having issues under certain load)
  • disk firmware issues (for example disk having issues if certain large read/write load occurs)

These issues tend to be transient and stem from firmware problems rather than from a hardware fault. When diagnosing an existing issue, check dmesg for disk access errors.
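
The dmesg check can be narrowed to a single device. The sketch below is a hypothetical filter, disk_errors, which reads kernel log text on standard input. The search patterns are an assumed, non-exhaustive set of common kernel error strings, so an empty result does not prove the disk is healthy.

```shell
# Hypothetical filter: show kernel log lines that mention a device and an error.
# $1: device name without the /dev/ prefix, for example sdc.
# Exit status is that of grep: 0 if matching lines were found, 1 if none.
disk_errors() {
    grep -F -- "$1" |
        grep -Ei 'I/O error|medium error|unrecovered read error|critical target error'
}
```

For example: sudo dmesg | disk_errors sdc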

Keywords

SMARTMONTOOLS SMARTCTL SMART DD DISK TEST ERROR HEALTH

Timestamp

September 2020

© 2015 Copyright Aerospike, Inc. | All rights reserved. Creators of the Aerospike Database.