How To perform basic monitoring and tests on disk health
When using databases or other long-term storage solutions, it is important to perform periodic disk checks. Disks (SSD and HDD alike) degrade over time, and unused sectors may develop faults. This article explains basic methods for testing disks for damage.
Method 1 - SMART tests
Most disks support SMART capabilities. To utilise SMART, the smartmontools package must be installed. On Ubuntu, for example, the installation would be performed using:
sudo apt-get install smartmontools
Run the following command to test for SMART capabilities:
$ sudo smartctl -i /dev/sdc
...
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
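For scripted checks, the two "SMART support is" lines can be tested automatically. The following is a minimal sketch, not part of smartmontools: the function name smart_ready is hypothetical, and it assumes the output wording shown above.

```shell
# Hypothetical helper: reads `smartctl -i` output on stdin and succeeds
# only if SMART is reported as both available and enabled.
smart_ready() {
    awk '
        /^SMART support is: +Available/ { available = 1 }
        /^SMART support is: +Enabled/   { enabled = 1 }
        END { exit !(available && enabled) }
    '
}

# Usage (the device name is an example):
#   sudo smartctl -i /dev/sdc | smart_ready && echo "SMART ready"
```

If the disk reports SMART as available but disabled, it can usually be enabled with sudo smartctl -s on /dev/sdc.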
To run SMART tests, the basic options are ‘short’ and ‘long’. A short test checks the electronics and mechanics (if any) and performs a quick read test on a small portion of the disk. A long test is more thorough: it tests all sectors of the disk for readability and (if supported) parity errors.
Note that the tests can be taxing and result in some impact to the server. It is therefore advisable, before performing SMART tests, to quiesce the node and set migrate-fill-delay to ensure it does not receive traffic for the duration of the tests. Following this, the node can be brought back to take transactions using the quiesce-undo command.
To perform a short test:
sudo smartctl -t short /dev/sdc
To perform a long test:
sudo smartctl -t long /dev/sdc
To view the test results and overall disk health:
$ sudo smartctl -a /dev/sdc
[...]
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
[...]
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      2089         -
# 2  Extended offline    Completed without error       00%      2087         -
# 3  Short offline       Completed without error       00%      2084         -
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
[...]
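When checking many disks, it may help to print only the self-test log entries that did not finish cleanly. The sketch below assumes the log layout shown above, where each entry starts with "#" and a clean run reads "Completed without error". The function name selftest_problems is hypothetical, and a test that is still running or was aborted will also be listed.

```shell
# Hypothetical helper: reads a SMART self-test log on stdin and prints
# every entry that is not marked "Completed without error".
# Exit status is 0 if at least one such entry was printed, 1 otherwise.
selftest_problems() {
    awk '
        /^#/ && !/Completed without error/ { print; found = 1 }
        END { exit !found }
    '
}

# Usage (the device name is an example):
#   if sudo smartctl -l selftest /dev/sdc | selftest_problems; then
#       echo "review the entries above"
#   fi
```

Here smartctl -l selftest prints only the self-test log rather than the full -a report.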
Method 2 - dd read test
A dd read test can confirm that all sectors of the disk can be read. This does not detect data corruption, but it does detect unreadable sectors. The following runs a dd test at low I/O priority to minimise impact:
$ sudo ionice -c 3 dd if=/dev/sdc bs=1048576 of=/dev/null
Note that with the ionice idle class (-c 3), dd only gets disk time when no other process needs it, so this may take a very long time to complete. See man ionice for more information.
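Because dd exits with a non-zero status when a read fails, the test can be wrapped so that the result is reported explicitly. This is a minimal sketch: the function name read_test is hypothetical, and it falls back to plain dd when ionice is missing or cannot set the idle class.

```shell
# Hypothetical helper: reads the whole device or file given as $1 and
# returns dd's exit status (0 means every block was readable).
read_test() {
    if ionice -c 3 true >/dev/null 2>&1; then
        # Idle I/O class is usable: run the read at the lowest priority.
        ionice -c 3 dd if="$1" of=/dev/null bs=1048576 2>/dev/null
    else
        # ionice is unavailable here: run dd at normal priority.
        dd if="$1" of=/dev/null bs=1048576 2>/dev/null
    fi
}

# Usage (from a root shell; the device name is an example):
#   read_test /dev/sdc && echo "read OK" || echo "read FAILED"
```

The sketch discards dd's own messages, so on failure rerun the plain dd command to see the error it reports.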
Some issues may not be discovered using this test. These include:
- controller firmware issues (for example a controller having issues under certain load)
- disk firmware issues (for example disk having issues if certain large read/write load occurs)
These issues are typically transient and stem from firmware problems rather than a hardware fault. When diagnosing an existing issue, check dmesg for disk access errors.
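The dmesg check can be narrowed to a single disk with a small filter. This is a sketch under assumptions: the function name disk_errors is hypothetical, and it assumes the kernel logs read failures with the text "I/O error" together with the device name. The exact wording varies by kernel version and driver, so treat an empty result as inconclusive.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints lines
# that mention an I/O error for the device named in $1.
# $1 is the kernel device name without /dev/, for example "sdc".
# The match is a plain substring, so partitions such as sdc1 are included.
disk_errors() {
    grep -i 'I/O error' | grep -F -- "$1"
}

# Usage (the device name is an example):
#   sudo dmesg | disk_errors sdc
```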