Potential SSD issues


#1

Potential symptoms of a bad SSD:

WARNING (drv_ssd): (drv_ssd.c::1165) read failed: expected 512 got -1: fd 9900 data 0x7f277981b000 errno 5

This message usually points to a hardware problem. Aerospike tried to read 512 bytes from the device and the read returned -1 (an error) with errno 5, which is EIO (I/O error).
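To confirm what the errno in this warning means, the following minimal Python sketch pulls the fields out of the log line shown above and maps errno 5 to its symbolic name. The regular expression and the function name `parse_read_failure` are illustrative assumptions, not part of Aerospike, and the message format may differ between server versions.

```python
import errno
import os
import re

# Pattern written for the warning shown above; other server versions may
# format the message differently (assumption).
READ_FAILED = re.compile(
    r"read failed: expected (?P<expected>\d+) got (?P<got>-?\d+): "
    r"fd (?P<fd>\d+) data (?P<data>0x[0-9a-fA-F]+) errno (?P<errno>\d+)"
)

def parse_read_failure(line):
    """Return the fields of a drv_ssd 'read failed' warning, or None."""
    match = READ_FAILED.search(line)
    if match is None:
        return None
    code = int(match.group("errno"))
    return {
        "expected": int(match.group("expected")),
        "got": int(match.group("got")),
        "fd": int(match.group("fd")),
        "errno": code,
        # Symbolic name from the standard errno table, e.g. EIO for 5
        "errno_name": errno.errorcode.get(code, "UNKNOWN"),
        # Human-readable text as reported by the local C library
        "errno_text": os.strerror(code),
    }

line = ("WARNING (drv_ssd): (drv_ssd.c::1165) read failed: expected 512 "
        "got -1: fd 9900 data 0x7f277981b000 errno 5")
info = parse_read_failure(line)
print(info["errno_name"], "-", info["errno_text"])
```

A short read would show up as a `got` value between 0 and `expected`. Here `got` is -1, so the read call itself failed and errno carries the reason.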

Some things to check on:

  • Is RAID used in any configuration? (This would amplify the write load.)
  • Was the device overprovisioned?
  • How many TB per day are being written? What is the typical load?
  • What are the defrag settings? Are the defaults being used, or is defrag set too aggressively? This also helps determine whether there has been more write pressure than usual.
  • Check whether the firmware is the latest version.
  • For overprovisioning, there is a big difference between partition overprovisioning and hdparm (Host Protected Area):
    • The difference between the two depends on the controller. It is not guaranteed that the controller will make use of the unpartitioned space. For example, the Micron controller (firmware) does not, Intel does make use of unpartitioned space, and for Samsung it varies.
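To put a number on the "TB per day" question, one option on Linux is to sample the kernel's per-device write counter twice and extrapolate. The sketch below assumes the standard `/sys/block/<dev>/stat` layout, where the seventh field is sectors written in 512-byte units. The device name `sdb` and the helper names are illustrative. It measures host writes only, so it does not include write amplification inside the SSD.

```python
import time

SECTOR_BYTES = 512  # /sys/block/<dev>/stat counts in 512-byte sectors

def sectors_written(stat_line):
    """Extract the 'sectors written' counter (7th field) from a stat line."""
    return int(stat_line.split()[6])

def tb_per_day(sectors_before, sectors_after, interval_seconds):
    """Extrapolate a write rate in TB/day (1 TB = 10**12 bytes)."""
    written_bytes = (sectors_after - sectors_before) * SECTOR_BYTES
    return written_bytes / interval_seconds * 86400 / 1e12

def sample(device="sdb", interval_seconds=60):
    """Read the counter twice, interval_seconds apart, and return TB/day."""
    path = f"/sys/block/{device}/stat"
    with open(path) as f:
        before = sectors_written(f.read())
    time.sleep(interval_seconds)
    with open(path) as f:
        after = sectors_written(f.read())
    return tb_per_day(before, after, interval_seconds)
```

Calling `sample("sdb", 60)` on the affected node gives a rough rate. A short interval only reflects the load at that moment, so it is worth sampling during peak and off-peak periods before comparing against the drive's rated endurance.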

Example output from a Samsung 843T:

=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG MZ7WD480HCGM-00003
Serial Number: xxxxxxxxxxxxx
LU WWN Device Id: 5 002538 5001bdf08
Firmware Version: DXM9103Q
User Capacity: 480.103.981.056 bytes [480 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Mon Sep 22 10:36:01 2014 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Now, 3 weeks after installing Aerospike, 4 out of 16 devices show SMART errors. Latency is growing on these nodes and we are getting the "read failed" errors shown above.

Here is the SMART output:

Error 18182 occurred at disk power-on lifetime: 260 hours (10 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 51 e1 1f 4a 72 e2 Error:

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
27 00 00 00 00 00 00 03 00:15:38.321 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
1 Extended offline Interrupted (host reset) 90% 260 -
2 Extended offline Completed without error 00% 39 -

Potential errors in kern.log:

easy5 kernel: [939269.314716] end_request: I/O error, dev sdb, sector 41044480
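The SMART error registers and the kernel message can be cross-checked. For a 28-bit ATA command, the LBA is spread across the SN, CL and CH registers plus the low four bits of DH. The sketch below decodes the register line from the SMART output above and compares it with the sector in the kern.log line. It assumes the failed command used 28-bit addressing and that both messages come from the same device. The helper name `lba28_from_registers` is illustrative.

```python
def lba28_from_registers(register_line):
    """Decode the LBA from a smartctl 'ER ST SC SN CL CH DH' register line.

    Assumes a 28-bit ATA command: SN, CL and CH hold LBA bits 0-23 and the
    low nibble of DH holds bits 24-27.
    """
    fields = register_line.split()[:7]
    _er, _st, _sc, sn, cl, ch, dh = (int(f, 16) for f in fields)
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

smart_lba = lba28_from_registers("00 51 e1 1f 4a 72 e2")
kernel_sector = 41044480  # from the end_request line above

# Offset of the failing LBA from the start of the request the kernel reported
print(smart_lba, smart_lba - kernel_sector)
```

The decoded LBA is 41044511, which is 31 sectors past sector 41044480. That is consistent with the SMART log and the kernel describing the same failing region.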

Other output:

The 480 GB Samsung SM843T was overprovisioned with hdparm:

=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG MZ7WD480HCGM-00003
Serial Number: S1G1NYAF506443
LU WWN Device Id: 5 002538 5001bdf0c
Firmware Version: DXM9103Q
User Capacity: 379.282.145.280 bytes [379 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Tue Sep 30 16:47:58 2014 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
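Comparing the two User Capacity values shows how much space was hidden with hdparm. This is a minimal Python sketch of that arithmetic, using only the byte counts from the two smartctl outputs in this document. The function names are illustrative.

```python
SECTOR_BYTES = 512  # logical sector size reported by smartctl

def hidden_fraction(native_bytes, visible_bytes):
    """Fraction of the native capacity that is hidden from the host."""
    return (native_bytes - visible_bytes) / native_bytes

def visible_sectors(native_bytes, keep_fraction):
    """Sector count that leaves keep_fraction of the device visible."""
    return round(native_bytes / SECTOR_BYTES * keep_fraction)

native = 480_103_981_056   # User Capacity before overprovisioning
visible = 379_282_145_280  # User Capacity after overprovisioning

print(f"hidden: {hidden_fraction(native, visible):.1%}")
print("visible sectors:", visible // SECTOR_BYTES)
```

About 21% of the device is hidden from the host, leaving 740785440 visible sectors. A sector count like this is the kind of value given to hdparm when setting the Host Protected Area. Whether the controller actually uses the hidden space depends on the firmware, as noted earlier.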

Error 38960 occurred at disk power-on lifetime: 499 hours (20 days + 19 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 51 a1 5f e8 6d ec Error:

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
27 00 00 00 00 00 00 00 00:29:57.371 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

kern.log:

Sep 30 15:57:21 easy5 kernel: [707766.123460] ata8.00: configured for UDMA/33
Sep 30 15:57:21 easy5 kernel: [707766.123478] ata8: EH complete
Sep 30 15:57:21 easy5 kernel: [707766.123489] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
Sep 30 15:57:21 easy5 kernel: [707766.128210] sas: sas_ata_task_done: SAS error 2
Sep 30 15:57:21 easy5 kernel: [707766.128219] sd 2:0:1:0: [sdb] command ffff881217ac2100 timed out
Sep 30 15:57:21 easy5 kernel: [707766.128250] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Sep 30 15:57:21 easy5 kernel: [707766.128254] sas: ata8: end_device-2:1: cmd error handler
Sep 30 15:57:21 easy5 kernel: [707766.128266] sas: ata7: end_device-2:0: dev error handler
Sep 30 15:57:21 easy5 kernel: [707766.128270] sas: ata8: end_device-2:1: dev error handler
Sep 30 15:57:21 easy5 kernel: [707766.128277] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Sep 30 15:57:21 easy5 kernel: [707766.135201] ata8.00: failed command: READ DMA
Sep 30 15:57:21 easy5 kernel: [707766.142002] ata8.00: cmd c8/00:00:00:e8:6d/00:00:00:00:00/ec tag 15 dma 131072 in
Sep 30 15:57:21 easy5 kernel: [707766.142002] res 01/04:01:30:08:00/00:00:00:00:00/a0 Emask 0x3 (HSM violation)
Sep 30 15:57:21 easy5 kernel: [707766.155874] ata8.00: status: { ERR }
Sep 30 15:57:21 easy5 kernel: [707766.162790] ata8.00: error: { ABRT }
Sep 30 15:57:21 easy5 kernel: [707766.169614] ata8: hard resetting link
Sep 30 15:57:21 easy5 kernel: [707766.621403] ata8.00: configured for UDMA/33
Sep 30 15:57:21 easy5 kernel: [707766.621421] ata8: EH complete
Sep 30 15:57:21 easy5 kernel: [707766.625124] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
Sep 30 15:57:21 easy5 kernel: [707766.627023] sas: sas_ata_task_done: SAS error 2
Sep 30 15:57:21 easy5 kernel: [707766.627031] sd 2:0:1:0: [sdb] command ffff881217ac2100 timed out
Sep 30 15:57:21 easy5 kernel: [707766.627406] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Sep 30 15:57:21 easy5 kernel: [707766.627411] sas: ata8: end_device-2:1: cmd error handler
Sep 30 15:57:21 easy5 kernel: [707766.627428] sas: ata7: end_device-2:0: dev error handler
Sep 30 15:57:21 easy5 kernel: [707766.627440] sas: ata8: end_device-2:1: dev error handler
Sep 30 15:57:21 easy5 kernel: [707766.627446] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Sep 30 15:57:21 easy5 kernel: [707766.634389] ata8.00: failed command: READ DMA
Sep 30 15:57:21 easy5 kernel: [707766.641213] ata8.00: cmd c8/00:00:00:e8:6d/00:00:00:00:00/ec tag 18 dma 131072 in
Sep 30 15:57:21 easy5 kernel: [707766.641213] res 01/04:01:30:08:00/00:00:00:00:00/a0 Emask 0x3 (HSM violation)
Sep 30 15:57:21 easy5 kernel: [707766.655131] ata8.00: status: { ERR }
Sep 30 15:57:21 easy5 kernel: [707766.662080] ata8.00: error: { ABRT }
Sep 30 15:57:21 easy5 kernel: [707766.668902] ata8: hard resetting link
Sep 30 15:57:22 easy5 kernel: [707767.121278] ata8.00: configured for UDMA/33
Sep 30 15:57:22 easy5 kernel: [707767.121296] sd 2:0:1:0: [sdb]
Sep 30 15:57:22 easy5 kernel: [707767.121298] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 30 15:57:22 easy5 kernel: [707767.121300] sd 2:0:1:0: [sdb]
Sep 30 15:57:22 easy5 kernel: [707767.121301] Sense Key : Aborted Command [current] [descriptor]
Sep 30 15:57:22 easy5 kernel: [707767.121304] Descriptor sense data with sense descriptors (in hex):
Sep 30 15:57:22 easy5 kernel: [707767.121305] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Sep 30 15:57:22 easy5 kernel: [707767.121312] 00 00 08 30
Sep 30 15:57:22 easy5 kernel: [707767.121316] sd 2:0:1:0: [sdb]
Sep 30 15:57:22 easy5 kernel: [707767.121317] Add. Sense: No additional sense information
Sep 30 15:57:22 easy5 kernel: [707767.121319] sd 2:0:1:0: [sdb] CDB:
Sep 30 15:57:22 easy5 kernel: [707767.121320] Read(10): 28 00 0c 6d e8 00 00 01 00 00
Sep 30 15:57:22 easy5 kernel: [707767.121327] end_request: I/O error, dev sdb, sector 208529408
Sep 30 15:57:22 easy5 kernel: [707767.128206] ata8: EH complete
Sep 30 15:57:22 easy5 kernel: [707767.128288] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
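The failed request can be decoded directly from the log lines above. The sketch below assumes the standard SCSI READ(10) CDB layout (opcode 0x28, big-endian LBA in bytes 2-5, transfer length in bytes 7-8). It also assumes the libata `cmd` format of command/feature:count:lbal:lbam:lbah, then the high-order bytes, then the device register. The function names are illustrative.

```python
SECTOR_BYTES = 512

def decode_read10(cdb_hex):
    """Return (lba, blocks) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    blocks = int.from_bytes(cdb[7:9], "big")
    return lba, blocks

def decode_libata_cmd28(cmd):
    """Return (lba, sectors) from a libata 'cmd' string of a 28-bit command."""
    _command, low, _hob, device = cmd.split("/")
    _feature, count, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    lba = ((int(device, 16) & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    # For 28-bit commands a sector count of 0 means 256 sectors
    sectors = count or 256
    return lba, sectors

scsi_lba, scsi_blocks = decode_read10("28 00 0c 6d e8 00 00 01 00 00")
ata_lba, ata_sectors = decode_libata_cmd28("c8/00:00:00:e8:6d/00:00:00:00:00/ec")

print(scsi_lba, scsi_blocks, scsi_blocks * SECTOR_BYTES)
print(ata_lba, ata_sectors, ata_sectors * SECTOR_BYTES)
```

Both decode to LBA 208529408 with a length of 256 sectors (131072 bytes). This matches the sector in the end_request line and the "dma 131072 in" size on the failed READ DMA command.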