Crashing while running ycsb


#1

Hi,

I’m running ycsb on a single aerospike server with a single 800GB ssd

when starting load a (50% read write),

after a few minutes aerospike crashes with the following errors:

May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1252) read_and_size_all: failed as_storage_record_read_ssd()
May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1192) read: bad block magic offset 161309695488
May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1192) read: bad block magic offset 486124312576
May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1192) read: bad block magic offset 608211788800
May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1192) read: bad block magic offset 51915504640
May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1192) read: bad block magic offset 601177939968
May 14 2015 18:46:20 GMT: WARNING (as): (signal.c::162) stacktrace: frame 3: /usr/bin/asd() [0x4acaa6]
May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1192) read: bad block magic offset 428776915968
May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1192) read: bad block magic offset 261167933440
May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1192) read: bad block magic offset 163195598848
May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1252) read_and_size_all: failed as_storage_record_read_ssd()
May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1192) read: bad block magic offset 500951696896
May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1192) read: bad block magic offset 608833404416
May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1198) read: read wrong key: expecting 6d87ffd339c6262c got 3b66e7732adb3f42

here is the config file:

# Aerospike database configuration file.

service {
        user root
        group root
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        pidfile /var/run/aerospike/asd.pid
        service-threads 20
        transaction-queues 20
        transaction-threads-per-queue 3
        proto-fd-max 100000
        #min-avail-pct 2
}

logging {
        # Log file must be an absolute path.
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}

network {
        service {
                address any
                port 3000
        access-address 17.20.140.10
        }

        heartbeat {
                mode multicast
                address 239.1.99.222
                port 9918

                # To use unicast-mesh heartbeats, remove the 3 lines above, and see
                # aerospike_mesh.conf for alternative.

                interval 150
                timeout 10
        }

        fabric {
                port 3001
        }

        info {
                port 3003
        }
}

namespace bar {
        replication-factor 1
        memory-size 32G
        default-ttl 30d # 30 days, use 0 to never expire/evict.

        storage-engine memory
}

namespace test {
        replication-factor 1
        high-water-disk-pct 98
        high-water-memory-pct 98
        memory-size 60G
        stop-writes-pct 99
        default-ttl 240d # 30 days, use 0 to never expire/evict.
        storage-engine device {
           device /dev/sdaa
           write-block-size 128K
           scheduler-mode noop
           defrag-sleep 1000
           defrag-lwm-pct 60
           min-avail-pct 2
      }
}
        
        

do you know what these errors might mean?

Thank you very much


#5

Could you supply the output of:

sudo fdisk -l
sudo hdparm -I /dev/sdaa

Could you also grep your logs for signal.c, looks like there was a stack trace in the logs.

sudo grep /var/log/aerospike/aerospike.log | grep "signal.c"

Lastly, could you provide the Aerospike version and the server version?

asinfo -v "build"
uname -a
cat /etc/issue

#6

Thanks for the reply I have switched to 4 devices now since defrag can’t hold the write speed on only one. I got the same errors and aerospike crushed. does these errors indicate on data corruption?

here are the logs:

sudo fdisk -l :

Disk /dev/mapper/vg_labrador-lv_root: 53.7 GB, 53687091200 bytes
255 heads, 63 sectors/track, 6527 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000


Disk /dev/mapper/vg_labrador-lv_swap: 4294 MB, 4294967296 bytes
255 heads, 63 sectors/track, 522 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000


Disk /dev/mapper/vg_labrador-lv_home: 421.6 GB, 421594660864 bytes
255 heads, 63 sectors/track, 51255 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000


Disk /dev/sdaa: 644.2 GB, 644245094400 bytes
255 heads, 63 sectors/track, 78325 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 8192 bytes
I/O size (minimum/optimal): 131072 bytes / 131072 bytes
Disk identifier: 0x00000000


Disk /dev/sdab: 644.2 GB, 644245094400 bytes
255 heads, 63 sectors/track, 78325 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 8192 bytes
I/O size (minimum/optimal): 131072 bytes / 131072 bytes
Disk identifier: 0x00000000


Disk /dev/sdac: 644.2 GB, 644245094400 bytes
255 heads, 63 sectors/track, 78325 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 8192 bytes
I/O size (minimum/optimal): 131072 bytes / 131072 bytes
Disk identifier: 0x00000000


Disk /dev/sdad: 644.2 GB, 644245094400 bytes
255 heads, 63 sectors/track, 78325 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 8192 bytes
I/O size (minimum/optimal): 131072 bytes / 131072 bytes
Disk identifier: 0x00000000

sudo hdparm -I /dev/sdaa (all 4 devices are the same):

/dev/sdaa:
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

ATA device, with non-removable media
Standards:
        Likely used: 1
Configuration:
        Logical         max     current
        cylinders       0       0
        heads           0       0
        sectors/track   0       0
        --
        Logical/Physical Sector size:           512 bytes
        device size with M = 1024*1024:           0 MBytes
        device size with M = 1000*1000:           0 MBytes
        cache/buffer size  = unknown
Capabilities:
        IORDY not likely
        Cannot perform double-word IO
        R/W multiple sector transfer: not supported
        DMA: not supported
        PIO: pio0

sudo cat /var/log/aerospike/aerospike.log | grep “signal.c”:

May 18 2015 09:21:02 GMT: WARNING (as): (signal.c::93) SIGABRT received, aborting Aerospike Community Edition build 3.5.9
May 18 2015 09:21:02 GMT: WARNING (as): (signal.c::95) stacktrace: found 9 frames
May 18 2015 09:21:02 GMT: WARNING (as): (signal.c::95) stacktrace: frame 0: asd(as_sig_handle_abort+0x59) [0x46d9a5]
May 18 2015 09:21:02 GMT: WARNING (as): (signal.c::95) stacktrace: frame 1: /lib64/libc.so.6(+0x326a0) [0x7f885be136a0]
May 18 2015 09:21:02 GMT: WARNING (as): (signal.c::95) stacktrace: frame 2: /lib64/libc.so.6(gsignal+0x35) [0x7f885be13625]
May 18 2015 09:21:02 GMT: WARNING (as): (signal.c::95) stacktrace: frame 3: /lib64/libc.so.6(abort+0x175) [0x7f885be14e05]
May 18 2015 09:21:02 GMT: WARNING (as): (signal.c::95) stacktrace: frame 4: asd(cf_fault_event+0x271) [0x4ff8d8]
May 18 2015 09:21:02 GMT: WARNING (as): (signal.c::95) stacktrace: frame 5: asd(as_config_post_process+0x232) [0x453e7e]
May 18 2015 09:21:02 GMT: WARNING (as): (signal.c::95) stacktrace: frame 6: asd(main+0x1fc) [0x44f6c2]
May 18 2015 09:21:02 GMT: WARNING (as): (signal.c::95) stacktrace: frame 7: /lib64/libc.so.6(__libc_start_main+0xfd) [0x7f885bdffd5d]
May 18 2015 09:21:02 GMT: WARNING (as): (signal.c::95) stacktrace: frame 8: asd() [0x44e669]
May 18 2015 12:48:12 GMT: WARNING (as): (signal.c::170) SIGTERM received, shutting down
May 18 2015 12:48:12 GMT: WARNING (as): (signal.c::93) SIGABRT received, aborting Aerospike Community Edition build 3.5.9
May 18 2015 12:48:12 GMT: WARNING (as): (signal.c::95) stacktrace: found 9 frames
May 18 2015 12:48:12 GMT: WARNING (as): (signal.c::95) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_abort+0x59) [0x46d9a5]
May 18 2015 12:48:12 GMT: WARNING (as): (signal.c::95) stacktrace: frame 1: /lib64/libc.so.6(+0x326a0) [0x7f167c94e6a0]
May 18 2015 12:48:12 GMT: WARNING (as): (signal.c::95) stacktrace: frame 2: /lib64/libc.so.6(gsignal+0x35) [0x7f167c94e625]
May 18 2015 12:48:12 GMT: WARNING (as): (signal.c::95) stacktrace: frame 3: /lib64/libc.so.6(abort+0x175) [0x7f167c94fe05]
May 18 2015 12:48:12 GMT: WARNING (as): (signal.c::95) stacktrace: frame 4: /usr/bin/asd(cf_fault_event+0x271) [0x4ff8d8]
May 18 2015 12:48:12 GMT: WARNING (as): (signal.c::95) stacktrace: frame 5: /usr/bin/asd(ssd_flush_swb+0x127) [0x4f7756]
May 18 2015 12:48:12 GMT: WARNING (as): (signal.c::95) stacktrace: frame 6: /usr/bin/asd(ssd_write_worker+0x169) [0x4f90c7]
May 18 2015 12:48:12 GMT: WARNING (as): (signal.c::95) stacktrace: frame 7: /lib64/libpthread.so.0(+0x79d1) [0x7f167d7739d1]
May 18 2015 12:48:12 GMT: WARNING (as): (signal.c::95) stacktrace: frame 8: /lib64/libc.so.6(clone+0x6d) [0x7f167ca048fd]
May 18 2015 12:48:34 GMT: WARNING (as): (signal.c::93) SIGABRT received, aborting Aerospike Community Edition build 3.5.9
May 18 2015 12:48:34 GMT: WARNING (as): (signal.c::95) stacktrace: found 9 frames
May 18 2015 12:48:34 GMT: WARNING (as): (signal.c::95) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_abort+0x59) [0x46d9a5]
May 18 2015 12:48:34 GMT: WARNING (as): (signal.c::95) stacktrace: frame 1: /lib64/libc.so.6(+0x326a0) [0x7f207a6da6a0]
May 18 2015 12:48:34 GMT: WARNING (as): (signal.c::95) stacktrace: frame 2: /lib64/libc.so.6(gsignal+0x35) [0x7f207a6da625]
May 18 2015 12:48:34 GMT: WARNING (as): (signal.c::95) stacktrace: frame 3: /lib64/libc.so.6(abort+0x175) [0x7f207a6dbe05]
May 18 2015 12:48:34 GMT: WARNING (as): (signal.c::95) stacktrace: frame 4: /usr/bin/asd(cf_fault_event+0x271) [0x4ff8d8]
May 18 2015 12:48:34 GMT: WARNING (as): (signal.c::95) stacktrace: frame 5: /usr/bin/asd(as_storage_init+0x99) [0x4ef6d3]
May 18 2015 12:48:34 GMT: WARNING (as): (signal.c::95) stacktrace: frame 6: /usr/bin/asd(main+0x2f0) [0x44f7b6]
May 18 2015 12:48:34 GMT: WARNING (as): (signal.c::95) stacktrace: frame 7: /lib64/libc.so.6(__libc_start_main+0xfd) [0x7f207a6c6d5d]
May 18 2015 12:48:34 GMT: WARNING (as): (signal.c::95) stacktrace: frame 8: /usr/bin/asd() [0x44e669]
May 18 2015 12:49:21 GMT: WARNING (as): (signal.c::93) SIGABRT received, aborting Aerospike Community Edition build 3.5.9
May 18 2015 12:49:21 GMT: WARNING (as): (signal.c::95) stacktrace: found 9 frames
May 18 2015 12:49:21 GMT: WARNING (as): (signal.c::95) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_abort+0x59) [0x46d9a5]
May 18 2015 12:49:21 GMT: WARNING (as): (signal.c::95) stacktrace: frame 1: /lib64/libc.so.6(+0x326a0) [0x7f3eee2036a0]
May 18 2015 12:49:21 GMT: WARNING (as): (signal.c::95) stacktrace: frame 2: /lib64/libc.so.6(gsignal+0x35) [0x7f3eee203625]
May 18 2015 12:49:21 GMT: WARNING (as): (signal.c::95) stacktrace: frame 3: /lib64/libc.so.6(abort+0x175) [0x7f3eee204e05]
May 18 2015 12:49:21 GMT: WARNING (as): (signal.c::95) stacktrace: frame 4: /usr/bin/asd(cf_fault_event+0x271) [0x4ff8d8]
May 18 2015 12:49:21 GMT: WARNING (as): (signal.c::95) stacktrace: frame 5: /usr/bin/asd(as_storage_init+0x99) [0x4ef6d3]
May 18 2015 12:49:21 GMT: WARNING (as): (signal.c::95) stacktrace: frame 6: /usr/bin/asd(main+0x2f0) [0x44f7b6]
May 18 2015 12:49:21 GMT: WARNING (as): (signal.c::95) stacktrace: frame 7: /lib64/libc.so.6(__libc_start_main+0xfd) [0x7f3eee1efd5d]
May 18 2015 12:49:21 GMT: WARNING (as): (signal.c::95) stacktrace: frame 8: /usr/bin/asd() [0x44e669]
May 18 2015 14:08:00 GMT: WARNING (as): (signal.c::170) SIGTERM received, shutting down
May 19 2015 00:12:48 GMT: WARNING (as): (signal.c::160) SIGSEGV received, aborting Aerospike Community Edition build 3.5.9
May 19 2015 00:12:48 GMT: WARNING (as): (signal.c::162) stacktrace: found 9 frames
May 19 2015 00:12:48 GMT: WARNING (as): (signal.c::162) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x59) [0x46db08]
May 19 2015 00:12:48 GMT: WARNING (as): (signal.c::162) stacktrace: frame 1: /lib64/libc.so.6(+0x326a0) [0x7f0eed4006a0]
May 19 2015 00:12:48 GMT: WARNING (as): (signal.c::162) stacktrace: frame 2: /usr/bin/asd(write_local+0x4c3) [0x4ad047]
May 19 2015 00:12:48 GMT: WARNING (as): (signal.c::162) stacktrace: frame 3: /usr/bin/asd() [0x4b1fd6]
May 19 2015 00:12:48 GMT: WARNING (as): (signal.c::162) stacktrace: frame 4: /usr/bin/asd(as_rw_start+0x291) [0x4b3ef9]
May 19 2015 00:12:48 GMT: WARNING (as): (signal.c::162) stacktrace: frame 5: /usr/bin/asd(process_transaction+0xe52) [0x4bdbd7]
May 19 2015 00:12:48 GMT: WARNING (as): (signal.c::162) stacktrace: frame 6: /usr/bin/asd(thr_tsvc+0x1b) [0x4be0e1]
May 19 2015 00:12:48 GMT: WARNING (as): (signal.c::162) stacktrace: frame 7: /lib64/libpthread.so.0(+0x79d1) [0x7f0eee2259d1]
May 19 2015 00:12:48 GMT: WARNING (as): (signal.c::162) stacktrace: frame 8: /lib64/libc.so.6(clone+0x6d) [0x7f0eed4b68fd]

uname -a

Linux Shelby 2.6.32-431.el6.x86_64 #1 SMP Fri Nov 22 03:15:09 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

asinfo -v “build”

3.5.9

cat /etc/issue

CentOS release 6.5 (Final)
Kernel \r on an \m

#7

Did you try reducing the defrag-sleep? May help push a bit more performance out of defrag.

Also, the hdparm command didn’t provide the make and model of these drives. Can you provide that information? Are you going through a RAID controller?

If the errors were the same as before

Like these:

May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1252) read_and_size_all: failed as_storage_record_read_ssd()
May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1192) read: bad block magic offset 500951696896
May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1192) read: bad block magic offset 608833404416

Then that does indicate disk corruption. While your config does have some risky settings, I do not see anything configuration issue that may lead to this.

Know issues to cause this are using a drive larger than 2TiB, as of 3.5.10 we will only use the first 2 TiB of a device and fail if a filesize is larger than 2 TiB. Also defining the same device multiple times in config.

Actual disk corruption would also cause this message as well so I cannot rule out a hardware issue.


#8

H,

Thanks for the answer. I’m actually using a virtual volume that is sitting on 12 Sata ssd’s. what I have noticed is that every time I’m executing commands that impact the server (causes a high latency for 1 seconds)- I’m getting theses error.

could temporary high latency cause these error message?


#9

This would wouldn’t be an expected result of latency. I have also seen high disk latency in the past and these errors were never associated.

Having more than 2 TiB of storage assigned to a single device in Aerospike will lead to these problems. As of 3.5.10 we will error out if file sizes are defined to be larger than 2TiB and only use the first 2TiB of a RAW device.

How large are each of the SATA SSDs?

Is there a reason to bundle them under a single logical drive? You are losing IO parallelism by creating a single device out of them. I would recommend using them as 12 RAW devices.


#10

thanks for the reply

I know I’m loosing IO parallelism, but I just wanted to check how Aerospike will work with linux scst module.

currently I keep crashing.


#11

Looking at the original warning lines:

May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1252) read_and_size_all: failed as_storage_record_read_ssd()
May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1192) read: bad block magic offset 161309695488
May 14 2015 18:46:20 GMT: WARNING (drv_ssd): (drv_ssd.c::1192) read: bad block magic offset 486124312576

These warnings are typically symptom of corrupted data/disk, or another process also writing to the same device (for example, 2 different nodes using the same device). Would it possible to check such possibilities?

Thanks.