Aerospike 3.7.0.2 crashing

crash

#1

Hi,

We are seeing consistent crashing of all nodes with version 3.7.0.2.

Here is the stack trace seen in the logs:

Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::105) SIGFPE received, aborting Aerospike Community Edition build 3.7.0.2 os el6
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: found 16 frames
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_fpe+0x62) [0x48c1da]
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 1: /lib64/libc.so.6(+0x35650) [0x7f5fbd716650]
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 2: /usr/bin/asd(shash_get+0x1a) [0x54802b]
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 3: /usr/bin/asd(as_sindex_sbins_populate+0x60f) [0x4993b0]
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 4: /usr/bin/asd(write_local_sindex_update+0x394) [0x4c7869]
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 5: /usr/bin/asd(write_local_ssd+0x3ec) [0x4ce474]
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 6: /usr/bin/asd(write_local+0x5da) [0x4d1fcf]
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 7: /usr/bin/asd() [0x4d2590]
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 8: /usr/bin/asd(finish_rw_process_dup_ack+0x7a3) [0x4d3dc4]
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 9: /usr/bin/asd(rw_process_ack+0x41a) [0x4d4455]
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 10: /usr/bin/asd(write_msg_fn+0x1f7) [0x4d49ac]
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 11: /usr/bin/asd(fabric_process_read_msg+0x474) [0x4ee9ca]
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 12: /usr/bin/asd(fabric_process_readable+0x3d) [0x4eec98]
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 13: /usr/bin/asd(fabric_worker_fn+0x41c) [0x4ef160]
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 14: /lib64/libpthread.so.0(+0x7df5) [0x7f5fbe8e9df5]
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 15: /lib64/libc.so.6(clone+0x6d) [0x7f5fbd7d7bfd]
Dec 17 2015 10:38:26 GMT: INFO (as): (as.c::410) <><><><><><><><><><>  Aerospike Community Edition build 3.7.0.2  <><><><><><><><><><>

We are running on Amazon EC2, 4.1.10-17.31.amzn1.x86_64 #1 SMP Sat Oct 24 01:31:37 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux, all system updates installed.

What do you suggest?

Dean


#2

Which version were you using prior to 3.7.0.2? Could you share more about your use case and the features you are using? Can you supply your Aerospike configuration so we can attempt to reproduce the crash? (You might want to mask the IPs and other sensitive information.)

From a brief look at the stack trace, it seems you are using secondary indexes. Can we get more information specific to your secondary index use case? Also, has this issue occurred again since the initial crash? If it has, we would like the latest stack trace as well.
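For anyone reading a trace like the one above, the function names can be pulled out of the log with a short script. This is only a sketch: the regular expression is written against the frame lines shown in this thread, and `parse_frames` and `sindex_frames` are hypothetical helper names, not part of any Aerospike tool.

```python
import re

# Matches lines such as:
#   ... stacktrace: frame 2: /usr/bin/asd(shash_get+0x1a) [0x54802b]
# The symbol and offset may be missing, e.g. "/usr/bin/asd() [0x4d2590]".
FRAME_RE = re.compile(
    r"stacktrace: frame (\d+): (\S+?)\((\w*)(?:\+0x[0-9a-f]+)?\) \[(0x[0-9a-f]+)\]"
)


def parse_frames(log_text):
    """Return (frame_number, binary, symbol_or_None) for each frame line."""
    frames = []
    for m in FRAME_RE.finditer(log_text):
        frames.append((int(m.group(1)), m.group(2), m.group(3) or None))
    return frames


def sindex_frames(frames):
    """Return the symbols that mention secondary indexes ("sindex")."""
    return [sym for _, _, sym in frames if sym and "sindex" in sym]


SAMPLE = """\
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 2: /usr/bin/asd(shash_get+0x1a) [0x54802b]
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 3: /usr/bin/asd(as_sindex_sbins_populate+0x60f) [0x4993b0]
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 4: /usr/bin/asd(write_local_sindex_update+0x394) [0x4c7869]
Dec 17 2015 10:37:55 GMT: WARNING (as): (signal.c::107) stacktrace: frame 7: /usr/bin/asd() [0x4d2590]
"""

if __name__ == "__main__":
    for name in sindex_frames(parse_frames(SAMPLE)):
        print(name)
```

Frames 3 and 4 are the ones that point at secondary indexes here.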

Jerry


#3

Hi,

I am Dean’s colleague. We were using 3.6.4 CE on AWS, which worked fine. We then upgraded to 3.7.0.1; the server worked for about a day, and then 2 nodes started shutting down consistently. We upgraded to 3.7.0.2 and the problem reoccurred. After that we downgraded back to 3.6.4, and the server works now.

We are using AS as the main database for our web application. Currently, AS is not under heavy load, as it is used in a DEV environment. We are using 25 sets and 43 secondary indexes.

Info about indexes:

indextype: NONE, LIST, MAPKEYS

num_bins: 1

state: RW

sync_state: synced

type: STRING, NUMERIC

Configuration:

service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    pidfile /var/run/aerospike/asd.pid
    service-threads 4
    transaction-queues 4
    transaction-threads-per-queue 4
    proto-fd-max 15000
}

logging {
    # Log file must be an absolute path.
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}

network {
    service {
        address any
        port 3000
    }

    heartbeat {
        mode mesh
        port 3002 # Heartbeat port for this node.

        # List one or more other nodes, one ip-address & port per line:
        mesh-seed-address-port someIp 3002
        mesh-seed-address-port someIp 3002

        interval 250
        timeout 10
    }

    fabric {
        port 3001
    }

    info {
        port 3003
    }
}

namespace someNamespace {
    replication-factor 3
    memory-size 10G
    default-ttl 0 # use 0 to never expire/evict.

    storage-engine device {
        device /dev/sdb
        # The 2 lines below optimize for SSD.
        scheduler-mode noop
        write-block-size 128K
    }
}

#4

This is likely due to empty lists not being handled for sindex. If an empty list was used for a sindex at any time, it would cause a divide-by-zero crash, which matches the SIGFPE in the trace. It’ll be fixed shortly.

If possible, please confirm the existence/use of empty lists when the above happened. Thank you.
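To illustrate the failure mode, here is a minimal sketch of how an empty list can end in a divide by zero inside a hash lookup. It is not the server's code: `BucketTable` is a toy stand-in for a fixed-size hash table, and sizing the table from the list length is an assumption made only for this example.

```python
class BucketTable:
    """Toy fixed-size hash table (hypothetical, not Aerospike's shash)."""

    def __init__(self, n_buckets):
        self.n_buckets = n_buckets
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # With n_buckets == 0 this modulo fails. Python raises
        # ZeroDivisionError; the equivalent integer operation in C
        # on x86 delivers SIGFPE to the process.
        return self.buckets[hash(key) % self.n_buckets]

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))


def unique_index_values(values):
    """Collect the distinct list elements to index (assumed workflow)."""
    if not values:
        # Guard: an empty list has nothing to index, so never build
        # or query a zero-sized table.
        return []
    table = BucketTable(len(values))  # sized from the list (assumption)
    out = []
    for v in values:
        if table.get(v) is None:
            table.put(v, True)
            out.append(v)
    return out


if __name__ == "__main__":
    print(unique_index_values(["a", "b", "a"]))
    print(unique_index_values([]))
    try:
        BucketTable(0).get("a")  # unguarded lookup on a zero-sized table
    except ZeroDivisionError as exc:
        print("lookup failed:", exc)
```

The guard at the top of `unique_index_values` is the kind of empty-list check the fix would need in some form; where it actually belongs in the server is not something this sketch can say.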


#5

It’s most likely that an empty list was in a secondary index. When can we expect the fix?


#6

The typical release cycle is about 2 weeks. This is already tracked internally as AER-4647.