Segmentation Fault (SIGSEGV) when enabling data-in-memory


#1

One of the issues we encountered is that when memory caching (data-in-memory) is enabled, the “asd” daemon crashes after about 30-60 seconds, even under low load (a few thousand inserted records with no more than 10 bins each).

The Amazon machine (c3.xlarge) running Aerospike has 7.3GB of memory, of which only about 2-3% was being used by asd at the time of the crash. The Aerospike deployment has just one node.

We have temporarily disabled the data-in-memory feature (as you can see in the config below). Performance is not critical for the functional tests, but we need data-in-memory for the load tests we have planned for the coming week.

Any help fixing this issue would be greatly appreciated.

Technical details below.

Configuration:

# Aerospike database configuration file.

service {
        user root
        group root
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        pidfile /var/run/aerospike/asd.pid
        service-threads 4
        transaction-queues 4
        transaction-threads-per-queue 4
        proto-fd-max 15000
}

logging {
        # Log file must be an absolute path.
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}

network {
        service {
                address any
                port 3000
        }

        heartbeat {
                mode multicast
                address 239.1.99.222
                port 9918

                # To use unicast-mesh heartbeats, remove the 3 lines above, and see
                # aerospike_mesh.conf for alternative.

                interval 150
                timeout 10
        }

        fabric {
                port 3001
        }

        info {
                port 3003
        }
}

#namespace test {
#       replication-factor 2
#       memory-size 1G
#       default-ttl 30d # 30 days, use 0 to never expire/evict.
#
#       storage-engine memory
#}

namespace XXXXXXXXX {
        replication-factor 1
        memory-size 1G
        default-ttl 30d # 30 days, use 0 to never expire/evict.

        #storage-engine memory

#        To use file storage backing, comment out the line above and use the
#        following lines instead.
        storage-engine device {
                file /opt/aerospike/data/bar.dat
                filesize 16G
                #data-in-memory true # Store data in memory in addition to file.
        }

The section of the Aerospike log describing the crash:

Mar 08 2016 10:30:16 GMT: WARNING (as): (signal.c::161) SIGSEGV received, aborting Aerospike Community Edition build 3.7.4 os el6
Mar 08 2016 10:30:16 GMT: WARNING (as): (signal.c::163) stacktrace: found 7 frames
Mar 08 2016 10:30:16 GMT: WARNING (as): (signal.c::163) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x32) [0x48d828]
Mar 08 2016 10:30:16 GMT: WARNING (as): (signal.c::163) stacktrace: frame 1: /lib64/libc.so.6(+0x35670) [0x7fe70a020670]
Mar 08 2016 10:30:16 GMT: WARNING (as): (signal.c::163) stacktrace: frame 2: /usr/bin/asd(cf_queue_push+0xc) [0x54b0e1]
Mar 08 2016 10:30:16 GMT: WARNING (as): (signal.c::163) stacktrace: frame 3: /usr/bin/asd(ssd_post_write+0x3d2) [0x518443]
Mar 08 2016 10:30:16 GMT: WARNING (as): (signal.c::163) stacktrace: frame 4: /usr/bin/asd(ssd_write_worker+0x14c) [0x5187c5]
Mar 08 2016 10:30:16 GMT: WARNING (as): (signal.c::163) stacktrace: frame 5: /lib64/libpthread.so.0(+0x7dc5) [0x7fe70b1f3dc5]
Mar 08 2016 10:30:16 GMT: WARNING (as): (signal.c::163) stacktrace: frame 6: /lib64/libc.so.6(clone+0x6d) [0x7fe70a0e1bdd]

Aerospike aborts and stops.


#2

Did you provide the full config file? If so, you seem to be missing the closing brace } for the namespace config, which would explain the issue you are observing.
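
A quick way to check a config for a missing brace before restarting the server is to count braces outside of comments. The following is only a minimal sketch in Python: the function name brace_balance and the sample text are made up for illustration, and this is not how asd itself validates its config.

```python
def brace_balance(config_text):
    """Return (# of '{') minus (# of '}'), ignoring text after a '#' comment marker."""
    balance = 0
    for line in config_text.splitlines():
        code = line.split("#", 1)[0]  # drop trailing comments
        balance += code.count("{") - code.count("}")
    return balance


# Hypothetical sample: the namespace context is never closed.
sample = """
namespace example {
        replication-factor 1
        storage-engine device {
                file /opt/aerospike/data/bar.dat
        }
"""

# A non-zero result means at least one context is unbalanced.
print(brace_balance(sample))
```

A result of zero only says the counts match. It cannot tell whether the braces are placed where the parser expects them.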


#3

Hi @meher, I'm having exactly the same problem with Aerospike Community server 3.14.1.4, and I saw the same behavior with 3.14.1.3, 3.13.0.7 and 3.9.1.1 when I tested them.

My full config is:

service {
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    proto-fd-max 15000
}
  
logging {
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}
  
network {
    service {
        address any
        port 3000
        access-address 10.8.0.175
    }
 
    heartbeat {
        mode mesh
        address 10.8.0.175
        port 3002
        mesh-seed-address-port 10.8.0.175 3002
        mesh-seed-address-port 10.8.0.176 3002
        interval 150
        timeout 10
    }
 
    fabric {
        port 3001
    }
 
    info {
        port 3003
    }
}
 
namespace  tagstore {
    rack-id 1
    memory-size 8G
    replication-factor 2

    storage-engine device {
        file /data/tagstore.data
        filesize 1G
#        data-in-memory true
    }

    default-ttl 0
    high-water-disk-pct 75
    high-water-memory-pct 90
    stop-writes-pct 98
 
    set {}       # (Optional) Set specific record policies
}

This occurs when I try to store 10k data points or more.


#4

Could you provide the stack trace from the latest build you have tried?


#5

Hi @kporter, sure. I'm attaching logs from my 2-node cluster: logs.tar.gz (4.3 KB)


#6

The config parser doesn’t support having the open and close brace on the same line, so the "set {}" line in your config becomes equivalent to not having the final brace.

We will be adding a check in the post config processing to ensure that all contexts have been closed.
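
To illustrate the kind of check meant here, below is a rough Python sketch of a line-oriented model. It is not the actual asd parser: the function name unclosed_contexts and the rule that a trailing token starting with "{" opens a context while only a line starting with "}" closes one are assumptions chosen to reproduce the behavior described above.

```python
def unclosed_contexts(config_text):
    """Return the contexts still open at end of file under a line-oriented model."""
    stack = []
    for raw in config_text.splitlines():
        tokens = raw.split("#", 1)[0].split()  # strip comments, split on whitespace
        if not tokens:
            continue
        if tokens[0] == "}":
            # Only a line that starts with '}' closes the innermost context.
            if stack:
                stack.pop()
        elif tokens[-1].startswith("{"):
            # Assumption: a trailing token starting with '{' opens a context,
            # so "set {}" opens one and does not close it on the same line.
            stack.append(tokens[0])
    return stack


# Hypothetical, shortened version of the namespace from post #3.
sample = """
namespace tagstore {
    storage-engine device {
        file /data/tagstore.data
    }
    set {}
}
"""

# Anything left on the stack was never closed.
print(unclosed_contexts(sample))
```

Under this model the final brace closes the set context instead of the namespace, so the namespace is reported as still open. Removing the "set {}" line leaves nothing unclosed.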


#7

Yep, looks like that was the issue. Thank you!