Aerospike 3.8.3 cannot start up: it keeps loading, but loading the data file hangs

The log is as follows:

Dec 29 2016 09:12:49 GMT: INFO (nsup): (thr_nsup.c:322) {test} cold-start building eviction histogram …
Dec 29 2016 09:12:51 GMT: INFO (drv_ssd): (drv_ssd.c:3990) {test} loaded 3984368 records, 0 subrecords, /opt/aerospike/data/bar.dat 19%
Dec 29 2016 09:12:53 GMT: INFO (drv_ssd): (drv_ssd.c:3990) {test} loaded 3984368 records, 0 subrecords, /opt/aerospike/data/bar.dat 19%
Dec 29 2016 09:12:53 GMT: INFO (drv_ssd): (drv_ssd.c:2088) device /opt/aerospike/data/mustang.dat: used 1122816, contig-free 16381M (16381 wblocks), swb-free 0, w-q 0 w-tot 0 (0.0/s), defrag-q 0 defrag-tot 1 (0.0/s) defrag-w-tot 0 (0.0/s)
Dec 29 2016 09:12:53 GMT: WARNING (nsup): (thr_nsup.c:254) {test} cold-start found no records eligible for eviction
Dec 29 2016 09:12:53 GMT: WARNING (nsup): (thr_nsup.c:381) {test} hwm breached but no records to evict
Dec 29 2016 09:12:53 GMT: WARNING (namespace): (namespace.c:440) {test} hwm_breached true (memory), stop_writes false, memory sz:1305638974 (255062720 + 1048205794) hwm:1288490240 sw:1932735232, disk sz:3240409472 hwm:8589934592

Can you share your namespace configuration from the .conf file?

The issue occurred in the test namespace; AS cannot load bar.dat properly.

# Aerospike database configuration file for use with systemd.

service {
	paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
	service-threads 4
	transaction-queues 4
	transaction-threads-per-queue 4
	proto-fd-max 15000
}

logging {
	file /var/log/aerospike/aerospike.log {
		context any info
		context migrate debug
	}
}

network {
	service {
		address any
		port 3000
	}

	heartbeat {
		mode multicast
		address 239.1.99.222
		port 9918

		# To use unicast-mesh heartbeats, remove the 3 lines above, and see
		# aerospike_mesh.conf for alternative.

		interval 150
		timeout 10
	}

	fabric {
		port 3001
	}

	info {
		port 3003
	}
}

namespace test {
	replication-factor 2
	memory-size 2G
	default-ttl 0 # 30 days, use 0 to never expire/evict.

	# storage-engine memory

	# To use file storage backing, comment out the line above and use the
	# following lines instead.
	storage-engine device {
		file /opt/aerospike/data/bar.dat
		filesize 16G
		data-in-memory true # Store data in memory in addition to file.
	}
}

namespace mustang {
	replication-factor 2
	memory-size 500M
	default-ttl 0 # 30 days, use 0 to never expire/evict.

	# storage-engine memory

	# To use file storage backing, comment out the line above and use the
	# following lines instead.
	storage-engine device {
		file /opt/aerospike/data/mustang.dat
		filesize 16G
		data-in-memory false # Store data in memory in addition to file.
	}
}

Have you looked at basic capacity planning numbers?

Follow this: http://www.aerospike.com/docs/operations/plan/capacity and check whether you have adequate RAM to store your index (64 bytes per record) and, since data-in-memory is true (for the test namespace), adequate RAM to store the records as well. Do a basic estimate of the record size per the link above: record overhead plus set-name size, number of bins, size of the data in each bin, bin overhead, etc.
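As a rough illustration of that arithmetic, here is a back-of-the-envelope sketch extrapolating from the 19% load-progress line in the log above. The linear extrapolation is a crude assumption, not an exact figure:

    # Capacity estimate for the {test} namespace, extrapolated from the
    # cold-start log above (crude linear extrapolation from 19% loaded).

    GIB = 1024 ** 3

    records_at_19_pct = 3_984_368                  # "loaded ... 19%" log lines
    est_total_records = records_at_19_pct / 0.19   # assumes uniform record density

    index_bytes = est_total_records * 64           # primary index: 64 B per record

    data_bytes_at_19_pct = 1_048_205_794           # data part of "memory sz" in the WARNING
    est_data_bytes = data_bytes_at_19_pct / 0.19

    print(f"estimated records:   {est_total_records:,.0f}")       # ~21.0M
    print(f"estimated index RAM: {index_bytes / GIB:.2f} GiB")    # ~1.25 GiB
    print(f"estimated data RAM:  {est_data_bytes / GIB:.2f} GiB") # ~5.14 GiB
    print("configured memory-size: 2 GiB")

Roughly 1.25 GiB of index plus over 5 GiB of data against a 2 GiB memory-size, so the full data set cannot possibly fit with data-in-memory true.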

You are perhaps running out of RAM.

You are breaching the memory high-water mark during cold start, but since there are no records to evict (all records are set to never expire - ttl 0), the server will continue to load data until it reaches stop-writes, which may indeed then take you out of memory:

Dec 29 2016 09:12:53 GMT: WARNING (nsup): (thr_nsup.c:254) {test} cold-start found no records eligible for eviction
Dec 29 2016 09:12:53 GMT: WARNING (nsup): (thr_nsup.c:381) {test} hwm breached but no records to evict
Dec 29 2016 09:12:53 GMT: WARNING (namespace): (namespace.c:440) {test} hwm_breached true (memory), stop_writes false, memory sz:1305638974 (255062720 + 1048205794) hwm:1288490240 sw:1932735232, disk sz:3240409472 hwm:8589934592
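Those hwm and sw figures line up with the defaults. A quick check, assuming the default high-water-memory-pct of 60 and stop-writes-pct of 90 (the logged values differ only by block rounding):

    # How the hwm/sw figures in the WARNING line are derived, assuming the
    # default high-water-memory-pct 60 and stop-writes-pct 90.

    memory_size = 2 * 1024 ** 3            # memory-size 2G for the test namespace

    hwm = memory_size * 60 // 100          # 1288490188; log shows 1288490240 (rounded)
    stop_writes = memory_size * 90 // 100  # 1932735283; log shows 1932735232 (rounded)

    used = 1_305_638_974                   # "memory sz" from the WARNING line

    print("hwm breached:", used > hwm)             # True  -> eviction is attempted
    print("stop_writes hit:", used > stop_writes)  # False -> loading continues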

Make sure you don’t over-provision your cluster’s memory, and leave enough RAM for your OS.
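A quick sanity check against the posted config; the host-RAM and headroom figures here are hypothetical, purely for illustration:

    # Sum of namespace memory-size vs. host RAM (host_ram and os_headroom
    # are hypothetical values, for illustration only).

    GIB = 1024 ** 3
    MIB = 1024 ** 2

    namespaces = {"test": 2 * GIB, "mustang": 500 * MIB}  # memory-size from the config
    host_ram = 8 * GIB        # hypothetical host
    os_headroom = 1 * GIB     # rule of thumb: keep some RAM free for the OS

    total = sum(namespaces.values())
    print(f"sum of memory-size: {total / GIB:.2f} GiB")               # 2.49 GiB
    print("fits with OS headroom:", total + os_headroom <= host_ram)  # True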

I ran into the same scenario. I am working on my local machine, and I ended up pushing more than 3M records into the system, after which insertions started failing. I was getting stop_writes as true, but read queries were working fine. I had to restart the Aerospike service, and thereafter it's not starting up. The reason is perhaps what you listed: TTL is 0, and in the logs I am getting this:

Aug 01 2017 07:16:09 GMT: INFO (nsup): (thr_nsup.c:333) {account} cold-start building eviction histogram …
Aug 01 2017 07:16:09 GMT: WARNING (nsup): (thr_nsup.c:262) {account} cold-start found no records eligible for eviction
Aug 01 2017 07:16:09 GMT: WARNING (nsup): (thr_nsup.c:394) {account} hwm breached but no records to evict
Aug 01 2017 07:16:09 GMT: WARNING (namespace): (namespace.c:453) {account} hwm_breached true (memory), stop_writes false, memory sz:3686300061 (216465408 + 3469834653) hwm:2576980377 sw:3865470566, disk sz:4329308032 hwm:8589934592

How can I recover from this? This is a scenario that could happen if I ever face a DDoS on my system once I go live. How can I bring my service back up?

I increased the memory footprint in aerospike.conf and it started.
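For reference, the figures in the log explain why that works. The sketch below assumes the default 60% high-water mark and infers the original memory-size of 4G from the hwm:2576980377 value in the log:

    # Why bumping memory-size lets the cold start finish (assumes the default
    # high-water-memory-pct 60; the original 4G is inferred from the log).

    GIB = 1024 ** 3
    used = 3_686_300_061                     # "memory sz" from the WARNING line

    for memory_size in (4 * GIB, 8 * GIB):   # before and after the increase
        hwm = memory_size * 60 // 100
        print(f"memory-size {memory_size // GIB}G -> hwm {hwm:,}, "
              f"breached: {used > hwm}")
    # 4G -> hwm 2,576,980,377, breached: True   (cold start loops on the warning)
    # 8G -> hwm 5,153,960,755, breached: False  (the load completes)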

alternatively, just wipe out your disks :wink:

My configuration looks something like this

namespace account {
	replication-factor 2
	memory-size 8G
	default-ttl 0 # 30 days, use 0 to never expire/evict.

	storage-engine device {
		file /opt/aerospike/data/bar.dat
		filesize 16G
		data-in-memory true # Store data in memory in addition to file.
	}
}

What I expect this to do is write records to disk in a 16G file and store the data in memory in addition to the file (which I wanted just for performance). I was assuming data-in-memory is more like a cache, but the problem I faced says that's not the case. Why, then, should I be using the data-in-memory flag here?

You would use data-in-memory true to have faster read access to your data and use the file for persistence when restarting a node. You cannot have some data in memory and some on disk…
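To put numbers on that trade-off, here is a rough sketch using the {account} figures from the log above; the average record size is derived from those figures, not measured:

    # RAM cost of data-in-memory true vs. false, using the {account}
    # "memory sz" split from the log (index 216465408 B, data 3469834653 B).

    GIB = 1024 ** 3

    index_bytes = 216_465_408         # index portion of "memory sz"
    data_bytes = 3_469_834_653        # data portion of "memory sz"
    records = index_bytes // 64       # 64 B per record in the primary index

    print(f"records:              ~{records:,}")                   # ~3.4M
    print(f"avg record size:      ~{data_bytes / records:,.0f} B") # ~1,026 B
    print(f"data-in-memory false: {index_bytes / GIB:.2f} GiB")    # index only
    print(f"data-in-memory true:  {(index_bytes + data_bytes) / GIB:.2f} GiB")

With data-in-memory false, RAM holds just the index and reads go to the file; with true, the entire data set must also fit in memory, which is exactly why the cold start above breached the high-water mark.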