Aerospike 3.8.3 cannot start up: it keeps loading, but loading the file hangs


#1

The log is as follows:

Dec 29 2016 09:12:49 GMT: INFO (nsup): (thr_nsup.c:322) {test} cold-start building eviction histogram …
Dec 29 2016 09:12:51 GMT: INFO (drv_ssd): (drv_ssd.c:3990) {test} loaded 3984368 records, 0 subrecords, /opt/aerospike/data/bar.dat 19%
Dec 29 2016 09:12:53 GMT: INFO (drv_ssd): (drv_ssd.c:3990) {test} loaded 3984368 records, 0 subrecords, /opt/aerospike/data/bar.dat 19%
Dec 29 2016 09:12:53 GMT: INFO (drv_ssd): (drv_ssd.c:2088) device /opt/aerospike/data/mustang.dat: used 1122816, contig-free 16381M (16381 wblocks), swb-free 0, w-q 0 w-tot 0 (0.0/s), defrag-q 0 defrag-tot 1 (0.0/s) defrag-w-tot 0 (0.0/s)
Dec 29 2016 09:12:53 GMT: WARNING (nsup): (thr_nsup.c:254) {test} cold-start found no records eligible for eviction
Dec 29 2016 09:12:53 GMT: WARNING (nsup): (thr_nsup.c:381) {test} hwm breached but no records to evict
Dec 29 2016 09:12:53 GMT: WARNING (namespace): (namespace.c:440) {test} hwm_breached true (memory), stop_writes false, memory sz:1305638974 (255062720 + 1048205794) hwm:1288490240 sw:1932735232, disk sz:3240409472 hwm:8589934592


#2

Can you share your namespace configuration from the .conf file?


#3

The issue occurs in the test namespace; AS cannot load bar.dat properly.

# Aerospike database configuration file for use with systemd.

service {
	paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
	service-threads 4
	transaction-queues 4
	transaction-threads-per-queue 4
	proto-fd-max 15000
}

logging {
	file /var/log/aerospike/aerospike.log {
		context any info
		context migrate debug
	}
}

network {
	service {
		address any
		port 3000
	}

heartbeat {
	mode multicast
	address 239.1.99.222
	port 9918

	# To use unicast-mesh heartbeats, remove the 3 lines above, and see
	# aerospike_mesh.conf for alternative.

	interval 150
	timeout 10
}

fabric {
	port 3001
}

info {
	port 3003
}

}

namespace test {
	replication-factor 2
	memory-size 2G
	default-ttl 0 # 30 days, use 0 to never expire/evict.

# storage-engine memory

# To use file storage backing, comment out the line above and use the
# following lines instead.
storage-engine device {
	file /opt/aerospike/data/bar.dat
	filesize 16G
	data-in-memory true # Store data in memory in addition to file.
}

}

namespace mustang {
    replication-factor 2
    memory-size 500M
    default-ttl 0 # 30 days, use 0 to never expire/evict.

    # storage-engine memory

    # To use file storage backing, comment out the line above and use the
    # following lines instead.
    storage-engine device {
            file /opt/aerospike/data/mustang.dat
            filesize 16G
            data-in-memory false # Store data in memory in addition to file.
    }

}


#4

Have you looked at basic capacity planning numbers?

Follow this: http://www.aerospike.com/docs/operations/plan/capacity and see if you have adequate RAM to store your index (64 bytes per record) and, since data-in-memory is true for the test namespace (the one holding bar.dat), adequate RAM to store the records as well. Do a basic estimate of the record size per the link above: overhead plus set name size, number of bins, size of data in each bin, bin overhead, etc.
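Applying that to the log in the first post gives a rough worked estimate (assuming records are spread evenly through the file; note the memory sz line in the warning breaks down as index + data, and the index portion matches 64 bytes per record):

    255062720 index bytes / 64           ≈ 3.99M records, matching "loaded 3984368 records"
    3984368 records at 19% of bar.dat    → ~21M records expected at 100% load
    21M × 64 bytes                       ≈ 1.34 GB for the index alone
    plus the record data itself (data-in-memory true), against a memory-size of only 2G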

You are perhaps running out of RAM.


#5

You are breaching the memory high-water mark during cold start, but as there are no records to evict (all records are set to never expire, TTL 0), the server will continue to load the data until it reaches stop-writes, which may indeed then take you out of memory.

Dec 29 2016 09:12:53 GMT: WARNING (nsup): (thr_nsup.c:254) {test} cold-start found no records eligible for eviction
Dec 29 2016 09:12:53 GMT: WARNING (nsup): (thr_nsup.c:381) {test} hwm breached but no records to evict
Dec 29 2016 09:12:53 GMT: WARNING (namespace): (namespace.c:440) {test} hwm_breached true (memory), stop_writes false, memory sz:1305638974 (255062720 + 1048205794) hwm:1288490240 sw:1932735232, disk sz:3240409472 hwm:8589934592

Make sure you don't over-provision your cluster's memory, and leave enough RAM for your OS.
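For reference, the thresholds in that warning are what you get from memory-size 2G with the default high-water-memory-pct of 60 and stop-writes-pct of 90 (small differences are rounding):

    memory-size 2G   = 2147483648 bytes
    hwm = 0.60 × 2G  ≈ 1288490189    (log: hwm:1288490240)
    sw  = 0.90 × 2G  ≈ 1932735283    (log: sw:1932735232)
    used 1305638974 is above hwm but below sw, hence
    hwm_breached true, stop_writes false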


#6

I ran into the same scenario. I am working on my local machine, and I ended up pushing more than 3M records into the system, after which insertions started failing. I was getting stop_writes true, but read queries were working fine. I had to restart the Aerospike service, and after that it's not starting up. The reason is perhaps what you listed: TTL is 0, and in the logs I am getting this:

Aug 01 2017 07:16:09 GMT: INFO (nsup): (thr_nsup.c:333) {account} cold-start building eviction histogram …
Aug 01 2017 07:16:09 GMT: WARNING (nsup): (thr_nsup.c:262) {account} cold-start found no records eligible for eviction
Aug 01 2017 07:16:09 GMT: WARNING (nsup): (thr_nsup.c:394) {account} hwm breached but no records to evict
Aug 01 2017 07:16:09 GMT: WARNING (namespace): (namespace.c:453) {account} hwm_breached true (memory), stop_writes false, memory sz:3686300061 (216465408 + 3469834653) hwm:2576980377 sw:3865470566, disk sz:4329308032 hwm:8589934592
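Those thresholds line up with a memory-size of 4G at the same defaults (60% high-water mark, 90% stop-writes), which is presumably what the namespace was set to at the time:

    0.60 × 4294967296 (4G) = 2576980377    (log: hwm:2576980377)
    0.90 × 4294967296 (4G) ≈ 3865470566    (log: sw:3865470566)
    used 3686300061 sits between the two, so the server keeps loading
    while warning on every eviction pass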

How can I recover from this? This is a scenario that could happen if I ever face a DDoS on my system once I go live. How can I get my service back up?


#7

I increased the namespace memory-size in aerospike.conf and it started.
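That works because both thresholds scale with memory-size; at the 8G shown in the configuration below, the 60% high-water mark moves well above what the namespace was actually using:

    0.60 × 8589934592 (8G) ≈ 5153960755 bytes hwm
    used 3686300061 < 5153960755  →  hwm no longer breached, cold start completes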


#8

alternatively, just wipe out your disks :wink:
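For a file-backed namespace like this one, wiping the disk just means deleting the data file while the server is stopped; everything in that namespace is lost, and Aerospike initializes a fresh file on the next start (a sketch assuming the systemd service from the config above):

    systemctl stop aerospike
    rm /opt/aerospike/data/bar.dat    # discards every record in this namespace
    systemctl start aerospike         # cold start against a fresh, empty file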


#9

My configuration looks something like this

namespace account {
    replication-factor 2
    memory-size 8G
    default-ttl 0 # 30 days, use 0 to never expire/evict.

    storage-engine device {
            file /opt/aerospike/data/bar.dat
            filesize 16G
            data-in-memory true # Store data in memory in addition to file.
    }

}

What I expect this to do is write records to disk with a filesize of 16G and store data in memory in addition to the file (which I wanted just for performance). I was assuming data-in-memory is more like a cache, but the problem I faced says that's not the case. Why then should I be using the data-in-memory flag here?


#10

You would use data-in-memory true to have faster read access to your data and use the file for persistence when restarting a node. You cannot have some data in memory and some on disk…
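If cache-like behavior is what you were after, a sketch of the alternative (not a drop-in recommendation): with data-in-memory false only the index (64 bytes per record) has to fit in memory-size, reads are served from the file, and the storage engine's post-write-queue keeps a bounded number of recently written blocks in RAM as a limited cache:

    namespace account {
        replication-factor 2
        memory-size 8G                  # now only has to cover the index
        default-ttl 0
        storage-engine device {
                file /opt/aerospike/data/bar.dat
                filesize 16G
                data-in-memory false    # index in RAM, records read from the file
                post-write-queue 256    # the default; caches recently written blocks
        }
    }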