The issue occurs in the test namespace: Aerospike (AS) cannot load bar.dat properly.
# Aerospike database configuration file for use with systemd.

service {
	paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
	service-threads 4
	transaction-queues 4
	transaction-threads-per-queue 4
	proto-fd-max 15000
}

logging {
	file /var/log/aerospike/aerospike.log {
		context any info
		context migrate debug
	}
}

network {
	service {
		address any
		port 3000
	}

	heartbeat {
		mode multicast
		address 239.1.99.222
		port 9918

		# To use unicast-mesh heartbeats, remove the 3 lines above, and see
		# aerospike_mesh.conf for alternative.

		interval 150
		timeout 10
	}

	fabric {
		port 3001
	}

	info {
		port 3003
	}
}
namespace test {
	replication-factor 2
	memory-size 2G
	default-ttl 0 # Use 0 to never expire/evict.

	# storage-engine memory

	# To use in-memory storage, use the line above instead of the
	# following file-backed storage block.
	storage-engine device {
		file /opt/aerospike/data/bar.dat
		filesize 16G
		data-in-memory true # Store data in memory in addition to file.
	}
}
namespace mustang {
	replication-factor 2
	memory-size 500M
	default-ttl 0 # Use 0 to never expire/evict.

	# storage-engine memory

	# To use in-memory storage, use the line above instead of the
	# following file-backed storage block.
	storage-engine device {
		file /opt/aerospike/data/mustang.dat
		filesize 16G
		data-in-memory false # Do not also keep data in memory; serve from file.
	}
}
Have you looked at basic capacity planning numbers?
Follow this: http://www.aerospike.com/docs/operations/plan/capacity and check whether you have adequate RAM to store your index (64 bytes per record) and, since data-in-memory is true (for the namespace backed by bar.dat), adequate RAM to store the records themselves. Do a basic estimate of the record size per the link above: overhead plus set name size, number of bins, size of the data in each bin, bin overhead, etc.
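To make that concrete, here is a rough back-of-the-envelope sketch. The 64 bytes per record for the primary index comes from the capacity-planning page above; the average record size is a made-up placeholder you would replace with your own estimate:

```python
# Rough RAM estimate for a namespace with data-in-memory true.
RECORDS = 3_000_000           # e.g. ~3M records, as mentioned later in this thread
INDEX_BYTES_PER_RECORD = 64   # primary-index cost per record (per the docs)
ASSUMED_RECORD_BYTES = 350    # placeholder: overhead + set name + bins + data

index_mib = RECORDS * INDEX_BYTES_PER_RECORD / 1024**2
data_mib = RECORDS * ASSUMED_RECORD_BYTES / 1024**2

print(f"index: ~{index_mib:.0f} MiB")
print(f"data:  ~{data_mib:.0f} MiB (counted only because data-in-memory is true)")
print(f"total: ~{index_mib + data_mib:.0f} MiB vs memory-size 2048 MiB")
```

With these illustrative numbers, 3M records already push past a 2G memory-size, which is consistent with the high-water-mark warnings below.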
You are breaching the memory high-water mark during cold start, but since there are no records eligible for eviction (all records are set to never expire, with ttl 0), the server will keep loading data until it reaches the stop-writes limit, at which point it may indeed run out of memory.
Dec 29 2016 09:12:53 GMT: WARNING (nsup): (thr_nsup.c:254) {test} cold-start found no records eligible for eviction
Dec 29 2016 09:12:53 GMT: WARNING (nsup): (thr_nsup.c:381) {test} hwm breached but no records to evict
Dec 29 2016 09:12:53 GMT: WARNING (namespace): (namespace.c:440) {test} hwm_breached true (memory), stop_writes false, memory sz:1305638974 (255062720 + 1048205794) hwm:1288490240 sw:1932735232, disk sz:3240409472 hwm:8589934592
Make sure you don’t over-provision your cluster for memory, and leave enough RAM for your OS.
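For reference, the two thresholds appearing in those warnings are per-namespace settings. A sketch of where they live (the values shown are the usual defaults, 60% for the eviction high-water mark and 90% for stop-writes; tune them to your own sizing):

```
namespace test {
	...
	high-water-memory-pct 60  # evictions begin above 60% of memory-size
	stop-writes-pct 90        # writes are refused above 90% of memory-size
}
```

Note that with default-ttl 0 and no per-record TTLs, crossing high-water-memory-pct produces exactly the “no records to evict” warnings shown above, since nothing is eligible for eviction.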
I ran into the same scenario.
I am working on my local machine, and I ended up pushing more than 3M records into the system.
After that, insertions started failing: I was getting stop_writes as true, but read queries were working fine.
I had to restart the Aerospike service, and since then it won’t start up.
The reason is perhaps what you listed: TTL is 0, and in the logs I am getting this:
Aug 01 2017 07:16:09 GMT: INFO (nsup): (thr_nsup.c:333) {account} cold-start building eviction histogram …
Aug 01 2017 07:16:09 GMT: WARNING (nsup): (thr_nsup.c:262) {account} cold-start found no records eligible for eviction
Aug 01 2017 07:16:09 GMT: WARNING (nsup): (thr_nsup.c:394) {account} hwm breached but no records to evict
Aug 01 2017 07:16:09 GMT: WARNING (namespace): (namespace.c:453) {account} hwm_breached true (memory), stop_writes false, memory sz:3686300061 (216465408 + 3469834653) hwm:2576980377 sw:3865470566, disk sz:4329308032 hwm:8589934592
How can I recover from this? This is a scenario that could also happen if I were ever hit by a DDoS attack after going live.
How can I get my service running again?
namespace account {
	replication-factor 2
	memory-size 8G
	default-ttl 0 # Use 0 to never expire/evict.

	storage-engine device {
		file /opt/aerospike/data/bar.dat
		filesize 16G
		data-in-memory true # Store data in memory in addition to file.
	}
}
What I expected this to do was write records to disk (with a filesize of 16G) and also keep the data in memory in addition to the file, which I wanted purely for performance.
I was assuming data-in-memory works more like a cache, but the problem I ran into suggests that’s not the case.
Why, then, should I be using the data-in-memory flag here?
You would use data-in-memory true to get faster read access to your data, with the file used for persistence when restarting a node. It is not a cache: you cannot have some data in memory and some only on disk…
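Expanding on that: one way out of the cold-start loop described above (a sketch, not official guidance; whether the larger value fits depends on your host’s free RAM) is to temporarily raise memory-size for the namespace so the cold start can finish loading all records, then restart the service (e.g. systemctl restart aerospike on a systemd install):

```
namespace account {
	replication-factor 2
	memory-size 12G # temporarily raised from 8G so the cold start can complete
	default-ttl 0

	storage-engine device {
		file /opt/aerospike/data/bar.dat
		filesize 16G
		data-in-memory true
	}
}
```

Once the node is up, you can delete or expire data and then lower memory-size again. If the data is disposable (e.g. a local test machine), stopping the service and deleting the file lets the namespace start empty instead.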