Hi,
On 30.11.19 our 9-node cluster completely fell apart, leaving only one node in the cluster. All other 8 nodes went into an automatic cold start and have been trying to restart ever since. This is the log of one of the nodes:
Dec 24 2019 14:23:09 GMT: INFO (nsup): (thr_nsup.c:333) {test} cold-start building eviction histogram ...
Dec 24 2019 14:23:11 GMT: INFO (drv_ssd): (drv_ssd.c:4010) {test} loaded 795079120 records, 0 subrecords, /dev/sdc1 81%, /dev/sdc2 82%, /dev/sdc3 81%
Dec 24 2019 14:23:13 GMT: INFO (drv_ssd): (drv_ssd.c:4010) {test} loaded 795079120 records, 0 subrecords, /dev/sdc1 81%, /dev/sdc2 82%, /dev/sdc3 81%
Dec 24 2019 14:23:15 GMT: INFO (drv_ssd): (drv_ssd.c:4010) {test} loaded 795079120 records, 0 subrecords, /dev/sdc1 81%, /dev/sdc2 82%, /dev/sdc3 81%
Dec 24 2019 14:23:17 GMT: INFO (drv_ssd): (drv_ssd.c:4010) {test} loaded 795079120 records, 0 subrecords, /dev/sdc1 81%, /dev/sdc2 82%, /dev/sdc3 81%
Dec 24 2019 14:23:18 GMT: WARNING (nsup): (thr_nsup.c:262) {test} cold-start found no records eligible for eviction
Dec 24 2019 14:23:18 GMT: WARNING (nsup): (thr_nsup.c:394) {test} hwm breached but no records to evict
Dec 24 2019 14:23:18 GMT: WARNING (namespace): (namespace.c:474) {test} hwm_breached true (disk), stop_writes false, memory sz:50885200768 (50885126016 + 0) hwm:54116587929 sw:81174881894, disk sz:711294503936 hwm:563714457600
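If I read that last warning correctly: disk usage (sz 711294503936 bytes, about 662 GiB) is above the disk high-water mark (563714457600 bytes, about 525 GiB), while memory (about 47 GiB used) is still below its roughly 50 GiB hwm; and since our default-ttl is 0, nsup finds no records eligible for eviction during the cold start.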
Each node shows a different completion percentage in these "loaded ... records" lines, ranging from about 70% to 88%.
Config file:
service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    pidfile /var/run/aerospike/asd.pid
    service-threads 4
    transaction-queues 4
    transaction-threads-per-queue 4
    proto-fd-max 15000
}
logging {
    # Log file must be an absolute path.
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}
namespace test {
    replication-factor 2
    memory-size 84G
    default-ttl 0 # 30 days, use 0 to never expire/evict.
    storage-engine device { # Configure the storage-engine to use persistence. Maximum size is 2 TiB.
        device /dev/sdc1
        device /dev/sdc2
        device /dev/sdc3
        write-block-size 128K
        defrag-lwm-pct 60
    }
}
The heartbeat is in mesh mode. Each server has 40 cores, 128 GB of RAM and a 1 TB SSD.
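The network/heartbeat section is not included above; it is a standard mesh stanza, roughly along these lines (the seed address and timing values below are placeholders/defaults for illustration, not necessarily our exact settings):
network {
    service {
        address any
        port 3000
    }
    heartbeat {
        mode mesh
        port 3002
        mesh-seed-address-port 10.0.0.1 3002 # placeholder seed address
        interval 150 # default heartbeat interval (ms)
        timeout 10 # default number of missed heartbeats before a node is considered gone
    }
    fabric {
        port 3001
    }
    info {
        port 3003
    }
}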
I don't know exactly why the cluster fell apart and cold-restarted itself automatically. My questions are:
- Why might this happen?
- Could the fact that all the nodes are restarting at the same time cause data loss?
- Does the fact that all the nodes are restarting at the same time make the restart slower?
- Is there any way to make the process faster?
Thank you in advance