All nodes broke out of the cluster and performed an automatic cold start; it is extremely slow (24 days and counting...). Build 3.15.0.1

ilan · December 24, 2019, 3:10pm

Hi,

On 30.11.19 our 9-node cluster went down completely, leaving only one node in the cluster. The other 8 nodes performed an automatic cold start and have been trying to restart ever since. This is the log from one of the nodes:

Dec 24 2019 14:23:09 GMT: INFO (nsup): (thr_nsup.c:333) {test} cold-start building eviction histogram ...
Dec 24 2019 14:23:11 GMT: INFO (drv_ssd): (drv_ssd.c:4010) {test} loaded 795079120 records, 0 subrecords, /dev/sdc1 81%, /dev/sdc2 82%, /dev/sdc3 81%
Dec 24 2019 14:23:13 GMT: INFO (drv_ssd): (drv_ssd.c:4010) {test} loaded 795079120 records, 0 subrecords, /dev/sdc1 81%, /dev/sdc2 82%, /dev/sdc3 81%
Dec 24 2019 14:23:15 GMT: INFO (drv_ssd): (drv_ssd.c:4010) {test} loaded 795079120 records, 0 subrecords, /dev/sdc1 81%, /dev/sdc2 82%, /dev/sdc3 81%
Dec 24 2019 14:23:17 GMT: INFO (drv_ssd): (drv_ssd.c:4010) {test} loaded 795079120 records, 0 subrecords, /dev/sdc1 81%, /dev/sdc2 82%, /dev/sdc3 81%
Dec 24 2019 14:23:18 GMT: WARNING (nsup): (thr_nsup.c:262) {test} cold-start found no records eligible for eviction
Dec 24 2019 14:23:18 GMT: WARNING (nsup): (thr_nsup.c:394) {test} hwm breached but no records to evict
Dec 24 2019 14:23:18 GMT: WARNING (namespace): (namespace.c:474) {test} hwm_breached true (disk), stop_writes false, memory sz:50885200768 (50885126016 + 0) hwm:54116587929 sw:81174881894, disk sz:711294503936 hwm:563714457600

Each node shows a different completion percentage, ranging from 70% to 88%.

Config file:

    service {
        user root
        group root
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        pidfile /var/run/aerospike/asd.pid
        service-threads 4
        transaction-queues 4
        transaction-threads-per-queue 4
        proto-fd-max 15000
    }

    logging {
        # Log file must be an absolute path.
        file /var/log/aerospike/aerospike.log {
                context any info
        }
    }   
    namespace test{
        replication-factor 2
        memory-size 84G
        default-ttl 0 # 30 days, use 0 to never expire/evict.

        storage-engine device {                         # Configure the storage-engine to use
                                                        # persistence. Maximum size is 2 TiB
                device /dev/sdc1
                device /dev/sdc2
                device /dev/sdc3
                                                        # in memory.
                write-block-size 128K
                defrag-lwm-pct   60
        }
    }

The heartbeat is in mesh mode. Each server has 40 cores, 128 GB RAM and a 1 TB SSD.

I don’t know exactly why the cluster went down and cold-started automatically. My questions are:

1. Why might this happen?
2. Can the fact that all the nodes are restarting at the same time cause data loss?
3. Does the fact that all the nodes are restarting at the same time make the restart slower?
4. Is there any way to make the process faster?

Thank you in advance

kporter · January 2, 2020, 8:50pm

Your default-ttl is 0. Does this mean that most or all of your records have no void time?

While starting up, the node loaded more data than high-water-disk-pct allows, which caused the server to enter cold-start eviction. Eviction wasn’t able to find any eligible records (likely because all remaining records have a void time of zero). These nodes are essentially stuck in this loop.
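
The breach can be read directly off the last log line in the question. Below is a minimal Python sketch that pulls the disk figures out of that line and compares them. The capacity estimate relies on one assumption that the thread does not confirm: that high-water-disk-pct is at a default of 50, since the posted config does not set it. The function name `disk_usage` is only illustrative.

```python
import re

# Last line of the log excerpt quoted in the question, copied verbatim.
LOG_LINE = (
    "Dec 24 2019 14:23:18 GMT: WARNING (namespace): (namespace.c:474) {test} "
    "hwm_breached true (disk), stop_writes false, memory sz:50885200768 "
    "(50885126016 + 0) hwm:54116587929 sw:81174881894, "
    "disk sz:711294503936 hwm:563714457600"
)

# Assumption: high-water-disk-pct is at a default of 50, because the posted
# config does not set it. Adjust if the nodes use a different value.
ASSUMED_HWM_PCT = 50


def disk_usage(line, hwm_pct=ASSUMED_HWM_PCT):
    """Return (used_bytes, hwm_bytes, used_pct_of_capacity) for a namespace log line."""
    match = re.search(r"disk sz:(\d+) hwm:(\d+)", line)
    if match is None:
        raise ValueError("no disk figures found in log line")
    used, hwm = int(match.group(1)), int(match.group(2))
    capacity = hwm * 100 / hwm_pct  # capacity implied by the high-water mark
    return used, hwm, 100 * used / capacity


used, hwm, used_pct = disk_usage(LOG_LINE)
print(f"disk used: {used} bytes, high-water mark: {hwm} bytes")
print(f"over the mark by {used - hwm} bytes; about {used_pct:.1f}% of capacity")
```

Under that assumption the disks are well past the high-water mark but far from full, which is consistent with eviction being triggered while nothing is eligible to be evicted.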

Since you seem to be managing your own deletions, you can increase high-water-disk-pct to 100 and restart these nodes.
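
As a rough sketch of that change, the namespace stanza from the question would gain one line. This assumes that in the 3.x configuration layout high-water-disk-pct is set at the namespace level rather than inside the storage-engine block; that placement is worth checking against the configuration reference for the exact build before restarting.

```
namespace test {
    replication-factor 2
    memory-size 84G
    default-ttl 0
    high-water-disk-pct 100    # assumed placement; the posted config leaves it unset

    storage-engine device {
        device /dev/sdc1
        device /dev/sdc2
        device /dev/sdc3
        write-block-size 128K
        defrag-lwm-pct 60
    }
}
```

With the threshold at 100, a cold start should no longer try to evict on the disk high-water mark, so freeing disk space stays the responsibility of the application's own deletes.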

kporter · January 2, 2020, 8:57pm

On whether restarting all the nodes at the same time can cause data loss: yes. Writes to a node are placed in an in-memory queue before acking the client (unless using strong-consistency with commit-to-device). Since multiple nodes were lost at the same time, there could have been queued, unflushed updates on multiple instances that were not flushed on any other instance.

On whether it makes the restart slower: the cold start is slow because it is stuck trying to evict when there isn’t anything to evict.

On making the process faster: this is covered in my previous response.

