Node restart is stuck

fastrestart

#1

Hi,

I restarted (cold) a node in my cluster after it was showing Cluster Integrity Check issues. It started loading data quickly, and after about 20 minutes it had loaded roughly the number of records I was expecting (based on the object count reported before I bounced it). However, after that it got stuck, fluctuating up and down around that number for hours. It has been almost 4 hours now and it still hasn’t finished. Any suggestions?

Here is a snippet of the log:

Sep 25 2015 01:58:21 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 301216756 records, 0 subrecords, /dev/xvdc 52%, /dev/xvdb 52%
Sep 25 2015 01:58:23 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 301188191 records, 0 subrecords, /dev/xvdc 52%, /dev/xvdb 52%
Sep 25 2015 01:58:25 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 301159538 records, 0 subrecords, /dev/xvdc 52%, /dev/xvdb 52%
Sep 25 2015 01:58:27 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 301130740 records, 0 subrecords, /dev/xvdc 52%, /dev/xvdb 52%
Sep 25 2015 01:58:29 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 301102425 records, 0 subrecords, /dev/xvdc 52%, /dev/xvdb 52%
Sep 25 2015 01:58:31 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 301073588 records, 0 subrecords, /dev/xvdc 52%, /dev/xvdb 52%
Sep 25 2015 01:58:33 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 301044935 records, 0 subrecords, /dev/xvdc 52%, /dev/xvdb 52%
Sep 25 2015 01:58:35 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 301016196 records, 0 subrecords, /dev/xvdc 52%, /dev/xvdb 52%
Sep 25 2015 01:58:37 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 300987484 records, 0 subrecords, /dev/xvdc 52%, /dev/xvdb 52%
Sep 25 2015 01:58:39 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 300959039 records, 0 subrecords, /dev/xvdc 52%, /dev/xvdb 52%
Sep 25 2015 01:58:41 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 300930397 records, 0 subrecords, /dev/xvdc 52%, /dev/xvdb 52%
Sep 25 2015 01:58:43 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 300901588 records, 0 subrecords, /dev/xvdc 52%, /dev/xvdb 52%
Sep 25 2015 01:58:45 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 300872605 records, 0 subrecords, /dev/xvdc 52%, /dev/xvdb 52%
Sep 25 2015 01:58:47 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 300843908 records, 0 subrecords, /dev/xvdc 52%, /dev/xvdb 52%
Sep 25 2015 01:58:49 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 300815345 records, 0 subrecords, /dev/xvdc 52%, /dev/xvdb 52%
Sep 25 2015 01:58:51 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 300786802 records, 0 subrecords, /dev/xvdc 52%, /dev/xvdb 52%
...

I tried restarting again but the same thing continued happening.

Update: It eventually finished after 4.5 hours. That is far more than the ~40 minutes I was expecting. Is this normal, and is there a way to improve it?

Sep 25 2015 02:04:28 GMT: INFO (namespace): (namespace_cold.c::101) ns db beginning COLD start
Sep 25 2015 02:04:28 GMT: INFO (drv_ssd): (drv_ssd.c::3944) usable device size must be header size 1048576 + multiple of 1048576, rounding down
Sep 25 2015 02:04:28 GMT: INFO (drv_ssd): (drv_ssd.c::4055) opened device /dev/xvdc: usable size 800164151296, io-min-size 512
Sep 25 2015 02:04:28 GMT: INFO (drv_ssd): (drv_ssd.c::3944) usable device size must be header size 1048576 + multiple of 1048576, rounding down
Sep 25 2015 02:04:28 GMT: INFO (drv_ssd): (drv_ssd.c::4055) opened device /dev/xvdb: usable size 800164151296, io-min-size 512
Sep 25 2015 02:04:28 GMT: INFO (drv_ssd): (drv_ssd.c::1086) /dev/xvdc has 1526192 wblocks of size 524288
Sep 25 2015 02:04:28 GMT: INFO (drv_ssd): (drv_ssd.c::1086) /dev/xvdb has 1526192 wblocks of size 524288
Sep 25 2015 02:04:28 GMT: INFO (drv_ssd): (drv_ssd.c::3589) device /dev/xvdc: reading device to load index
Sep 25 2015 02:04:28 GMT: INFO (drv_ssd): (drv_ssd.c::3589) device /dev/xvdb: reading device to load index
Sep 25 2015 02:04:28 GMT: INFO (drv_ssd): (drv_ssd.c::3594) In TID 25088: Using arena #150 for loading data for namespace "db"
Sep 25 2015 02:04:28 GMT: INFO (drv_ssd): (drv_ssd.c::3594) In TID 25089: Using arena #150 for loading data for namespace "db"
Sep 25 2015 02:04:30 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 846435 records, 0 subrecords, /dev/xvdc 0%, /dev/xvdb 0%
....
Sep 25 2015 06:29:05 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 301508635 records, 0 subrecords, /dev/xvdc 52%, /dev/xvdb 52%
Sep 25 2015 06:29:06 GMT: INFO (drv_ssd): (drv_ssd.c::3615) device /dev/xvdc: read complete: UNIQUE 229911509 (REPLACED 9074715) (GEN 77267853) (EXPIRED 2192692) (MAX-TTL 0) records
Sep 25 2015 06:29:07 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {db} loaded 301777179 records, 0 subrecords, /dev/xvdc 100%, /dev/xvdb 52%
Sep 25 2015 06:29:09 GMT: INFO (drv_ssd): (drv_ssd.c::3615) device /dev/xvdb: read complete: UNIQUE 230065022 (REPLACED 9063689) (GEN 77334062) (EXPIRED 2170835) (MAX-TTL 0) records
Sep 25 2015 06:29:09 GMT: INFO (drv_ssd): (drv_ssd.c::1051) ns db loading free & defrag queues 
Sep 25 2015 06:29:09 GMT: INFO (drv_ssd): (drv_ssd.c::985) /dev/xvdc init defrag profile: 441,1007,1478,1848,2238,2561,2870,3408,3880,4347,4777,5279,6080,6491,7005,7594,8336,9079,9776,10578,11205,11846,12770,13613,13982,15253,15747,16291,17064,17414,17833,18144,18558,18928,18625,18867,19139,18983,19038,18693,18456,18534,18167,17905,17634,17123,16545,16457,15603,14839
Sep 25 2015 06:29:09 GMT: INFO (drv_ssd): (drv_ssd.c::985) /dev/xvdb init defrag profile: 438,990,1463,1988,2301,2588,2882,3308,3787,4242,4862,5301,6057,6365,6985,7780,8265,9096,9750,10438,11004,12136,12864,13517,13628,14968,15438,16456,17008,17339,18041,18293,18421,18533,18922,18912,18958,19186,19037,18926,18492,18638,18210,17956,17427,17236,16627,16311,15686,14902
Sep 25 2015 06:29:10 GMT: INFO (drv_ssd): (drv_ssd.c::1075) /dev/xvdc init wblock free-q 717728, defrag-q 602329
Sep 25 2015 06:29:10 GMT: INFO (drv_ssd): (drv_ssd.c::1075) /dev/xvdb init wblock free-q 717746, defrag-q 601958
Sep 25 2015 06:29:10 GMT: INFO (drv_ssd): (drv_ssd.c::2768) ns db starting device maintenance threads

Thanks!


#2

Any thoughts? I live in constant fear of restarting my nodes now …


#3
  1. Are you using Enterprise or Community?

    This isn’t completely abnormal, but it can be improved drastically with the Fast Restart feature in our Enterprise version.

    The fact that it was “stuck” and eventually ended at 52% is interesting. Ending around 50% is normal: the percentage indicates how far into the disk we have read, and with high-water-disk-pct at 50 we would expect to read about this far, unless at some point you had loaded well past that water mark. While it was “stuck”, the number of loaded records was actually decreasing, which indicates that Aerospike had stopped loading and was evicting. A normal coldstart does not need to do much eviction, but if you shut down while eviction is behind, the added rounds of eviction during coldstart can significantly hurt coldstart time.

    That said, our coldstart is limited by the read performance of the underlying hardware. What model SSDs are you using, and what size are they?

  2. Also could you share your /etc/aerospike/aerospike.conf?
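To quantify the eviction behavior described above: in the log snippet from the original post, the loaded-record count drops between consecutive two-second ticks, so eviction is outpacing loading during that window. A rough rate estimate (just arithmetic on the posted log values, not an Aerospike tool):

```python
# Counts from the log snippet, timestamps 01:58:21 -> 01:58:51 (30 seconds).
first_count = 301_216_756  # records loaded at start of snippet
last_count = 300_786_802   # records loaded at end of snippet
elapsed_s = 30

net_change_per_s = (last_count - first_count) / elapsed_s
# Negative: records are being evicted faster than they are loaded.
print(f"net change: {net_change_per_s:.0f} records/sec")
# prints: net change: -14332 records/sec
```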


#4

@kporter,

This is with the Community version on AWS EC2 i2.2xlarge instances. Interestingly, the disks were only about 15-20% full at the time. We do use TTLs on all sets, so it is possible there was a lot of eviction to be done.

Here’s the config:

service {
        user root
        group root
        paxos-single-replica-limit 1 
        pidfile /var/run/aerospike/asd.pid
        service-threads 8
        transaction-queues 8
        transaction-threads-per-queue 8
        proto-fd-max 15000
}

logging {
        # Log file must be an absolute path.
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}

network {
        service {
                address any
                port 3000
        }
        heartbeat {
                mode mesh
                port 3002

                mesh-seed-address-port 10.158.130.242 3002
                mesh-seed-address-port 10.155.175.218 3002
                .... #total 20 nodes

                interval 150
                timeout 20
        }

        fabric {
                port 3001
        }

        info {
                port 3003
        }
}

namespace biddingdb {
        replication-factor 2
        memory-size 54G
        storage-engine device {
                device /dev/xvdc
                device /dev/xvdb
                write-block-size 512K
        }
}

#5

This configuration would significantly benefit from Fast Restart.

To reduce the number of eviction rounds needed during coldstart, you could increase evict-tenths-pct. Increasing it from 5 to 10 would halve the number of eviction rounds, but each round then evicts 1.0% of records instead of 0.5%, so you could end up with up to 0.5% less data.
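The factor-of-2 arithmetic can be sketched as follows (the 5% excess figure is purely illustrative; evict-tenths-pct is expressed in tenths of a percent of records per eviction round):

```python
import math

def eviction_rounds(excess_tenths_pct: int, evict_tenths_pct: int) -> int:
    """Rounds needed to evict an excess of `excess_tenths_pct` (tenths of a
    percent of records) when each round evicts `evict_tenths_pct`."""
    return math.ceil(excess_tenths_pct / evict_tenths_pct)

# Suppose coldstart must work off 5% excess data, i.e. 50 tenths of a percent:
print(eviction_rounds(50, 5))   # 10 rounds at the default 0.5% per round
print(eviction_rounds(50, 10))  # 5 rounds at 1.0% per round
```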

Alternatively, you could increase high-water-disk-pct in the config before the coldstart to prevent these cycles, and then at runtime use:

asadm -e "asinfo -v 'set-config:context=namespace;id=NAMESPACE_NAME;high-water-disk-pct=50'"

to set it back to the normal value.
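For example, the temporary override in /etc/aerospike/aerospike.conf might look like this (the value 80 is illustrative; any value comfortably above current disk usage works):

```
namespace NAMESPACE_NAME {
        ...
        # Temporarily raised so coldstart does not trigger eviction cycles;
        # set back to 50 at runtime via set-config once the node is up.
        high-water-disk-pct 80
        ...
}
```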


#6

Finally got into this situation again and got to try it, and it worked! I wasn’t sure which high-water mark had been crossed, so I raised both in the config:

 high-water-memory-pct 80
 high-water-disk-pct 80

Then after startup I set them back to the defaults:

asinfo -v 'set-config:context=namespace;id=biddingdb;high-water-disk-pct=50'
asinfo -v 'set-config:context=namespace;id=biddingdb;high-water-memory-pct=60'

Thanks!