Aerospike replica errors and strange asadm output on cluster status


#1

Dear Aerospike community & support,

Running `asadm -h 172.18.106.148` returns confusing output that makes it look as if the cluster is broken.

Your documentation for Aerospike versions older than 3.10 clearly states that on nodes with multiple NICs, with `address any` in the network/service section, it is recommended to set the `access-address` parameter so that only one IP is published for clients to connect to. Our cluster currently does not have `access-address` configured, but it does have `address any` set in aerospike.conf.

Given the above, could this be the cause of the confusing output from `asadm -h 172.18.106.148`?

This is the network configuration section from aerospike.conf:

network {
    service {
        #address 172.18.106.148    <= this entry was previously used / incorrect listening IP
        address any
        port 3000
        #access-address 172.18.106.148   <= this entry was previously used / incorrect published IP
    }
    heartbeat {
        mode mesh
        port 3002
        address 172.18.106.148

        # Cluster nodes go here, one entry per line. You do not need
        # to exclude the "current" host you're on, i.e. the whole set
        # of cluster nodes may be listed here.
        mesh-seed-address-port 172.18.106.147 3002
        mesh-seed-address-port 172.18.106.148 3002
        mesh-seed-address-port 172.18.106.149 3002

        interval 150
        timeout 10
    }
    fabric {
        port 3001
    }
    info {
        port 3003
    }
}

In short, for all three nodes:

  • node 1: service / access-address should be 172.19.107.237; heartbeat / address is 172.18.106.147
  • node 2 (principal node): service / access-address should be 172.19.107.238; heartbeat / address is 172.18.106.148
  • node 3: service / access-address should be 172.19.107.239; heartbeat / address is 172.18.106.149

Is it only recommended, or is it required, to set service / access-address when service / address is set to any?
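For clarity, this is roughly what we believe the service stanza on node 1 should look like with access-address set (a sketch only, with the addresses taken from the list above; please correct us if this is wrong):

```
network {
    service {
        address any
        port 3000
        access-address 172.19.107.237
    }
    # heartbeat / fabric / info sections unchanged
}
```

If I understand correctly, after restarting we could then check what each node actually publishes with `asinfo -v service -h <node-ip>`.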

The second issue, observed in the logs of the same cluster, is a replica error:

May 18 2017 07:52:33 GMT: INFO (info): (thr_info.c::4818) replica errs :: null 0 non-null 0 ::: sync copy errs :: node 42949672960 :: master 0

May 18 2017 07:52:13 GMT: INFO (info): (hist.c::137) histogram dump: proxy (19758 total) msec
May 18 2017 07:52:13 GMT: INFO (info): (hist.c::154)  (00: 0000017611) (01: 0000001053) (02: 0000001006) (03: 0000000080)
May 18 2017 07:52:13 GMT: INFO (info): (hist.c::163)  (04: 0000000003) (08: 0000000005)
May 18 2017 07:52:13 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (153842 total) msec
May 18 2017 07:52:13 GMT: INFO (info): (hist.c::154)  (00: 0000145959) (01: 0000003811) (02: 0000003950) (03: 0000000120)
May 18 2017 07:52:13 GMT: INFO (info): (hist.c::163)  (04: 0000000001) (05: 0000000001)
May 18 2017 07:52:13 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
May 18 2017 07:52:13 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
May 18 2017 07:52:13 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
May 18 2017 07:52:19 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /opt/aerospike/data/avn-psm.dat: used 2556646400, contig-free 38434M (38434 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 328 (0.0/s), defrag-q 0 defrag-tot 329 (0.0/s)
May 18 2017 07:52:23 GMT: INFO (info): (thr_info.c::4796)  system memory: free 5468760kb ( 67 percent free )
May 18 2017 07:52:23 GMT: INFO (info): (thr_info.c::4804)  migrates in progress ( 0 , 0 ) ::: ClusterSize 3 ::: objects 3178908 ::: sub_objects 0
May 18 2017 07:52:23 GMT: INFO (info): (thr_info.c::4812)  rec refs 3178908 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
May 18 2017 07:52:23 GMT: INFO (info): (thr_info.c::4818)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 42949672960 :: master 0
May 18 2017 07:52:23 GMT: INFO (info): (thr_info.c::4828)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 777 ::: dq 0 : fds - proto (7206, 3800553, 3793347) : hb (4, 4, 0) : fab (44, 44, 0)
May 18 2017 07:52:23 GMT: INFO (info): (thr_info.c::4830)    heartbeat_received: self 0 : foreign 2232724
May 18 2017 07:52:23 GMT: INFO (info): (thr_info.c::4831)    heartbeat_stats: bt 0 bf 1900191 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0
May 18 2017 07:52:23 GMT: INFO (info): (thr_info.c::4844)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
May 18 2017 07:52:23 GMT: INFO (info): (thr_info.c::4860) namespace avn-psm: disk inuse: 2556647424 memory inuse: 2033906720 (bytes) sindex memory inuse: 63025440 (bytes) avail pct 93
May 18 2017 07:52:23 GMT: INFO (info): (thr_info.c::4869) namespace avn-psm: ldt_gc: cnt 0 io 0 gc 0 (0, 0, 0)
May 18 2017 07:52:23 GMT: INFO (info): (thr_info.c::4905)    partitions: actual 1365 sync 1394 desync 0 zombie 0 wait 0 absent 1337
May 18 2017 07:52:23 GMT: INFO (info): (hist.c::137) histogram dump: reads (76012 total) msec
May 18 2017 07:52:23 GMT: INFO (info): (hist.c::154)  (00: 0000075988) (01: 0000000008) (02: 0000000010) (03: 0000000004)
May 18 2017 07:52:23 GMT: INFO (info): (hist.c::163)  (04: 0000000002)
May 18 2017 07:52:23 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (109043 total) msec
May 18 2017 07:52:23 GMT: INFO (info): (hist.c::154)  (00: 0000102898) (01: 0000002994) (02: 0000003054) (03: 0000000095)
May 18 2017 07:52:23 GMT: INFO (info): (hist.c::163)  (04: 0000000001) (05: 0000000001)
May 18 2017 07:52:23 GMT: INFO (info): (hist.c::137) histogram dump: proxy (19759 total) msec
May 18 2017 07:52:23 GMT: INFO (info): (hist.c::154)  (00: 0000017612) (01: 0000001053) (02: 0000001006) (03: 0000000080)
May 18 2017 07:52:23 GMT: INFO (info): (hist.c::163)  (04: 0000000003) (08: 0000000005)
May 18 2017 07:52:23 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (153849 total) msec
May 18 2017 07:52:23 GMT: INFO (info): (hist.c::154)  (00: 0000145965) (01: 0000003812) (02: 0000003950) (03: 0000000120)
May 18 2017 07:52:23 GMT: INFO (info): (hist.c::163)  (04: 0000000001) (05: 0000000001)
May 18 2017 07:52:23 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
May 18 2017 07:52:23 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
May 18 2017 07:52:23 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
May 18 2017 07:52:26 GMT: INFO (nsup): (thr_nsup.c::1319) {avn-psm} nsup start
May 18 2017 07:52:27 GMT: INFO (nsup): (thr_nsup.c::1237) {avn-psm} Records: 1574159, 1574159 0-vt, 0(0) expired, 0(0) evicted, 0(0) set deletes, 0(0) set evicted. Evict ttls: 0,0,0.000. Waits: 0,0,0. Total time: 1054 ms
May 18 2017 07:52:33 GMT: INFO (info): (thr_info.c::4796)  system memory: free 5468756kb ( 67 percent free )
May 18 2017 07:52:33 GMT: INFO (info): (thr_info.c::4804)  migrates in progress ( 0 , 0 ) ::: ClusterSize 3 ::: objects 3178912 ::: sub_objects 0
May 18 2017 07:52:33 GMT: INFO (info): (thr_info.c::4812)  rec refs 3178912 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
May 18 2017 07:52:33 GMT: INFO (info): (thr_info.c::4818)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 42949672960 :: master 0
May 18 2017 07:52:33 GMT: INFO (info): (thr_info.c::4828)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 258 ::: dq 0 : fds - proto (7249, 3800672, 3793423) : hb (4, 4, 0) : fab (44, 44, 0)
May 18 2017 07:52:33 GMT: INFO (info): (thr_info.c::4830)    heartbeat_received: self 0 : foreign 2232988
May 18 2017 07:52:33 GMT: INFO (info): (thr_info.c::4831)    heartbeat_stats: bt 0 bf 1900417 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0
May 18 2017 07:52:33 GMT: INFO (info): (thr_info.c::4844)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
May 18 2017 07:52:33 GMT: INFO (info): (thr_info.c::4860) namespace avn-psm: disk inuse: 2556649344 memory inuse: 2033908034 (bytes) sindex memory inuse: 63025440 (bytes) avail pct 93
May 18 2017 07:52:33 GMT: INFO (info): (thr_info.c::4869) namespace avn-psm: ldt_gc: cnt 0 io 0 gc 0 (0, 0, 0)
May 18 2017 07:52:33 GMT: INFO (info): (thr_info.c::4905)    partitions: actual 1365 sync 1394 desync 0 zombie 0 wait 0 absent 1337
May 18 2017 07:52:33 GMT: INFO (info): (hist.c::137) histogram dump: reads (76024 total) msec
May 18 2017 07:52:33 GMT: INFO (info): (hist.c::154)  (00: 0000076000) (01: 0000000008) (02: 0000000010) (03: 0000000004)
May 18 2017 07:52:33 GMT: INFO (info): (hist.c::163)  (04: 0000000002)
May 18 2017 07:52:33 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (109051 total) msec
May 18 2017 07:52:33 GMT: INFO (info): (hist.c::154)  (00: 0000102906) (01: 0000002994) (02: 0000003054) (03: 0000000095)
May 18 2017 07:52:33 GMT: INFO (info): (hist.c::163)  (04: 0000000001) (05: 0000000001)
May 18 2017 07:52:33 GMT: INFO (info): (hist.c::137) histogram dump: proxy (19761 total) msec
May 18 2017 07:52:33 GMT: INFO (info): (hist.c::154)  (00: 0000017614) (01: 0000001053) (02: 0000001006) (03: 0000000080)
May 18 2017 07:52:33 GMT: INFO (info): (hist.c::163)  (04: 0000000003) (08: 0000000005)
May 18 2017 07:52:33 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (153861 total) msec
May 18 2017 07:52:33 GMT: INFO (info): (hist.c::154)  (00: 0000145977) (01: 0000003812) (02: 0000003950) (03: 0000000120)
May 18 2017 07:52:33 GMT: INFO (info): (hist.c::163)  (04: 0000000001) (05: 0000000001)
May 18 2017 07:52:33 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
May 18 2017 07:52:33 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
May 18 2017 07:52:33 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
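Out of curiosity I also looked at the node value in the "sync copy errs" line. My assumption is that it is a 64-bit node ID (which, as far as I understand, should mix port bits and MAC-address bytes); converting it to hex shows it is almost entirely zeros, which looks malformed to me:

```python
# Assumption: the "node" value from the log line is a 64-bit node ID.
node = 42949672960          # value from "sync copy errs :: node 42949672960"
print(hex(node))            # prints 0xa00000000
print(node == 10 * 2**32)   # prints True: only a single small value in the high bits
```

A well-formed ID would normally not be this sparse, so this may point at a bogus node ID rather than a real peer.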

From the above I see the cluster size is 3 and there is sufficient memory. All communication ports are open. We have already tried rebooting the principal node once and it did not help. As you can see, the load average is quite high, even since the reboot: load average: 6.90, 6.72, 6.69

Please advise on how to proceed. Am I even reading the logs correctly? Any help is appreciated.

Marko Vrgotic ActiveVideo


#2

3.5.15 is nearly two years old. I suggest looking through the release notes to determine whether any of the known issues match yours.

I also suggest upgrading. There has been a lot of work on the clustering algorithms since 3.5.15, which will likely resolve that particular issue.


#3

Dear KPorter,

Thank you for the advice. I could not agree more, but due to our software hotfix/service-release cycle we are not able to simply execute an upgrade. Anyway, that is our problem.

I believe I did find fixes for the reported observations as early as 3.6.0 and 3.7.1, which is good. One workaround suggested by Aerospike for the issue of high load increasing replica errors is to reboot the principal node so that the partitions are redistributed and the cluster converges again, which is acceptable at the moment.

Still, I would like to ask the following:

  • How can I trace when the issue started?
  • Which service is in charge of reporting a lost or recovered member?
  • Can I reproduce this behavior, and how? Any suggestions?
  • My main concern is that I am not sure whether the bug fixes we see below are indeed related.
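For the first question, I imagine a filter over the server log could narrow down when the membership last changed. A sketch (the exact log phrasing varies by version, so the pattern below is a guess to be adjusted):

```shell
# Hypothetical sketch: keep only clustering-related lines from an
# Aerospike server log read on stdin. Pattern words are assumptions
# about the log phrasing and may need adjusting per version.
filter_cluster_events() {
  grep -iE "paxos|clustersize|cluster size|departed|arrived"
}

# Typical usage against the live log (default path, adjust as needed):
# filter_cluster_events < /var/log/aerospike/aerospike.log
```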

The fixes I found, which most likely are related to the issues we observe:

  • (KVS) [AER-4026] - Fix incorrect processing of interface name resulting in bogus nodeid generated.
  • (KVS) [AER-3968] - Fix negative heartbeat accounting.
  • (KVS) [AER-3993], [AER-3897] - Fix crash caused by invalid message operations.
  • (KVS) [AER-4237] - Fix for err_sync_copy_null_node.

Thank you. Kindly awaiting your reply.


#4

Sorry about your release-cycle situation; however, please understand that archaeology on ancient versions can be very time consuming and typically isn't mutually beneficial when the issue has been previously discovered.

If you have an Enterprise support contract then I suggest opening a support ticket.