Observing steady increase timeouts from aerospike server

error

#1

I am having setup of 5 node Aerospike server (Community server 3.7.3 and C client 4.0.0 with libev) with replica factor of 2. I am using default timeout of 1000ms (i.e. 1 sec on client side). These 5 nodes run inside VM with 4 vCPU with hardly any CPU utilization (~25-30% of CPU utilization and system load reported by top at 0.2). I am only having in-memory data (no SSD).

Server configuration –

   service {
                user root
                group root
            paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
            pidfile /var/run/aerospike/asd.pid
            service-threads 4
            transaction-queues 4
            transaction-threads-per-queue 4
            proto-fd-max 15000
    }

    logging {
            # Log file must be an absolute path.
            file /var/log/aerospike/aerospike.log {
                    context any info
            }
    }

    network {
        service {
                address any
                port 3000
        }

        heartbeat {
                #mode multicast
                #address 239.1.99.223
                #port 9918
                mode mesh
                port 3002

                mesh-seed-address-port 172.16.0.43 3002
                mesh-seed-address-port 172.16.0.91 3002
                mesh-seed-address-port 172.16.0.13 3002
                mesh-seed-address-port 172.16.0.41 3002
                mesh-seed-address-port 172.16.0.89 3002

                # To use unicast-mesh heartbeats, remove the 3 lines above, and see
                # aerospike_mesh.conf for alternative.

                interval 150
                timeout 10
        }

I am observing that timeouts reported by C client increases slowly (in the run of 20-30 min) and suddenly I start seeing huge number of timeouts. I am also observing bandwidth usage of cluster side interface increases. It starts from ~50 Mbps, increases to ~150 Mbps in about 5 min. If I keep the setup running for 15-20 min it reaches 750 Mbps. (all numbers are reported by nload utility)

Following is the snippet from log file –

When system is working fine (extremely few timeouts) –

Mar 10 2016 13:04:34 GMT: INFO (info): (hist.c::137) histogram dump: reads (0 total) msec
Mar 10 2016 13:04:34 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (47318 total) msec
Mar 10 2016 13:04:34 GMT: INFO (info): (hist.c::154)  (00: 0000045390) (01: 0000000990) (02: 0000000362) (03: 0000000292)
Mar 10 2016 13:04:34 GMT: INFO (info): (hist.c::154)  (04: 0000000157) (05: 0000000067) (06: 0000000036) (07: 0000000012)
Mar 10 2016 13:04:34 GMT: INFO (info): (hist.c::163)  (08: 0000000012)
Mar 10 2016 13:04:34 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Mar 10 2016 13:04:34 GMT: INFO (info): (hist.c::137) histogram dump: udf (51812 total) msec
Mar 10 2016 13:04:34 GMT: INFO (info): (hist.c::163)  (00: 0000051803) (01: 0000000009)
Mar 10 2016 13:04:34 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Mar 10 2016 13:04:34 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::4979)  system memory: free 3675032kb ( 90 percent free ) 
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::4985)  ClusterSize 5 ::: objects 108539 ::: sub_objects 0
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::4994)  rec refs 108539 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::4999)  replica errs :: null 0 non-null 0 ::: sync copy errs :: master 0 
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::5009)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: iq 0 ::: dq 0 : fds - proto (173, 322, 149) : hb (8, 8, 0) : fab (57, 57, 0)
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::5011)    heartbeat_received: self 0 : foreign 9701
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::5012)    heartbeat_stats: bt 0 bf 8861 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0 
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::5024)    tree_counts: nsup 0 scan 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::5041) {test} disk bytes used 0 : avail pct 100
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::5043) {test} memory bytes used 63921750 (index 6942720 : sindex 0 : data 56979030) : used pct 1.49
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::5091) {test} migrations - complete
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::5041) {statistics} disk bytes used 0 : avail pct 100
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::5043) {statistics} memory bytes used 38115 (index 576 : sindex 0 : data 37539) : used pct 0.00
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::5091) {statistics} migrations - complete
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::5041) {debug_statistics} disk bytes used 0 : avail pct 100
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::5043) {debug_statistics} memory bytes used 70909 (index 3200 : sindex 0 : data 67709) : used pct 0.00
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::5091) {debug_statistics} migrations - complete
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::5041) {bar} disk bytes used 0 : avail pct 100
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::5043) {bar} memory bytes used 0 (index 0 : sindex 0 : data 0) : used pct 0.00
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::5091) {bar} migrations - complete
Mar 10 2016 13:04:44 GMT: INFO (info): (thr_info.c::5098)    partitions: actual 3396 sync 3324 desync 0 zombie 0 absent 9664
Mar 10 2016 13:04:44 GMT: INFO (info): (hist.c::137) histogram dump: reads (0 total) msec
Mar 10 2016 13:04:44 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (51037 total) msec
Mar 10 2016 13:04:44 GMT: INFO (info): (hist.c::154)  (00: 0000048914) (01: 0000001122) (02: 0000000385) (03: 0000000298)
Mar 10 2016 13:04:44 GMT: INFO (info): (hist.c::154)  (04: 0000000171) (05: 0000000082) (06: 0000000041) (07: 0000000012)
Mar 10 2016 13:04:44 GMT: INFO (info): (hist.c::163)  (08: 0000000012)
Mar 10 2016 13:04:44 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Mar 10 2016 13:04:44 GMT: INFO (info): (hist.c::137) histogram dump: udf (55802 total) msec
Mar 10 2016 13:04:44 GMT: INFO (info): (hist.c::163)  (00: 0000055792) (01: 0000000010)
Mar 10 2016 13:04:44 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Mar 10 2016 13:04:44 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count


**When system is not working as per expectation (large number of timeouts) --**

`Mar 10 2016 12:55:42 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::4979)  system memory: free 2684656kb ( 66 percent free ) 
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::4985)  ClusterSize 5 ::: objects 1581008 ::: sub_objects 0
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::4994)  rec refs 1581008 ::: rec locks 0 ::: trees 0 ::: wr reqs 114 ::: mig tx 0 ::: mig rx 0
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::4999)  replica errs :: null 0 non-null 0 ::: sync copy errs :: master 0 
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::5009)    trans_in_progress: wr 114 prox 0 wait 145 ::: q 0 ::: iq 0 ::: dq 0 : fds - proto (505, 32850, 32345) : hb (8, 8, 0) : fab (72, 72, 0)
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::5011)    heartbeat_received: self 0 : foreign 111742
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::5012)    heartbeat_stats: bt 0 bf 109416 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0 
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::5024)    tree_counts: nsup 0 scan 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 114
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::5041) {test} disk bytes used 0 : avail pct 100
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::5043) {test} memory bytes used 934663204 (index 101180736 : sindex 0 : data 833482468) : used pct 21.76
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::5091) {test} migrations - complete
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::5041) {statistics} disk bytes used 0 : avail pct 100
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::5043) {statistics} memory bytes used 38115 (index 576 : sindex 0 : data 37539) : used pct 0.00
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::5091) {statistics} migrations - complete
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::5041) {debug_statistics} disk bytes used 0 : avail pct 100
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::5043) {debug_statistics} memory bytes used 71320 (index 3200 : sindex 0 : data 68120) : used pct 0.00
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::5091) {debug_statistics} migrations - complete
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::5041) {bar} disk bytes used 0 : avail pct 100
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::5043) {bar} memory bytes used 0 (index 0 : sindex 0 : data 0) : used pct 0.00
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::5091) {bar} migrations - complete
Mar 10 2016 12:55:52 GMT: INFO (info): (thr_info.c::5098)    partitions: actual 3396 sync 3324 desync 0 zombie 0 absent 9664
Mar 10 2016 12:55:52 GMT: INFO (info): (hist.c::137) histogram dump: reads (0 total) msec
Mar 10 2016 12:55:52 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (758706 total) msec
Mar 10 2016 12:55:52 GMT: INFO (info): (hist.c::154)  (00: 0000200800) (01: 0000124163) (02: 0000099399) (03: 0000060402)
Mar 10 2016 12:55:52 GMT: INFO (info): (hist.c::154)  (04: 0000032839) (05: 0000027102) (06: 0000031267) (07: 0000033507)
Mar 10 2016 12:55:52 GMT: INFO (info): (hist.c::154)  (08: 0000050246) (09: 0000053600) (10: 0000038620) (11: 0000006761)
Mar 10 2016 12:55:52 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Mar 10 2016 12:55:52 GMT: INFO (info): (hist.c::137) histogram dump: udf (813851 total) msec
Mar 10 2016 12:55:52 GMT: INFO (info): (hist.c::163)  (00: 0000813560) (01: 0000000280) (02: 0000000011)
Mar 10 2016 12:55:52 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Mar 10 2016 12:55:52 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count

Moreover on client side I am also seeing udf timeouts, where as logs shows none of UDF execution is taking more than 4ms. Which would mean most likely udf request is spending too much time waiting somewhere (probably some kind of request queue)

Can someone provide any clue ?