Aerospike slow write/batch-read performance

I am using Community Edition v3.12.1.2 with 14 nodes, each with 64 GB RAM and 24 cores. My server config:

service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    pidfile /var/run/aerospike/asd.pid
    #service-threads 24
    #transaction-queues 24
    transaction-threads-per-queue 4
    proto-fd-max 30000
    transaction-pending-limit 20
    auto-pin cpu
}

logging {
    # Log file must be an absolute path.
    file /var/log/aerospike/aerospike.log {
            context any info
    }
}

network {
    service {
            address bond0
            port 3000
            access-address bond0
    }

    heartbeat {
            # mode multicast
            # address 224.0.0.116 # 239.1.99.222
            # port 9918

            mode mesh
            port 3002

            mesh-seed-address-port 10.56.101.184 3002 # a1
            # followed by 13 more machines

            # mesh-port 3002

            # To use unicast-mesh heartbeats, remove the 3 lines above, and see
            # aerospike_mesh.conf for alternative.

            interval 150
            timeout 10
    }

    fabric {
            address bond0
            port 3001
    }

    info {
            port 3003
    }
}

namespace test {
    replication-factor 2
    memory-size 1G
    default-ttl 30d # 30 days, use 0 to never expire/evict.

    storage-engine memory
}


namespace production {
    replication-factor 2
    memory-size 59G
    default-ttl 0 # use 0 to never expire/evict.
    high-water-disk-pct 50   # How full may the disk become before the
                             # server begins eviction (expiring records
                             # early)
    high-water-memory-pct 85 # How full may the memory become before the
                             # server begins eviction (expiring records
                             # early)
    stop-writes-pct 90       # How full may the memory become before
                             # we disallow new writes

    # storage-engine memory
    storage-engine device {
            # device /dev/sdb1
            # data-in-memory false

            file /opt/aerospike/data/production.data
            filesize 1000G # 8 times of RAM
            data-in-memory true

            # write-block-size 128K  # adjust block size to make it efficient for SSDs;
                                     # should be at least the largest size of any object
    }
}

Server settings:

$ asinfo -h localhost -v "get-config:context=service"

paxos-single-replica-limit=1;pidfile=/var/run/aerospike/asd.pid;proto-fd-max=30000;advertise-ipv6=false;batch-threads=24;batch-max-buffers-per-queue=255;batch-max-requests=5000;batch-max-unused-buffers=256;batch-priority=200;batch-index-threads=4;clock-skew-max-ms=1000;cluster-name=null;enable-benchmarks-fabric=false;enable-benchmarks-svc=false;enable-hist-info=false;hist-track-back=300;hist-track-slice=10;hist-track-thresholds=null;info-threads=16;ldt-benchmarks=false;log-local-time=false;migrate-max-num-incoming=4;migrate-threads=1;node-id-interface=null;nsup-delete-sleep=100;nsup-period=120;nsup-startup-evict=true;paxos-max-cluster-size=32;paxos-protocol=v3;paxos-recovery-policy=auto-reset-master;paxos-retransmit-period=5;proto-fd-idle-ms=60000;proto-slow-netio-sleep-ms=1;query-batch-size=100;query-buf-size=2097152;query-bufpool-size=256;query-in-transaction-thread=false;query-long-q-max-size=500;query-microbenchmark=false;query-pre-reserve-partitions=false;query-priority=10;query-priority-sleep-us=1;query-rec-count-bound=18446744073709551615;query-req-in-query-thread=false;query-req-max-inflight=100;query-short-q-max-size=500;query-threads=6;query-threshold=10;query-untracked-time-ms=1000;query-worker-threads=15;respond-client-on-master-completion=false;run-as-daemon=true;scan-max-active=100;scan-max-done=100;scan-max-udf-transactions=32;scan-threads=4;service-threads=24;sindex-builder-threads=4;sindex-gc-enable-histogram=false;ticker-interval=10;transaction-max-ms=1000;transaction-pending-limit=0;transaction-queues=4;transaction-repeatable-read=false;transaction-retry-ms=1002;transaction-threads-per-queue=4;work-directory=/opt/aerospike;write-duplicate-resolution-disable=false;fabric-dump-msgs=false;max-msgs-per-type=-1;memory-accounting=false;prole-extra-ttl=0;non-master-sets-delete=false

I changed transaction-pending-limit to 0 because I was getting a lot of HOT KEY errors.
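
(For reference, I believe this setting can also be changed at runtime without a restart, with something like

$ asinfo -v "set-config:context=service;transaction-pending-limit=0"

if I recall the set-config syntax correctly; the default is 20.)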

Aerospike Log with enable-benchmarks-write=true:

Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:170) NODE-ID bb9e047fb902500 CLUSTER-SIZE 14
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:253)    system-memory: free-kbytes 4714768 free-pct 7 heap-kbytes (57819204,61686548,64284672) heap-efficiency-pct 89.9
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:267)    in-progress: tsvc-q 0 info-q 0 nsup-delete-q 0 rw-hash 1 proxy-hash 0 tree-gc-q 0
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:289)    fds: proto (1861,206933439,206931578) heartbeat (13,351,338) fabric (312,1395,1083)
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:298)    heartbeat-received: self 0 foreign 154602263
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:328)    fabric-bytes-per-second: bulk (0,0) ctrl (0,0) meta (0,0) rw (1971357,1929931)
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:352)    early-fail: demarshal 0 tsvc-client 14 tsvc-batch-sub 0 tsvc-udf-sub 0
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:369)    batch-index: batches (712022,0,84)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:145) histogram dump: batch-index (712022 total) msec
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:162)  (00: 0000122597) (01: 0000557767) (02: 0000030778) (03: 0000000359)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:162)  (04: 0000000159) (05: 0000000088) (06: 0000000050) (07: 0000000044)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:171)  (08: 0000000047) (09: 0000000052) (10: 0000000081)
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:382) {test} objects: all 0 master 0 prole 0
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:443) {test} migrations: complete
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:462) {test} memory-usage: total-bytes 0 index-bytes 0 sindex-bytes 0 data-bytes 0 used-pct 0.00
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:382) {production} objects: all 259531214 master 131518951 prole 128012263
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:443) {production} migrations: complete
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:462) {production} memory-usage: total-bytes 49129298243 index-bytes 16609997696 sindex-bytes 10965855 data-bytes 32508334692 used-pct 77.55
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:492) {production} device-usage: used-bytes 73219840256 avail-pct 89
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:585) {production} client: tsvc (0,15290) proxy (505,0,9) read (2,0,0,0) write (111483291,12,2002) delete (0,0,0,0) udf (2424680314,7628358,55239) lang (189549,2422358178,2187475,2314437)
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:613) {production} batch-sub: tsvc (0,19540) proxy (114,0,0) read (180258099,0,0,67570463)
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:641) {production} scan: basic (4,32,0) aggr (0,0,0) udf-bg (0,0,0)
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:666) {production} query: basic (9222787,0) aggr (0,0) udf-bg (0,0)
Oct 12 2017 13:23:03 GMT: INFO (info): (ticker.c:735) {production} retransmits: migration 5727948 client-read 0 client-write (0,0) client-delete (0,0) client-udf (0,0) batch-sub 0 udf-sub (0,0) nsup 0
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:145) histogram dump: {production}-read (2 total) msec
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:171)  (00: 0000000002)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:145) histogram dump: {production}-write (111483303 total) msec
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:162)  (00: 0111163416) (01: 0000156529) (02: 0000055374) (03: 0000039086)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:162)  (04: 0000032302) (05: 0000016204) (06: 0000006511) (07: 0000003644)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:162)  (08: 0000003405) (09: 0000003169) (10: 0000002968) (11: 0000000653)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:171)  (12: 0000000041) (13: 0000000001)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:145) histogram dump: {production}-write-start (34190 total) msec
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:162)  (00: 0000033229) (01: 0000000405) (02: 0000000360) (03: 0000000169)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:171)  (04: 0000000025) (05: 0000000002)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:145) histogram dump: {production}-write-restart (0 total) msec
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:145) histogram dump: {production}-write-dup-res (0 total) msec
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:145) histogram dump: {production}-write-master (34190 total) msec
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:162)  (00: 0000034123) (01: 0000000004) (02: 0000000012) (03: 0000000015)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:162)  (04: 0000000014) (05: 0000000011) (06: 0000000004) (08: 0000000002)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:171)  (09: 0000000001) (11: 0000000004)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:145) histogram dump: {production}-write-repl-write (34172 total) msec
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:162)  (00: 0000034080) (01: 0000000006) (02: 0000000005) (03: 0000000008)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:162)  (04: 0000000006) (05: 0000000014) (06: 0000000011) (07: 0000000007)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:171)  (08: 0000000004) (09: 0000000015) (10: 0000000016)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:145) histogram dump: {production}-write-response (34172 total) msec
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:171)  (00: 0000034172)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:145) histogram dump: {production}-udf (2432308672 total) msec
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:162)  (00: 2399887009) (01: 0006768612) (02: 0006740875) (03: 0010336102)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:162)  (04: 0006929781) (05: 0000970993) (06: 0000232649) (07: 0000141963)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:162)  (08: 0000104393) (09: 0000085639) (10: 0000088441) (11: 0000019321)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:171)  (12: 0000002828) (13: 0000000066)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:145) histogram dump: {production}-query (9222787 total) msec
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:162)  (00: 0005279450) (01: 0003932800) (02: 0000004531) (03: 0000001490)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:162)  (04: 0000001653) (05: 0000001141) (06: 0000000362) (07: 0000000259)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:162)  (08: 0000000224) (09: 0000000240) (10: 0000000303) (11: 0000000303)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:171)  (12: 0000000025) (13: 0000000006)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:145) histogram dump: {production}-query-rec-count (9198773 total) count
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:162)  (01: 0000000142) (02: 0000000456) (03: 0000002995) (04: 0000015958)
Oct 12 2017 13:23:03 GMT: INFO (info): (hist.c:162)  (05: 0000276484) (06: 0002762003) (07: 0004584236) (08: 0001556499)

I am doing mostly UDFs and writes:

The batch reads graph seems incorrect: we are sending ~160-180K/s batch reads (not batch index), each with a batch size of 6-8.

asloglatency stats:

Write

$ asloglatency -h {production}-write
{production}-write
Oct 12 2017 13:17:33
               % > (ms)
slice-to (sec)      1      8     64    ops/sec
-------------- ------ ------ ------ ----------
13:17:43    10   3.34   0.35   0.00      575.7
13:17:53    10   4.13   0.49   0.04      525.5
13:18:03    10   5.63   0.97   0.35      692.6
13:18:13    10   6.24   0.25   0.05     5274.9
13:18:23    10   7.22   0.31   0.03     4861.1
13:18:33    10   6.02   0.23   0.01     4919.3
13:18:43    10   4.99   0.58   0.28     2721.2
13:18:53    10   2.58   0.30   0.04     4702.2
13:19:03    10   1.24   0.23   0.02     4445.3
13:19:13    10   2.11   0.39   0.15     4118.8
13:19:23    10   2.33   0.29   0.06     3332.7
13:19:33    10   2.82   0.26   0.01     4788.3
13:19:43    10   5.26   0.47   0.15     4391.4
13:19:53    10   9.24   0.30   0.02     3250.9
13:20:03    10   5.47   0.23   0.01     5391.2
13:20:13    10   1.65   0.22   0.02     5183.2
13:20:23    10   2.89   0.70   0.18     3563.4
13:20:33    10   2.27   0.71   0.06     4096.6
13:20:43    10   0.90   0.31   0.13     4155.1
13:20:53    10   1.95   0.30   0.12     4138.9
13:21:03    10   1.49   0.24   0.02     3078.7
13:21:13    10   0.80   0.23   0.07     5597.5
13:21:23    10   1.27   0.24   0.06     4628.2
13:21:33    10   1.69   0.23   0.05     3684.2
13:21:43    10   1.79   0.21   0.03     4623.0
13:21:53    10   2.17   0.17   0.01     5499.3
13:22:03    10   3.26   0.32   0.12     3607.9
13:22:13    10   0.75   0.20   0.02     4152.4
-------------- ------ ------ ------ ----------
     avg         3.27   0.35   0.08     3928.0
     max         9.24   0.97   0.35     5597.5

UDF:

$ asloglatency -N production -h udf
{production}-udf
Oct 12 2017 13:17:26
               % > (ms)
slice-to (sec)      1      8     64    ops/sec
-------------- ------ ------ ------ ----------
13:17:36    10   3.46   0.04   0.00      248.3
13:17:46    10   4.92   0.00   0.00      256.1
13:17:56    10   2.82   0.00   0.00      254.9
13:18:06    10   2.03   0.00   0.00      286.1
13:18:16    10   2.69   0.08   0.00      238.1
13:18:26    10   1.19   0.26   0.00      151.4
13:18:36    10   0.49   0.37   0.12       81.5
13:18:46    10   0.35   0.00   0.00       86.2
13:18:56    10   0.89   0.05   0.00      201.3
13:19:06    10   1.24   0.31   0.27      257.1
13:19:16    10   1.61   1.17   1.02      136.6
13:19:26    10   1.60   0.04   0.04      243.8
13:19:36    10   2.06   0.00   0.00      238.4
13:19:46    10   1.44   0.22   0.22      277.4
13:19:56    10   2.19   1.06   1.06      123.1
13:20:06    10   0.86   0.00   0.00      256.3
13:20:16    10   0.50   0.00   0.00      238.1
13:20:26    10   0.91   0.50   0.05      198.7
13:20:36    10   0.96   0.46   0.41      217.8
13:20:46    10   1.10   0.44   0.44      226.6
13:20:56    10   1.43   0.00   0.00      168.1
13:21:06    10   1.03   0.05   0.00      194.2
13:21:16    10   1.14   0.05   0.05      218.8
13:21:26    10   0.62   0.00   0.00      193.6
13:21:36    10   1.70   0.15   0.00      194.3
13:21:46    10   1.63   0.10   0.10      208.5
13:21:56    10   1.98   0.00   0.00      212.3
13:22:06    10   0.89   0.38   0.38      212.7
13:22:16    10   1.25   0.26   0.10      192.0
13:22:26    10   1.13   0.05   0.00      212.3
13:22:36    10   1.35   0.00   0.00      191.9
13:22:46    10   2.85   0.47   0.41      193.0
13:22:56    10   0.99   0.25   0.25      905.1

I have 14 client (Go) machines, each running 2 different processes: one doing batch reads (batch size ~6-8 each) and one doing writes. There are 3 separate machines (Go) for UDFs.
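
For context, each batch-read worker does roughly the following per request (a minimal sketch with placeholder namespace/set/key names, assuming aerospike-client-go v1, not our exact production code):

package main

import (
    "log"

    aero "github.com/aerospike/aerospike-client-go"
)

func main() {
    // Connect to one seed node; the client discovers the rest of the cluster.
    client, err := aero.NewClient("10.56.101.184", 3000)
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    // Build a batch of 6-8 keys (placeholder IDs).
    ids := []string{"u1", "u2", "u3", "u4", "u5", "u6"}
    keys := make([]*aero.Key, 0, len(ids))
    for _, id := range ids {
        key, err := aero.NewKey("production", "users", id) // hypothetical set name
        if err != nil {
            log.Fatal(err)
        }
        keys = append(keys, key)
    }

    // One batch-read round trip; records[i] is nil if keys[i] was not found.
    records, err := client.BatchGet(aero.NewBatchPolicy(), keys)
    if err != nil {
        log.Fatal(err)
    }
    for i, rec := range records {
        if rec != nil {
            log.Printf("%s -> %v", ids[i], rec.Bins)
        }
    }
}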

I am seeing high batch-read latency on the client machines. I assume my cluster should easily handle 50-80K/s writes along with 4-5K/s UDFs and 1-2K/s batch reads, but even at 5-10K/s writes the client-side batch-read latency exceeds ~100 ms. I am not able to identify the reason for the slow performance. I tried increasing batch-threads to 24 but saw no improvement :frowning:

Htop:

I have enabled logging on the Go client as well.

$ cat /var/log/supervisord-avatar.out* | grep '2017/10/13'

2017/10/13 08:08:32 Connection to address `10.124.117.88:3000` failed to establish with error: dial tcp 10.124.117.88:3000: i/o timeout
2017/10/13 08:08:32 Node BB90857FB902500 10.124.117.88:3000: dial tcp 10.124.117.88:3000: i/o timeout
2017/10/13 08:08:32 Connection to address `10.124.117.88:3000` failed to establish with error: dial tcp 10.124.117.88:3000: i/o timeout
2017/10/13 08:08:32 Node BB90857FB902500 10.124.117.88:3000: dial tcp 10.124.117.88:3000: i/o timeout
2017/10/13 08:08:33 Connection to address `10.124.117.88:3000` failed to establish with error: dial tcp 10.124.117.88:3000: i/o timeout
2017/10/13 08:08:33 Node BB90857FB902500 10.124.117.88:3000: dial tcp 10.124.117.88:3000: i/o timeout
2017/10/13 08:08:33 Connection to address `10.124.117.88:3000` failed to establish with error: dial tcp 10.124.117.88:3000: i/o timeout
2017/10/13 08:08:33 Node BB90857FB902500 10.124.117.88:3000: dial tcp 10.124.117.88:3000: i/o timeout
2017/10/13 07:19:51 Tending took 1.10061868s, while your requested ClientPolicy.TendInterval is 1s. Tends are slower than the interval, and may be falling behind the changes in the cluster.
2017/10/13 07:19:53 Node BB9A4083A7AC40C 10.57.115.42:3000: EOF
2017/10/13 07:19:53 Node BB9A4083A7AC40C 10.57.115.42:3000: EOF
2017/10/13 07:19:53 Node BB9A4083A7AC40C 10.57.115.42:3000: EOF
2017/10/13 07:19:53 Node BB9A4083A7AC40C 10.57.115.42:3000: EOF
2017/10/13 07:19:53 Node BB9A4083A7AC40C 10.57.115.42:3000: EOF
2017/10/13 07:19:53 Node BB9A4083A7AC40C 10.57.115.42:3000: EOF
2017/10/13 07:19:53 Node BB9A4083A7AC40C 10.57.115.42:3000: EOF
2017/10/13 07:19:53 Node BB9A4083A7AC40C 10.57.115.42:3000: EOF
2017/10/13 07:19:53 Node BB9A4083A7AC40C 10.57.115.42:3000: EOF
2017/10/13 07:19:53 Node BB9A4083A7AC40C 10.57.115.42:3000: EOF
2017/10/13 07:19:53 Node BB9A4083A7AC40C 10.57.115.42:3000: EOF
2017/10/13 07:19:54 Node BB9A4083A7AC40C 10.57.115.42:3000: EOF
2017/10/13 07:19:54 Node BB9A4083A7AC40C 10.57.115.42:3000: EOF
2017/10/13 07:19:54 Node BB9A4083A7AC40C 10.57.115.42:3000: EOF
2017/10/13 07:19:54 Node BB9A4083A7AC40C 10.57.115.42:3000: EOF
2017/10/13 07:19:54 Node BB9A4083A7AC40C 10.57.115.42:3000: EOF
2017/10/13 07:31:42 Tending took 1.257162318s, while your requested ClientPolicy.TendInterval is 1s. Tends are slower than the interval, and may be falling behind the changes in the cluster.
2017/10/13 07:41:52 Tending took 1.282660632s, while your requested ClientPolicy.TendInterval is 1s. Tends are slower than the interval, and may be falling behind the changes in the cluster.
2017/10/13 07:04:41 No connections available; seeding...
2017/10/13 07:04:41 Seeding the cluster. Seeds count: 10
2017/10/13 07:04:41 Node BB97CD1657AC40C partition generation -2 changed to 2462
2017/10/13 07:04:41 Node BB9E0D0657AC40C partition generation -2 changed to 2757
2017/10/13 07:04:41 Tend finished. Live node count changes from 0 to 7
2017/10/13 07:04:41 Node BB9D4D0657AC40C partition generation -2 changed to 1234
2017/10/13 07:04:41 Node BB97858FB902500 partition generation -2 changed to 3747
2017/10/13 07:04:41 Node BB90857FB902500 partition generation -2 changed to 1268
2017/10/13 07:04:41 Node BB930D0657AC40C partition generation -2 changed to 3280
2017/10/13 07:04:41 Node BB940BFFA902500 partition generation -2 changed to 3593
2017/10/13 07:04:41 Tend finished. Live node count changes from 7 to 14
2017/10/13 07:04:41 Node BB9F0D0657AC40C partition generation -2 changed to 2859
2017/10/13 07:04:41 Node BB9E047FB902500 partition generation -2 changed to 3206
2017/10/13 07:04:41 Node BB9A4083A7AC40C partition generation -2 changed to 2856
2017/10/13 07:04:41 Node BB97CC4FA902500 partition generation -2 changed to 2886
2017/10/13 07:04:41 Node BB9E88A637AC40C partition generation -2 changed to 3207
2017/10/13 07:04:41 Node BB95CD5FA902500 partition generation -2 changed to 3755
2017/10/13 07:04:41 Node BB99C083A7AC40C partition generation -2 changed to 1811
2017/10/13 07:04:44 Connection to address `10.57.115.91:3000` failed to establish with error: dial tcp 10.57.115.91:3000: getsockopt: no route to host
2017/10/13 07:04:44 Seed 10.57.115.91:3000 failed: dial tcp 10.57.115.91:3000: getsockopt: no route to host
2017/10/13 07:06:32 No connections available; seeding...
2017/10/13 07:06:32 Seeding the cluster. Seeds count: 10
2017/10/13 07:06:32 Node BB99C083A7AC40C partition generation -2 changed to 1811
2017/10/13 07:06:32 Node BB940BFFA902500 partition generation -2 changed to 3593
2017/10/13 07:06:32 Node BB930D0657AC40C partition generation -2 changed to 3280
2017/10/13 07:06:32 Node BB9D4D0657AC40C partition generation -2 changed to 1234
2017/10/13 07:06:32 Node BB97858FB902500 partition generation -2 changed to 3747
2017/10/13 07:06:32 Node BB90857FB902500 partition generation -2 changed to 1268
2017/10/13 07:06:32 Tend finished. Live node count changes from 0 to 6
2017/10/13 07:06:32 Tend finished. Live node count changes from 6 to 14
2017/10/13 07:06:32 Node BB9A4083A7AC40C partition generation -2 changed to 2856
2017/10/13 07:06:32 Node BB95CD5FA902500 partition generation -2 changed to 3755
2017/10/13 07:06:32 Node BB9F0D0657AC40C partition generation -2 changed to 2859
2017/10/13 07:06:32 Node BB9E88A637AC40C partition generation -2 changed to 3207
2017/10/13 07:06:32 Node BB9E0D0657AC40C partition generation -2 changed to 2757
2017/10/13 07:06:32 Node BB9E047FB902500 partition generation -2 changed to 3206
2017/10/13 07:06:32 Node BB97CD1657AC40C partition generation -2 changed to 2462
2017/10/13 07:06:32 Node BB97CC4FA902500 partition generation -2 changed to 2886
2017/10/13 07:06:35 Connection to address `10.57.115.91:3000` failed to establish with error: dial tcp 10.57.115.91:3000: getsockopt: no route to host
2017/10/13 07:06:35 Seed 10.57.115.91:3000 failed: dial tcp 10.57.115.91:3000: getsockopt: no route to host
2017/10/13 07:12:03 No connections available; seeding...
2017/10/13 07:12:03 Seeding the cluster. Seeds count: 10
2017/10/13 07:12:03 Node BB9E0D0657AC40C partition generation -2 changed to 2757
2017/10/13 07:12:03 Node BB930D0657AC40C partition generation -2 changed to 3280
2017/10/13 07:12:03 Node BB9D4D0657AC40C partition generation -2 changed to 1234
2017/10/13 07:12:03 Node BB940BFFA902500 partition generation -2 changed to 3593
2017/10/13 07:12:03 Node BB97858FB902500 partition generation -2 changed to 3747
2017/10/13 07:12:03 Node BB90857FB902500 partition generation -2 changed to 1268
2017/10/13 07:12:03 Tend finished. Live node count changes from 0 to 6
2017/10/13 07:12:03 Tend finished. Live node count changes from 6 to 14
2017/10/13 07:12:03 Node BB9A4083A7AC40C partition generation -2 changed to 2856
2017/10/13 07:12:03 Node BB97CD1657AC40C partition generation -2 changed to 2462
2017/10/13 07:12:03 Node BB9E88A637AC40C partition generation -2 changed to 3207
2017/10/13 07:12:03 Node BB95CD5FA902500 partition generation -2 changed to 3755
2017/10/13 07:12:03 Node BB9E047FB902500 partition generation -2 changed to 3206
2017/10/13 07:12:03 Node BB97CC4FA902500 partition generation -2 changed to 2886
2017/10/13 07:12:03 Node BB9F0D0657AC40C partition generation -2 changed to 2859
2017/10/13 07:12:03 Node BB99C083A7AC40C partition generation -2 changed to 1811
2017/10/13 07:12:06 Connection to address `10.57.115.91:3000` failed to establish with error: dial tcp 10.57.115.91:3000: getsockopt: no route to host
2017/10/13 07:12:06 Seed 10.57.115.91:3000 failed: dial tcp 10.57.115.91:3000: getsockopt: no route to host
2017/10/13 09:02:32 Node BB90857FB902500 10.124.117.88:3000: No available connections to the node. Connection Pool was empty, and limited to certain number of connections.
2017/10/13 09:02:32 Node BB90857FB902500 10.124.117.88:3000: No available connections to the node. Connection Pool was empty, and limited to certain number of connections.
2017/10/13 09:02:32 Node BB90857FB902500 10.124.117.88:3000: No available connections to the node. Connection Pool was empty, and limited to certain number of connections.
2017/10/13 09:02:32 Node BB90857FB902500 10.124.117.88:3000: No available connections to the node. Connection Pool was empty, and limited to certain number of connections.
2017/10/13 09:02:32 Node BB90857FB902500 10.124.117.88:3000: No available connections to the node. Connection Pool was empty, and limited to certain number of connections.
2017/10/13 09:02:32 Node BB90857FB902500 10.124.117.88:3000: No available connections to the node. Connection Pool was empty, and limited to certain number of connections.
2017/10/13 09:02:32 Node BB90857FB902500 10.124.117.88:3000: No available connections to the node. Connection Pool was empty, and limited to certain number of connections.
2017/10/13 09:02:32 Node BB90857FB902500 10.124.117.88:3000: No available connections to the node. Connection Pool was empty, and limited to certain number of connections.
2017/10/13 09:02:32 Node BB90857FB902500 10.124.117.88:3000: No available connections to the node. Connection Pool was empty, and limited to certain number of connections.
2017/10/13 09:02:32 Node BB90857FB902500 10.124.117.88:3000: No available connections to the node. Connection Pool was empty, and limited to certain number of connections.
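
For context, the connection pool those last errors refer to is bounded by the Go client's ClientPolicy. This is a minimal sketch of the relevant knobs (field names from aerospike-client-go v1; the values shown are illustrative assumptions, not necessarily what we run with):

package main

import (
    "log"
    "time"

    aero "github.com/aerospike/aerospike-client-go"
)

func newClient() *aero.Client {
    policy := aero.NewClientPolicy()
    policy.Timeout = 5 * time.Second          // initial connection timeout
    policy.TendInterval = 1 * time.Second     // cluster tend interval (the one the warnings above refer to)
    policy.ConnectionQueueSize = 256          // per-node connection pool size
    policy.LimitConnectionsToQueueSize = true // when true and the pool is empty, the client returns the
                                              // "No available connections" error instead of opening more

    client, err := aero.NewClientWithPolicy(policy, "10.56.101.184", 3000)
    if err != nil {
        log.Fatal(err)
    }
    return client
}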

How can I debug this? Or are my expectations too high?

I would recommend isolating the different workloads and benchmarking them independently to get an idea of the cluster's limits, especially benchmarking your write workload on its own, maybe with a different client such as the Java benchmark tool (super easy to get going, and it would at least give you a quick idea). Nothing looks obvious in the write histograms: only very few transactions seem to take more than 1 ms (at least up to the point where you captured). I would expect to see some higher latencies in the write-repl-write histogram (going to the prole side over the network), but I am surprised to see some in the write-master histogram going that high, especially given that your data is in memory. However, if you do have hot keys on reads, that would cause some contention on the read part of the write (assuming updates). Also, even with transaction-pending-limit set to 0, if you really have that many hot keys, they will still flood the network between client and server.

Again, I would recommend trying a bare benchmark tool like the Java one and pushing the write load without any hot keys.
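
If it is easier to stay in Go than to set up the Java tool, a bare-bones write loop with unique keys (so no hot keys) would look roughly like the sketch below; the namespace/set/bin names are placeholders, and it only isolates raw write throughput rather than replacing a proper benchmark tool:

package main

import (
    "fmt"
    "log"
    "time"

    aero "github.com/aerospike/aerospike-client-go"
)

func main() {
    client, err := aero.NewClient("10.56.101.184", 3000)
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    wp := aero.NewWritePolicy(0, 0) // generation 0 (any), TTL 0 (namespace default)

    const n = 100000
    start := time.Now()
    for i := 0; i < n; i++ {
        // Unique key per write, so no single record becomes a hot key.
        key, err := aero.NewKey("production", "bench", fmt.Sprintf("bench-%d", i))
        if err != nil {
            log.Fatal(err)
        }
        if err := client.Put(wp, key, aero.BinMap{"v": i}); err != nil {
            log.Println("write failed:", err)
        }
    }
    elapsed := time.Since(start)
    log.Printf("%d writes in %v (%.0f writes/sec from one goroutine)", n, elapsed, float64(n)/elapsed.Seconds())
}

Run several of these in parallel (goroutines and/or processes) to approach the concurrency of your real writers.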

I will try the benchmark tool.

The batch reads graph seems incorrect: we are sending ~160-180K/s batch reads (not batch index), each with a batch size of 6-8.

What about this?

And one more question: RAM usage was high on all nodes, so I added one more node, but system_free_mem_pct on each machine remained the same. It was 2-5% on each machine and stayed there, except on the newly added machine (10%). When I restarted one of the nodes, memory usage on that node dropped, but it stayed the same on all the other nodes. Is this expected? Is there a way to reduce it without restarting each node?

I believe the batch reads graph only shows batch-index transactions, so if you are doing batch-direct, they wouldn't show up there.
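
One way to confirm which path the clients are actually hitting is to compare the batch counters in the node statistics, e.g.

$ asinfo -v "statistics" -l | grep batch

If I remember the 3.x stat names correctly, the old batch-direct path increments counters like batch_initiate, while the batch-index path increments batch_index_initiate and friends; your ticker lines above do show a batch-index histogram, so at least some of the traffic appears to be going through batch index.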

Regarding RAM usage, I wouldn't necessarily read too much into system_free_mem_pct, since Linux will usually use spare RAM for caching, especially when a large file is in use (which is the case for you, as you have a ~1 TB file configured). I would make sure there is enough free memory outside of caches/buffers and that the system doesn't swap, which would impact performance.
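
Roughly speaking, on each node you can compare the buffers/cache numbers from free -m against the asd resident set size in top/htop: memory held purely as page cache for the data file gets reclaimed under pressure, so what really has to fit is the asd RSS plus index memory, and any non-zero swap usage would be the actual red flag.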