Write Performance Problem during Migrations while adding a new Node

#1

Hi Community,

I have a 4-node cluster running CE 3.14.1.2 on Intel DC P3700 NVMe devices and Ubuntu 14.04. Now I want to add a new node running CE 4.2.0.5 on Intel Optane P4800X devices and Ubuntu 18.04.

All 5 nodes have the same hardware: 2x 14-core Xeon CPUs, 256 GB RAM, and 10 GbE networking.

The normal workload is about 500K TPS read and 200K TPS write.

After starting the new node, write latency increases dramatically, and clients see roughly 10% of requests timing out at 200 ms.

After stopping the new node, the 4-node cluster runs the rebuild (migration) with nearly zero timeouts.
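For reference, I watch the migration progress and the per-node pending migrates from asadm while reproducing this; the exact filter syntax may vary between tool versions, so treat this as a sketch:

asadm -e "info"
asadm -e "show statistics namespace like migrate"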

asbenchmark shows the following results:

aerospike-server-community-4.2.0.5-ubuntu18.04# Benchmark: 127.0.0.1 3000, namespace: test, set: testset, threads: 28, workload: READ_UPDATE
read: 50% (all bins: 100%, single bin: 0%), write: 50% (all bins: 100%, single bin: 0%)
keys: 1000000, start key: 0, transactions: 10000000, bins: 1, random values: false, throughput: unlimited
read policy:
    socketTimeout: 0, totalTimeout: 0, maxRetries: 2, sleepBetweenRetries: 0
    consistencyLevel: CONSISTENCY_ONE, replica: SEQUENCE, reportNotFound: false
write policy:
    socketTimeout: 0, totalTimeout: 0, maxRetries: 0, sleepBetweenRetries: 0
    commitLevel: COMMIT_ALL
Sync: connPoolsPerNode: 1
bin[0]: integer
debug: true

2019-04-08 14:33:59.471 write(tps=169167 timeouts=0 errors=0) read(tps=169056 timeouts=0 errors=0) total(tps=338223 timeouts=0 errors=0)
2019-04-08 14:34:00.472 write(tps=171774 timeouts=0 errors=0) read(tps=171856 timeouts=0 errors=0) total(tps=343630 timeouts=0 errors=0)
2019-04-08 14:34:01.472 write(tps=170508 timeouts=0 errors=0) read(tps=171653 timeouts=0 errors=0) total(tps=342161 timeouts=0 errors=0)
2019-04-08 14:34:02.473 write(tps=166346 timeouts=0 errors=0) read(tps=166596 timeouts=0 errors=0) total(tps=332942 timeouts=0 errors=0)
2019-04-08 14:34:03.475 write(tps=167878 timeouts=0 errors=0) read(tps=167603 timeouts=0 errors=0) total(tps=335481 timeouts=0 errors=0)
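For completeness, the parameters in the header above correspond roughly to the following benchmark invocation (the script name and flag spellings are an approximation and can differ between client/tool versions):

./run_benchmarks -h 127.0.0.1 -p 3000 -n test -s testset -k 1000000 -o I -w RU,50 -z 28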

All 5 nodes have the same config:

service {
        user root
        group root
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        pidfile /var/run/aerospike/asd.pid
        service-threads 28
        transaction-queues 28
        transaction-threads-per-queue 4
        proto-fd-max 15000
        migrate-threads 3
        migrate-max-num-incoming 12
}

What can I do to find the bottleneck?

#2

Try setting migrate-threads to 1.

asadm -e "asinfo -v 'set-config:context=service;migrate-threads=1'"

The threads will need to finish the partition they are currently migrating before they can exit.
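After changing it, you can verify the running value (and compare it across nodes) with something along these lines; the exact asadm syntax can differ a bit between versions:

asadm -e "show config service like migrate"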

#3

I also tested with migrate-threads=1 together with the other migration settings (applied dynamically, as shown below):

migrate-max-num-incoming=1

migrate-sleep=1
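These were set at runtime with the same asinfo pattern as above; assuming both parameters are dynamic in this server version, the commands were roughly:

asadm -e "asinfo -v 'set-config:context=service;migrate-max-num-incoming=1'"
asadm -e "asinfo -v 'set-config:context=service;migrate-sleep=1'"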

#4

Here are some ACT results:

output.txt is ACT version 5.0

ACT-STORAGE CONFIGURATION
device-names: /dev/nvme0n1 /dev/nvme1n1
num-devices: 2
service-threads: 56
num-queues: 56
threads-per-queue: 4
test-duration-sec: 86400
report-interval-sec: 1
microsecond-histograms: no
read-reqs-per-sec: 400000
write-reqs-per-sec: 200000
record-bytes: 1536
record-bytes-range-max: 0
large-block-op-kbytes: 128
replication-factor: 1
update-pct: 0
defrag-lwm-pct: 50
commit-to-device: no
commit-min-bytes: 0
tomb-raider: no
tomb-raider-sleep-usec: 0
max-reqs-queued: 100000
max-lag-sec: 10
scheduler-mode: noop

DERIVED CONFIGURATION
record-stored-bytes: 1536 ... 1536
internal-read-reqs-per-sec: 400000
internal-write-reqs-per-sec: 0
read-req-threads: 56
write-req-threads: 0
large-block-reads-per-sec: 4705.88
large-block-writes-per-sec: 4705.88

HISTOGRAM NAMES
reads
device-reads
/dev/nvme0n1-reads
/dev/nvme1n1-reads
large-block-reads
large-block-writes

        reads                                              device-reads
        %>(ms)                                             %>(ms)
slice        1      2      4      8     16     32     64        1      2      4      8     16     32     64
-----   ------ ------ ------ ------ ------ ------ ------   ------ ------ ------ ------ ------ ------ ------
    1    73.33  35.22   0.01   0.00   0.00   0.00   0.00     0.00   0.00   0.00   0.00   0.00   0.00   0.00
    2    73.39  35.18   0.00   0.00   0.00   0.00   0.00     0.00   0.00   0.00   0.00   0.00   0.00   0.00
    3    73.25  34.99   0.00   0.00   0.00   0.00   0.00     0.00   0.00   0.00   0.00   0.00   0.00   0.00
    4    73.33  35.08   0.00   0.00   0.00   0.00   0.00     0.00   0.00   0.00   0.00   0.00   0.00   0.00
    5    73.33  35.10   0.00   0.00   0.00   0.00   0.00     0.00   0.00   0.00   0.00   0.00   0.00   0.00
    6    73.37  35.17   0.00   0.00   0.00   0.00   0.00     0.00   0.00   0.00   0.00   0.00   0.00   0.00
    7    73.41  35.27   0.00   0.00   0.00   0.00   0.00     0.00   0.00   0.00   0.00   0.00   0.00   0.00
    8    73.35  35.19   0.00   0.00   0.00   0.00   0.00     0.00   0.00   0.00   0.00   0.00   0.00   0.00
    9    73.35  35.13   0.00   0.00   0.00   0.00   0.00     0.00   0.00   0.00   0.00   0.00   0.00   0.00
-----   ------ ------ ------ ------ ------ ------ ------   ------ ------ ------ ------ ------ ------ ------
  avg    73.35  35.15   0.00   0.00   0.00   0.00   0.00     0.00   0.00   0.00   0.00   0.00   0.00   0.00
  max    73.41  35.27   0.01   0.00   0.00   0.00   0.00     0.00   0.00   0.00   0.00   0.00   0.00   0.00
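For reference, this is roughly how the run and the latency table were produced; the binary and script names below follow my understanding of the ACT 5.x layout and may differ for other versions, so treat this as a sketch:

./target/bin/act_storage act_storage.conf > output.txt &
./analysis/act_latency.py -l output.txt
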
#5

Would it be possible to run ACT on one of the original nodes as well?

There was a histogram correction in 4.4 which can make it appear that latencies are worse when in fact they are the same. Because of this correction, your timeouts are more accurate and therefore slightly more aggressive, which could be causing your extra timeouts. See "Write Latency Increase After Upgrade to Aerospike 4.4 or higher" for details.

#6

I have some histogram data. You can see that the write-restart and write-repl-write values are very high. After stopping the new node, the whole cluster gets back to nearly 0 ms response times.

Apr 09 2019 22:40:18 GMT: INFO (info): (ticker.c:884) {rtblab} special-errors: key-busy 60163 record-too-big 60
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-read (1445455775 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (00: 1445421595) (01: 0000015529) (02: 0000011882) (03: 0000006593)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:266)  (04: 0000000176)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-read-start (1445455777 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (00: 1445423015) (01: 0000014554) (02: 0000011595) (03: 0000006447)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:266)  (04: 0000000166)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-read-restart (0 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-read-dup-res (0 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-read-repl-ping (0 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-read-local (1445455779 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (00: 1445454931) (01: 0000000589) (02: 0000000193) (03: 0000000059)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:266)  (04: 0000000007)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-read-response (1445455779 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (00: 1445455683) (01: 0000000049) (02: 0000000022) (03: 0000000025)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-write (228655197 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (00: 0187723300) (01: 0009192659) (02: 0008223924) (03: 0003665270)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (04: 0003055398) (05: 0002116116) (06: 0002763157) (07: 0004710945)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:266)  (08: 0006339713) (09: 0000851616) (10: 0000013099)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-write-start (264709407 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (00: 0264703613) (01: 0000002601) (02: 0000001906) (03: 0000001238)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (04: 0000000029) (05: 0000000006) (06: 0000000008) (07: 0000000003)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:266)  (08: 0000000003)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-write-restart (26469005 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (00: 0010708228) (01: 0001506141) (02: 0001225920) (03: 0000894694)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (04: 0000995758) (05: 0001191419) (06: 0002145077) (07: 0003876114)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:266)  (08: 0003631694) (09: 0000293960)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-write-dup-res (0 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-write-master (256458610 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (00: 0256458247) (01: 0000000272) (02: 0000000073) (03: 0000000017)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:266)  (04: 0000000001)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-write-repl-write (211625771 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (00: 0178382463) (01: 0009595538) (02: 0007735045) (03: 0003272222)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (04: 0002627339) (05: 0001481479) (06: 0001573362) (07: 0002432604)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:266)  (08: 0003927606) (09: 0000586710) (10: 0000011403)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-write-response (228655197 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (00: 0228655169) (01: 0000000012) (02: 0000000006) (03: 0000000010)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-device-read-size (1108949458 total) bytes
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (10: 0812764001) (11: 0286121393) (12: 0006382386) (13: 0002320274)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (14: 0000990826) (15: 0000306172) (16: 0000058582) (17: 0000005163)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:266)  (18: 0000000661)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-device-write-size (2694932607 total) bytes
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (07: 1424592389) (08: 0968278861) (09: 0198560957) (10: 0061463346)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (11: 0021711986) (12: 0010377295) (13: 0005891306) (14: 0002955286)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (15: 0000935093) (16: 0000151664) (17: 0000014355) (18: 0000000069)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-/dev/nvme0n1-read (554549285 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (00: 0554547342) (01: 0000001123) (02: 0000000692) (03: 0000000113)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:266)  (04: 0000000015)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-/dev/nvme0n1-large-block-read (735303 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:266)  (00: 0000735301) (01: 0000000002)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-/dev/nvme0n1-write (2417197 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:266)  (00: 0002417193) (01: 0000000002) (02: 0000000002)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-/dev/nvme1n1-read (554400185 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:257)  (00: 0554398237) (01: 0000001117) (02: 0000000651) (03: 0000000154)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:266)  (04: 0000000026)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-/dev/nvme1n1-large-block-read (746035 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:266)  (00: 0000746032) (01: 0000000002) (02: 0000000001)
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:240) histogram dump: {rtblab}-/dev/nvme1n1-write (2429904 total) msec
Apr 09 2019 22:40:18 GMT: INFO (info): (hist.c:266)  (00: 0002429898) (01: 0000000003) (02: 0000000003)
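For anyone reproducing this: the detailed sub-histograms above (write-start, write-restart, write-repl-write, and so on) are the per-namespace benchmark histograms, which I enabled dynamically; the pattern was roughly the following (parameter names as I understand them for this server version, so please double-check against the configuration reference):

asadm -e "asinfo -v 'set-config:context=namespace;id=rtblab;enable-benchmarks-write=true'"
asadm -e "asinfo -v 'set-config:context=namespace;id=rtblab;enable-benchmarks-read=true'"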

#7

  1. Is the new node on a different rack than the original 4?
  2. The AMC chart you provide, is the top line the new node? If so, it seems to be getting 2-3x more writes than the others.
  3. Are these logs from the new node? The latencies appear to be coming from ‘{rtblab}-write-repl-write’ which typically indicates a network bottleneck.

Could you provide the output of sar -A?
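Something like the following would work to capture it while the new node is in the cluster and migrations are running (the interval and count are just examples; sar -n DEV is worth a separate look for per-interface throughput):

sar -A 5 120 > sar_all.txt
sar -n DEV 5 120 > sar_net.txt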