One node performing poorly in cluster

shaheenm · July 30, 2015, 2:32am

Hello,

We are testing out Aerospike and are trying to figure out how to troubleshoot a slow node. We are aiming for sub-millisecond response times, which we have been able to achieve before; however, the cluster we are now testing is under the most load.

We are doing very simple by-key lookups. When looking at the asmonitor latency read stats, I see:

                    timespan                  ops/sec  >1ms   >8ms    >64ms
10.1.109.77:3000    02:21:56-GMT->02:22:06    6357.7   0.98   0.00    0.00
10.1.111.71:3000    02:22:01-GMT->02:22:11    6109.4   6.37   3.35    1.00
10.1.111.72:3000    02:22:03-GMT->02:22:13    5932.2   0.91   0.06    0.00
10.1.111.73:3000    02:21:56-GMT->02:22:06    6012.0   0.92   0.00    0.00

As you can see, the second node in the list is performing poorly compared to the rest. I have compared logs and don’t see that node doing anything unusual. Any ideas for troubleshooting this?
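
One way to make the comparison above less dependent on eyeballing is to flag any node whose share of reads slower than 1 ms sits well above the cluster median. The sketch below is only an illustration: it hard-codes the four rows from the asmonitor output, and the `flag_slow_nodes` helper and its 3x-median threshold are assumptions, not something asmonitor provides.

```python
from statistics import median

# Percentage of reads slower than 1 ms per node, copied from the asmonitor output above.
PCT_OVER_1MS = {
    "10.1.109.77:3000": 0.98,
    "10.1.111.71:3000": 6.37,
    "10.1.111.72:3000": 0.91,
    "10.1.111.73:3000": 0.92,
}


def flag_slow_nodes(pct_by_node, factor=3.0):
    """Return nodes whose value exceeds `factor` times the cluster median.

    Hypothetical helper; the factor of 3 is an arbitrary starting point.
    """
    baseline = median(pct_by_node.values())
    return sorted(node for node, pct in pct_by_node.items() if pct > factor * baseline)


if __name__ == "__main__":
    print(flag_slow_nodes(PCT_OVER_1MS))
```

Using the median rather than the mean keeps the one slow node from inflating its own baseline.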

Config:

service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    pidfile /var/run/aerospike/asd.pid
    service-threads 4
    transaction-queues 4
    transaction-threads-per-queue 4
    proto-fd-max 15000
}

namespace testing {
    replication-factor 2
    memory-size 8G
    high-water-memory-pct 70
    default-ttl 4d # 4 days, use 0 to never expire/evict.

    storage-engine device {
            device /dev/sdb

            # The 2 lines below optimize for SSD.
            scheduler-mode noop
            write-block-size 128K

            # Use the line below to store data in memory in addition to devices.
            # data-in-memory true
    }
}

Thanks in advance.

raj · July 30, 2015, 2:51pm

Can you enable microbenchmarks and share a few lines of the log?

http://www.aerospike.com/logging-guide/

– R

shaheenm · July 30, 2015, 5:48pm

Every time I post a response with the log info, I feel like it gets dropped from the thread?

shaheenm · July 30, 2015, 5:49pm

Let me try a link:

https://gist.github.com/smojtabai/105ce7f97698a313a779

Jul 30 2015 17:43:56 GMT: INFO (info): (hist.c::137) histogram dump: reads (298283678 total) msec
Jul 30 2015 17:43:56 GMT: INFO (info): (hist.c::154)  (00: 0287105840) (01: 0001695888) (02: 0001544777) (03: 0001268800)
Jul 30 2015 17:43:56 GMT: INFO (info): (hist.c::154)  (04: 0001463600) (05: 0001511113) (06: 0001998778) (07: 0001579406)
Jul 30 2015 17:43:56 GMT: INFO (info): (hist.c::163)  (08: 0000106241) (09: 0000009216) (10: 0000000019)
Jul 30 2015 17:43:56 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (32788406 total) msec
Jul 30 2015 17:43:56 GMT: INFO (info): (hist.c::154)  (00: 0032167047) (01: 0000202199) (02: 0000146880) (03: 0000061331)
Jul 30 2015 17:43:56 GMT: INFO (info): (hist.c::154)  (04: 0000056433) (05: 0000050017) (06: 0000053412) (07: 0000048808)
Jul 30 2015 17:43:56 GMT: INFO (info): (hist.c::163)  (08: 0000002182) (09: 0000000097)
Jul 30 2015 17:43:56 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Jul 30 2015 17:43:56 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (32788406 total) msec

(Log excerpt truncated; the full file is in the gist linked above.)
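
To compare these histogram lines with the asmonitor percentages, the bucket counts can be folded into "slower than N ms" percentages. The sketch below assumes the usual reading of these dumps, in which bucket 00 counts operations under 1 ms and each later bucket doubles the upper bound (so bucket 04 and above means slower than 8 ms). The `parse_buckets` and `pct_slower_than` helpers are hypothetical names, and the input is the `reads` histogram from the excerpt above. The counts appear to be running totals rather than a 10-second window, so they are not expected to match an asmonitor sample exactly.

```python
import re

# Bucket lines of the "reads" histogram, copied from the log excerpt above.
READS_DUMP = """\
(00: 0287105840) (01: 0001695888) (02: 0001544777) (03: 0001268800)
(04: 0001463600) (05: 0001511113) (06: 0001998778) (07: 0001579406)
(08: 0000106241) (09: 0000009216) (10: 0000000019)
"""


def parse_buckets(text):
    """Map bucket index -> count from '(NN: COUNT)' pairs."""
    return {int(i): int(c) for i, c in re.findall(r"\((\d+): (\d+)\)", text)}


def pct_slower_than(buckets, first_slow_bucket):
    """Percentage of operations that fall in `first_slow_bucket` or any later bucket."""
    total = sum(buckets.values())
    slow = sum(c for i, c in buckets.items() if i >= first_slow_bucket)
    return 100.0 * slow / total


if __name__ == "__main__":
    buckets = parse_buckets(READS_DUMP)
    # Assumed mapping: bucket 1+ is >1 ms, bucket 4+ is >8 ms, bucket 7+ is >64 ms.
    for label, first in ((">1ms", 1), (">8ms", 4), (">64ms", 7)):
        print(f"{label}: {pct_slower_than(buckets, first):.2f}%")
```

Running the same calculation on the difference between two successive dumps would give a per-interval figure that is closer to what asmonitor reports.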

bayoukingpin · July 31, 2015, 7:44pm

There are a number of things this could be. If we assume the config you posted is the same for all servers, it looks like you are trying to store data on SSDs. There are a few things in your config that are non-standard, but are likely not causing this one node to have a problem:

The service-threads and transaction-queues settings should match the number of cores (counting hyperthreads) on the server.
The high-water-memory-pct is a bit high, but tolerable at 70%.
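
As a rough illustration of the first point, the logical core count (which includes hyperthreads) can be read on each server and turned into the two settings. This is a sketch only: `suggested_service_lines` is a hypothetical helper, and it simply applies the rule of thumb stated above rather than any official sizing formula.

```python
import os


def suggested_service_lines(logical_cores):
    """Build config lines that follow the 'match the core count' rule of thumb."""
    return [
        f"service-threads {logical_cores}",
        f"transaction-queues {logical_cores}",
    ]


if __name__ == "__main__":
    # os.cpu_count() reports logical CPUs, so hyperthreads are included.
    cores = os.cpu_count() or 1
    print("\n".join(suggested_service_lines(cores)))
```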

The first thing to check is whether the disks are configured the same way.

Sorry to be anal retentive, but are they the same SSD model?
You should also check to see how the disks have been attached to the server. Is this through a RAID controller or direct to motherboard? If RAID, you should see if the cache settings are the same. Aerospike recommends NoReadAhead and WriteThrough.
Have they all been overprovisioned the same way? You can check with hdparm 9.37+ by running “sudo hdparm -N /dev/sdb”. Please check whether the slow node is configured the same way as the others.
Check iostat on the slow node and compare it to another one. I generally use “sudo iostat -x 2”. Disregard the first one or two outputs, as they are not accurate.
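
Comparing iostat output by eye across nodes is error-prone, so a small parser can help. The sketch below assumes the text was captured from `sudo iostat -x 2` on each node. It keeps only the last report for the device (which skips the early, inaccurate ones) and looks columns up by header name, since column names differ between sysstat versions. The `last_device_stats` helper and the sample text with its numbers are made up for illustration and are not measurements from this cluster.

```python
def last_device_stats(iostat_text, device):
    """Return {column: value} from the last report for `device` in `iostat -x` output."""
    header, stats = None, None
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device"):
            header = fields[1:]  # column names for the rows that follow
        elif fields[0] == device and header:
            # Later reports overwrite earlier ones, so the last one wins.
            stats = dict(zip(header, map(float, fields[1:])))
    return stats


# Illustrative sample only: invented numbers and a shortened column set.
SAMPLE = """\
Device:  r/s     w/s    await  %util
sdb      10.0    5.0    0.90   12.00

Device:  r/s     w/s    await  %util
sdb      900.0   80.0   4.20   96.50
"""

if __name__ == "__main__":
    stats = last_device_stats(SAMPLE, "sdb")
    print(stats["await"], stats["%util"])
```

Running this on captures from the slow node and from a healthy one, then comparing the wait-time and utilization columns, gives a side-by-side view of whether the device itself is the bottleneck.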
