One node performing poorly in cluster


#1

Hello,

We are testing out Aerospike and are trying to figure out how to troubleshoot a slow node. We are aiming for sub-millisecond response times, which we have been able to achieve before. However, the cluster we are testing now is under the heaviest load we have tried so far.

We are doing very simple lookups by key. When looking at the asmonitor read latency stats, I see:

                    timespan                  ops/sec  >1ms   >8ms    >64ms
10.1.109.77:3000    02:21:56-GMT->02:22:06    6357.7   0.98   0.00    0.00
10.1.111.71:3000    02:22:01-GMT->02:22:11    6109.4   6.37   3.35    1.00
10.1.111.72:3000    02:22:03-GMT->02:22:13    5932.2   0.91   0.06    0.00
10.1.111.73:3000    02:21:56-GMT->02:22:06    6012.0   0.92   0.00    0.00

As you can see, the second node in the list is performing poorly compared to the rest. I have compared logs and don’t see that node doing anything unusual. Any ideas for troubleshooting this?
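For anyone who wants to flag this kind of outlier automatically, here is a minimal sketch that parses the table above and reports nodes whose >1ms percentage is well above the cluster median. The function names and the 3x threshold are my own assumptions, not anything asmonitor provides.

```python
import statistics

# Latency rows copied from the asmonitor output above.
SAMPLE = """\
10.1.109.77:3000    02:21:56-GMT->02:22:06    6357.7   0.98   0.00    0.00
10.1.111.71:3000    02:22:01-GMT->02:22:11    6109.4   6.37   3.35    1.00
10.1.111.72:3000    02:22:03-GMT->02:22:13    5932.2   0.91   0.06    0.00
10.1.111.73:3000    02:21:56-GMT->02:22:06    6012.0   0.92   0.00    0.00
"""


def parse_latency(text):
    """Return {node: {column: value}} for each six-field data row."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # skips the header and blank lines
        node, _timespan, ops, gt1, gt8, gt64 = parts
        try:
            rows[node] = {
                "ops/sec": float(ops),
                ">1ms": float(gt1),
                ">8ms": float(gt8),
                ">64ms": float(gt64),
            }
        except ValueError:
            continue  # not a numeric data row
    return rows


def slow_nodes(rows, column=">1ms", factor=3.0):
    """Nodes whose value exceeds factor times the cluster median.

    The factor of 3 is an arbitrary assumption; tune it for your cluster.
    """
    median = statistics.median(r[column] for r in rows.values())
    return [node for node, r in rows.items() if r[column] > factor * median]


if __name__ == "__main__":
    print(slow_nodes(parse_latency(SAMPLE)))
```

With the numbers above, the median >1ms value is 0.95, so only 10.1.111.71:3000 crosses the threshold.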

Config:

service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    pidfile /var/run/aerospike/asd.pid
    service-threads 4
    transaction-queues 4
    transaction-threads-per-queue 4
    proto-fd-max 15000
}

namespace testing {
    replication-factor 2
    memory-size 8G
    high-water-memory-pct 70
    default-ttl 4d # 4 days, use 0 to never expire/evict.

    storage-engine device {
            device /dev/sdb

            # The 2 lines below optimize for SSD.
            scheduler-mode noop
            write-block-size 128K

            # Use the line below to store data in memory in addition to devices.
            # data-in-memory true
    }
}

Thanks in advance.


#2

Can you enable microbenchmarks and share a few lines of the log?

– R


#6

Every time I post a response with the log info, I feel like it gets dropped from the thread?


#8

Let me try a link:


#9

There are a number of things this could be. If we assume the config you posted is the same for all servers, it looks like you are trying to store data on SSDs. There are a few things in your config that are non-standard, but are likely not causing this one node to have a problem:

  • The service-threads and transaction-queues values should match the number of cores on the server (counting hyperthreads).
  • The high-water-memory-pct is a bit high, but tolerable at 70%.
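As a rough illustration of the first point, the sketch below prints service-threads and transaction-queues lines sized to the logical CPU count of the machine it runs on. The helper names are hypothetical, and this only automates the rule of thumb above; it is not an official sizing tool.

```python
import os


def suggested_thread_settings(logical_cpus=None):
    """Map the two settings to the logical CPU count (hyperthreads included).

    os.cpu_count() reports logical CPUs; pass a number to override it.
    """
    n = logical_cpus or os.cpu_count() or 1
    return {"service-threads": n, "transaction-queues": n}


def as_config_lines(settings, indent="    "):
    """Render the settings in the same style as the service block above."""
    return [f"{indent}{name} {value}" for name, value in settings.items()]


if __name__ == "__main__":
    for line in as_config_lines(suggested_thread_settings()):
        print(line)
```

Run it on each server and compare the output with the values in that server's service block.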

The first thing to check is to see if the disks are configured the same.

  1. Sorry to be anal retentive, but are they the same SSD model?
  2. You should also check to see how the disks have been attached to the server. Is this through a RAID controller or direct to motherboard? If RAID, you should see if the cache settings are the same. Aerospike recommends NoReadAhead and WriteThrough.
  3. Have they all been overprovisioned the same way? You can check with hdparm 9.37+ by running “sudo hdparm -N /dev/sdb”. Please check to see if the slow node is configured the same way as the others.
  4. Check iostat on the slow node and compare it to another one. I generally use “sudo iostat -x 2”. Disregard the first one or two reports: the first shows averages since boot, not current activity.
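If you would rather not log in to each box by hand, something like the sketch below can run the same check on every node and point out the one whose output differs. It assumes key-based ssh and passwordless sudo on each host, and all function names are hypothetical. It only makes sense for checks with stable output such as hdparm -N; iostat numbers change on every run, so compare those by eye.

```python
import subprocess
from collections import defaultdict

# Node addresses from the latency table above, without the service port.
HOSTS = ["10.1.109.77", "10.1.111.71", "10.1.111.72", "10.1.111.73"]
COMMAND = "sudo hdparm -N /dev/sdb"


def run_over_ssh(host, command):
    """Run a command on a host via ssh and return its stdout.

    Assumes ssh keys and passwordless sudo are already set up.
    """
    result = subprocess.run(
        ["ssh", host, command], capture_output=True, text=True, check=False
    )
    return result.stdout.strip()


def group_by_output(hosts, command, runner=run_over_ssh):
    """Group hosts by the exact output the command produced on each."""
    groups = defaultdict(list)
    for host in hosts:
        groups[runner(host, command)].append(host)
    return dict(groups)


def odd_ones_out(groups):
    """Hosts outside the largest group (the first largest wins a tie)."""
    if not groups:
        return []
    majority = max(groups.values(), key=len)
    return sorted(
        host
        for hosts in groups.values()
        if hosts is not majority
        for host in hosts
    )


if __name__ == "__main__":
    groups = group_by_output(HOSTS, COMMAND)
    for output, hosts in groups.items():
        print(hosts)
        print(output)
    print("Differs from the majority:", odd_ones_out(groups))
```

The runner argument is there so the ssh call can be swapped out, for example to read saved output from files instead.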