Client failures when a node is removed


#1

We’ve been facing this issue for a long time now.

Whenever a node goes away (mostly when we remove an older node), we see lots of client errors until we restart all the client servers.

Error occurred while performing batch get in aerospike (9, 'Timeout: timeout=10000 iterations=1 failedNodes=0 failedConns=0', 'src/main/aerospike/as_command.c', 566)

I expect a few errors initially until clients refresh the server node IPs, but the errors continue even after a couple of hours.

There should probably be a way to take a node out of the server IP list that goes to clients, so we can at least handle cases where we remove nodes manually.

A node going down due to hardware failure should also be handled, though; otherwise, what’s the point of having an HA cluster with multiple nodes?

Please suggest how we can deal with such situations for now.
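For now I’m considering a blunt client-side workaround: if timeouts persist, tear down and recreate the client, since reconnecting forces it to re-discover the live cluster. A rough sketch (`make_client` and `op` are hypothetical placeholders for our own factory and batch-get call, not Aerospike API names):

```python
import time

def with_reconnect(make_client, op, max_retries=3, backoff_s=0.5):
    """Run op(client); on failure, rebuild the client and retry.

    make_client: zero-arg factory returning a freshly connected client
    op: callable taking the client, e.g. performing a batch get
    """
    client = make_client()
    for attempt in range(max_retries + 1):
        try:
            return op(client)
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(backoff_s * (2 ** attempt))  # back off before retrying
            try:
                client.close()  # drop the stale connection
            except Exception:
                pass
            client = make_client()  # reconnecting re-learns the node list
```

It’s ugly, but it would at least avoid restarting every client server by hand.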

Thanks!

Aerospike version: 3.12.1.1 CE


#2

This doesn’t sound normal.

  1. Which client (and version) are you using?
  2. Please share your aerospike.conf.

#3
  1. Aerospike Python client 2.0.6
  2. aerospike.conf
service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    pidfile /var/run/aerospike/asd.pid
    service-threads 4
    transaction-queues 4
    transaction-threads-per-queue 4
    proto-fd-max 15000
}

logging {
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}

network {
    service {
        address any
        port 3000
    }

    heartbeat {
        mode mesh
        port 3002

        mesh-seed-address-port aerospike-1.example.com 3002
        mesh-seed-address-port aerospike-2.example.com 3002

        interval 150
        timeout 10
    }

    fabric {
        port 3001
    }

    info {
        port 3003
    }
}

namespace cf {
    replication-factor 2
    memory-size 7G
    default-ttl 0

    storage-engine device {
        device /dev/nvme0n1
        write-block-size 128K
    }
}

I’m running two i3.large (2 cores, 15 GB RAM, ~450 GB local NVMe SSD) EC2 instances on AWS.
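For reference, this is roughly how the client is seeded; the hostnames mirror the mesh-seed entries in the config above, and the policy timeout matches the 10000 ms in the error message (a sketch of the config dict only, not a claim about client defaults):

```python
# Seed the Python client with both nodes so that new connections still
# succeed after any single seed host is removed from the cluster.
config = {
    'hosts': [
        ('aerospike-1.example.com', 3000),
        ('aerospike-2.example.com', 3000),
    ],
    'policies': {
        'timeout': 10000,  # ms; the same timeout that appears in the error
    },
}

# Connecting would then look like (requires the aerospike package):
# import aerospike
# client = aerospike.client(config).connect()
```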


#4

Could you test with the latest Python client?