Client failures when a node is removed


#1

We’ve been facing this issue for a long time now.

Whenever a node goes away (mostly when we remove an older node), we see lots of client errors until we restart all the client servers.

Error occurred while performing batch get in aerospike (9, 'Timeout: timeout=10000 iterations=1 failedNodes=0 failedConns=0', 'src/main/aerospike/as_command.c', 566)

I expect a few errors initially until clients refresh the server node IPs, but the errors continue even after a couple of hours.

There should probably be a way to take a node out of the server IP list that goes to clients, so we can at least handle cases where we remove nodes manually.

A node going down due to hardware failure should also be handled, though; otherwise, what’s the point of having an HA cluster with multiple nodes?

Please suggest how we can deal with such situations for now.
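For now I’m considering a blunt client-side workaround: if timeouts persist, tear down and recreate the client, since reconnecting forces it to re-discover the live cluster. A rough sketch (`make_client` and `op` are hypothetical placeholders for our own factory and batch-get call, not Aerospike API names):

```python
import time

def with_reconnect(make_client, op, max_retries=3, backoff_s=0.5):
    """Run op(client); on failure, rebuild the client and retry.

    make_client: zero-arg factory returning a freshly connected client
    op: callable taking the client, e.g. performing a batch get
    """
    client = make_client()
    for attempt in range(max_retries + 1):
        try:
            return op(client)
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(backoff_s * (2 ** attempt))  # back off before retrying
            try:
                client.close()  # drop the stale connection
            except Exception:
                pass
            client = make_client()  # reconnecting re-learns the node list
```

It’s ugly, but it would at least avoid restarting every client server by hand.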

Thanks!

Aerospike version: 3.12.1.1 CE


#2

This doesn’t sound normal.

  1. Which client (and version) are you using?
  2. Please share your aerospike.conf.

#3
  1. Aerospike Python client 2.0.6
  2. aerospike.conf
service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    pidfile /var/run/aerospike/asd.pid
    service-threads 4
    transaction-queues 4
    transaction-threads-per-queue 4
    proto-fd-max 15000
}

logging {
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}

network {
    service {
        address any
        port 3000
    }

    heartbeat {
        mode mesh
        port 3002

        mesh-seed-address-port aerospike-1.example.com 3002
        mesh-seed-address-port aerospike-2.example.com 3002

        interval 150
        timeout 10
    }

    fabric {
        port 3001
    }

    info {
        port 3003
    }
}

namespace cf {
    replication-factor 2
    memory-size 7G
    default-ttl 0

    storage-engine device {
        device /dev/nvme0n1
        write-block-size 128K
    }
}

I’m running two i3.large (2 cores, 15 GB RAM, ~450 GB local NVMe SSD) EC2 instances on AWS.
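For reference, this is roughly how the client is seeded; the hostnames mirror the mesh-seed entries in the config above, and the policy timeout matches the 10000 ms in the error message (a sketch of the config dict only, not a claim about client defaults):

```python
# Seed the Python client with both nodes so that new connections still
# succeed after any single seed host is removed from the cluster.
config = {
    'hosts': [
        ('aerospike-1.example.com', 3000),
        ('aerospike-2.example.com', 3000),
    ],
    'policies': {
        'timeout': 10000,  # ms; the same timeout that appears in the error
    },
}

# Connecting would then look like (requires the aerospike package):
# import aerospike
# client = aerospike.client(config).connect()
```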


#4

Could you test with the latest Python client?