Invalid_Node_Error (Code -3) after some writes

#1

We have a 4-node Aerospike cluster (build 3.12.0, Community Edition). The cluster works fine for reads and writes except when we write at a high rate through the Java client from a Spark job. While the job is running, writes happen at 500 TPS on all nodes, as seen in AMC. But either the job fails after some time or, even if it passes, the second job fails within a few seconds with “com.aerospike.client.AerospikeException$InvalidNode: Error Code -3: Invalid node”. Subsequent jobs fail with the same error for quite some time (some hours) before a job can write any data again. At the time of the failure the cluster seems healthy on all the metrics we checked, viz. client connections, open connections, pending IO tasks, etc.

Following is the config for the namespace:

namespace {name_space_name} {
        replication-factor 1
        memory-size 50G
        default-ttl 10D
        storage-engine device {
                file /storage/aerospike/{file_name}.dat
                filesize 350G
                data-in-memory true
        }
}

#2

The rest of your config would be more helpful. Most likely the old paxos or heartbeat implementation hit an issue. In your version you can upgrade the heartbeat protocol to v3, which may help, but I suggest upgrading to 3.13 instead, since 3.13 reworks much of the distributed system. Also, because every upgrade from a prior version must go through 3.13, the period during which we backport bug fixes to it has been extended considerably. You can find instructions here: https://www.aerospike.com/docs/operations/upgrade/cluster_to_3_13/.

Additional information about 3.13 can be found here: https://www.aerospike.com/blog/whats-new-aerospike-3-13-3-14/.
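For reference, the heartbeat protocol change mentioned above is a one-line setting in the heartbeat stanza. The fragment below is only a sketch: it assumes the parameter is named `protocol` and accepts `v3` on this 3.x build, `{seed_ip}` is a placeholder, and the other stanzas of the network block are omitted. Check the configuration reference for the installed version before applying it.

```
network {
        heartbeat {
                mode mesh
                port 3002

                # Placeholder seed; list the real mesh seed nodes here.
                mesh-seed-address-port {seed_ip} 3002

                # Assumed parameter name and value; verify against the
                # configuration reference for the installed 3.x build.
                protocol v3

                interval 150
                timeout 10
        }
}
```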

#3

Thanks for the speedy reply, @kporter. We will upgrade the heartbeat protocol and try that first, as a cluster upgrade is a comparatively bigger task. Meanwhile, below is the rest of the config, in case it helps:

network {
        service {
                address any
                port 3000
        }

        heartbeat {
                mode mesh
                port 3002

                mesh-seed-address-port {ip1} 3002
                mesh-seed-address-port {ip2} 3002
                mesh-seed-address-port {ip3} 3002
                mesh-seed-address-port {ip4} 3002

                interval 150
                timeout 20
        }

        fabric {
                port 3001
        }

        info {
                port 3003
        }
}

service {
        user {username}
        group {group_name}

        nsup-period 100
        paxos-single-replica-limit 1
        service-threads 20
        transaction-queues 20
        transaction-threads-per-queue 3
        transaction-pending-limit 15000
        proto-fd-max 50000
        migrate-threads 1
        pidfile /var/run/aerospike/asd.pid
}

#4

Unfortunately it didn’t help; the behavior is still the same. Also, what I missed in my last post is that if we restart the Aerospike nodes, the cluster starts accepting bulk writes again. Please let me know if upgrading is the only viable option.

#5

Error -3 (invalid node) could be caused by attempting connections without having the cluster object instantiated (no nodes have been discovered in the cluster). Also, are you using the latest Java Client?

#6

“attempting connections without having the cluster object instantiated”

In that case no updates should go through at all, but writes do go through for some time and then fail. BTW, when would this scenario happen?

“Also, are you using the latest Java Client?”

We’re using Java Client “3.3.2”. We will check whether the latest client has any breaking changes.

#7

I am not an expert on client coding best practices, but I guess it could depend on how the cluster object is initialized or re-initialized, or it could happen when recovering from a short-lived network outage.

I would definitely suggest trying the latest client. There would be some changes to apply, but I don’t think there are many. The release notes should have links to the relevant details.
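To make the point about how the cluster object is initialized or re-initialized concrete, here is a minimal sketch of one way a job could create the client once per JVM and reuse it, with an explicit reset for recovery. `SharedClientHolder` is a hypothetical helper, not part of the Aerospike client API, and the sketch assumes the client type can be closed through `AutoCloseable`. It is not known whether this pattern has anything to do with the failure described in this thread.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of the Aerospike client API.
// Lazily builds one shared instance per JVM and returns the same one to every caller.
final class SharedClientHolder<T extends AutoCloseable> {
    private final Supplier<T> factory;
    private T instance; // guarded by "this"

    SharedClientHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    // Returns the shared instance, creating it on first use.
    synchronized T get() {
        if (instance == null) {
            instance = factory.get();
        }
        return instance;
    }

    // Closes the current instance (if any) so that the next get() builds a fresh one.
    synchronized void reset() {
        if (instance == null) {
            return;
        }
        try {
            instance.close();
        } catch (Exception e) {
            throw new IllegalStateException("failed to close shared client", e);
        } finally {
            instance = null;
        }
    }
}
```

In a Spark job, the factory passed to the holder would be the place where the real client is constructed, and each task would call `get()` instead of constructing its own client. `reset()` is intended for the recovery path, for example after the client has reported that it is no longer connected.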