We have a 4-node Aerospike cluster (build 3.12.0, Community Edition). The cluster works fine for reads and writes, except when we write at a high rate through the Java client from a Spark job.
While the job runs, writes happen at 500 TPS on all nodes, as seen in AMC. But either the job fails after some time, or, if it succeeds, the next job fails within a few seconds with “com.aerospike.client.AerospikeException$InvalidNode: Error Code -3: Invalid node”. Subsequent jobs hit the same error for quite some time (some hours) before a job can write data again.
At the time of the job failure, the cluster seems healthy on all the metrics we checked, viz. client connections, open connections, pending IO tasks, etc.
Seeing the rest of your config would be more helpful. Likely the old paxos or heartbeat implementations hit an issue. In your version, you can upgrade the heartbeat protocol to v3, which may help. But I suggest upgrading to 3.13 instead, since 3.13 reworks much of the distributed system. Also, since you must upgrade through 3.13 from prior versions, it has had a large extension to the period during which we backport bug fixes. You can find instructions here: https://www.aerospike.com/docs/operations/upgrade/cluster_to_3_13/.
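For reference, a minimal sketch of what selecting heartbeat protocol v3 might look like in aerospike.conf. It assumes the `protocol` setting of the `heartbeat` stanza and reuses placeholder values in the style of this thread. Check the documentation for your build before applying it, in particular for how to roll the change out across nodes.

```
network {
    heartbeat {
        mode mesh
        port 3002
        mesh-seed-address-port {ip1} 3002
        # assumption: v3 selects the newer heartbeat protocol on 3.12
        protocol v3
        interval 150
        timeout 20
    }
}
```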
Thanks for the speedy reply, @kporter. We will upgrade the heartbeat protocol and try that first, as a cluster upgrade is a comparatively bigger task. Meanwhile, below is the rest of the config, if helpful:
network {
    service {
        address any
        port 3000
    }
    heartbeat {
        mode mesh
        port 3002
        mesh-seed-address-port {ip1} 3002
        mesh-seed-address-port {ip2} 3002
        mesh-seed-address-port {ip3} 3002
        mesh-seed-address-port {ip4} 3002
        interval 150
        timeout 20
    }
    fabric {
        port 3001
    }
    info {
        port 3003
    }
}
service {
    user {username}
    group {group_name}
    nsup-period 100
    paxos-single-replica-limit 1
    service-threads 20
    transaction-queues 20
    transaction-threads-per-queue 3
    transaction-pending-limit 15000
    proto-fd-max 50000
    migrate-threads 1
    pidfile /var/run/aerospike/asd.pid
}
Unfortunately, that didn’t help; the behavior is still the same. Also, what I missed in my last post is that if we restart the Aerospike nodes, the cluster starts accepting bulk writes again. Please let me know if upgrading is the only viable option.
This error (-3, invalid node) could be caused by attempting connections without the cluster object being instantiated (no nodes have been discovered in the cluster). Also, are you using the latest Java client?
I am not an expert on client coding best practices, but I guess it could depend on how the cluster object is initialized / re-initialized, or it could happen when recovering from a short-lived network outage?
I would definitely suggest trying the latest client. There would be some changes to apply, but I don’t think there are many. The release notes have links to the relevant details.
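To illustrate the point about how the cluster object is initialized and re-initialized, here is a minimal sketch of one shared, lazily created client per JVM (for a Spark job, that would mean one per executor) rather than a new client per task or per record. `SharedClient`, `get`, `reset`, and `timesCreated` are hypothetical names made up for this example, not part of the Aerospike API. The class uses only the JDK and works with any `AutoCloseable`; whether this pattern fits the job in question is an assumption.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper: holds one lazily created client that all threads share.
 * In a real job the factory might be something like
 *   () -> new AerospikeClient("some-host", 3000)
 * (host and port here are placeholders).
 */
final class SharedClient<T extends AutoCloseable> {
    private final Supplier<T> factory;
    private T instance;   // null until first use or after reset()
    private int created;  // how many times the factory has been invoked

    SharedClient(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the shared client, creating it on first use. */
    synchronized T get() {
        if (instance == null) {
            instance = factory.get();
            created++;
        }
        return instance;
    }

    /** Closes and drops the current client; the next get() builds a new one. */
    synchronized void reset() {
        if (instance != null) {
            try {
                instance.close();
            } catch (Exception e) {
                // Best effort: the client is being discarded anyway.
            }
            instance = null;
        }
    }

    /** Number of clients created so far (useful for spotting accidental churn). */
    synchronized int timesCreated() {
        return created;
    }
}
```

Tasks would call `get()` instead of constructing their own client, and the job would call `reset()` only when it decides the client is unusable, for example after repeated failures. If `timesCreated()` grows with the number of tasks, the job is probably re-creating clients more often than intended.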