Invalid_Node_Error (Code -3) after some writes

Pankaj7003 · March 19, 2019, 6:38am

We have a 4-node Aerospike cluster (build 3.12.0, Community Edition). The cluster works fine for reads and writes except when we write at a high rate through the Java client from a Spark job. While the job runs, writes happen at 500 TPS on all nodes, as seen in AMC. But either the job fails after some time, or, if it passes, the second job fails within a few seconds with “com.aerospike.client.AerospikeException$InvalidNode: Error Code -3: Invalid node”. Subsequent jobs hit the same error for quite some time (some hours) before the next job can write any data. At the time of failure the cluster seems healthy on all the metrics we checked, viz. client connections, open connections, pending IO tasks, etc.

Following is the namespace config:

namespace {name_space_name} {
        replication-factor 1
        memory-size 50G
        default-ttl 10D
        storage-engine device {
                file /storage/aerospike/{file_name}.dat
                filesize 350G
                data-in-memory true
        }
}

kporter · March 19, 2019, 7:26am

The rest of your config would be more helpful. Likely the old paxos or heartbeat implementations hit an issue. In your version, you can upgrade the heartbeat protocol to v3, which may help. But I suggest upgrading to 3.13 instead, which reworks much of the distributed system. Also, since every upgrade from a prior version must go through 3.13, the period during which we backport bug fixes to it has been greatly extended. You can find instructions here: https://www.aerospike.com/docs/operations/upgrade/cluster_to_3_13/.

Additional information about 3.13 can be found here: What’s New in Aerospike 3.13 and 3.14?

Pankaj7003 · March 19, 2019, 7:39am

Thanks for the speedy reply, @kporter. We will upgrade the heartbeat protocol and try that first, as a cluster upgrade is a comparatively bigger task. Meanwhile, below is the rest of the config in case it helps:

network {
        service {
                address any
                port 3000
        }

        heartbeat {
                mode mesh
                port 3002

                mesh-seed-address-port {ip1} 3002
                mesh-seed-address-port {ip2} 3002
                mesh-seed-address-port {ip3} 3002
                mesh-seed-address-port {ip4} 3002

                interval 150
                timeout 20
        }

        fabric {
                port 3001
        }

        info {
                port 3003
        }
}

service {
        user {username}
        group {group_name}

        nsup-period 100
        paxos-single-replica-limit 1
        service-threads 20
        transaction-queues 20
        transaction-threads-per-queue 3
        transaction-pending-limit 15000
        proto-fd-max 50000
        migrate-threads 1
        pidfile /var/run/aerospike/asd.pid
}

Pankaj7003 · March 19, 2019, 12:40pm

Unfortunately it didn’t help, and the behavior is still the same. Also, what I missed in my last post is that if we restart the Aerospike nodes, the cluster starts accepting bulk writes again. Please let me know if upgrading is the only viable option.

meher · March 24, 2019, 6:22pm

This error (-3, invalid node) could be caused by attempting connections without having the cluster object instantiated (no nodes have been discovered in the cluster). Also, are you using the latest Java Client?
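One way to rule this case out on the client side is to fail fast when no nodes are known, before a batch of writes starts. The sketch below is only an illustration and not part of the Aerospike client: `ClusterGuard` and `requireNodes` are hypothetical names, and the node list is passed in as a `Supplier` so the snippet runs without the client library. With the real Java client I would expect `client::getNodeNames` to work as the supplier, but treat that wiring as an assumption to check against your client version.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper (not an Aerospike API): refuses to proceed
// when the client reports no discovered cluster nodes.
final class ClusterGuard {
    private ClusterGuard() {}

    // nodeNames is a stand-in for whatever reports the known nodes,
    // e.g. client::getNodeNames (assumed wiring, verify for your client).
    static List<String> requireNodes(Supplier<List<String>> nodeNames) {
        List<String> nodes = nodeNames.get();
        if (nodes == null || nodes.isEmpty()) {
            throw new IllegalStateException(
                "No cluster nodes discovered; not attempting writes");
        }
        return nodes;
    }
}
```

Calling `ClusterGuard.requireNodes(...)` at the start of each Spark partition would at least separate "the client never saw the cluster" from "the client lost nodes mid-job".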

Pankaj7003 · April 2, 2019, 6:09am

“attempting connections without having the cluster object instantiated”

In that case there shouldn’t be any updates going through at all, but writes do go through for some time and then fail. BTW, when would this scenario happen?

“Also, are you using the latest Java Client?”

We’re using Java Client “3.3.2”. We will check whether there are any breaking changes in the latest client.

meher · April 3, 2019, 1:17am

I am not an expert on client coding best practices, but I guess it could depend on how the cluster object is initialized or re-initialized, or it could happen when recovering from a short-lived network outage?
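To illustrate the initialization point: a common pattern is to create the client once per JVM (for Spark, once per executor) and reuse it across tasks, rather than opening and closing one per partition or per record. Below is a minimal sketch, assuming a hypothetical `SharedClient` holder. The factory is a plain `Supplier`, so in practice it would wrap something like `new AerospikeClient(host, port)`, which is not shown here. Whether this is related to the error in this thread is not established.

```java
import java.util.Objects;
import java.util.function.Supplier;

// Hypothetical holder (not an Aerospike API): builds the client lazily
// on first use and hands the same instance to every later caller.
final class SharedClient<T> {
    private final Supplier<T> factory;
    private T instance;

    SharedClient(Supplier<T> factory) {
        this.factory = Objects.requireNonNull(factory, "factory");
    }

    // synchronized so concurrent tasks in one JVM cannot create two clients.
    synchronized T get() {
        if (instance == null) {
            instance = Objects.requireNonNull(factory.get(), "factory returned null");
        }
        return instance;
    }
}
```

A static field holding one `SharedClient` per executor JVM would then be the only place where the client is constructed.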

I would definitely suggest trying the latest client. There would be some changes to apply, but I don’t think there are many. The release notes should have links to the relevant details.

Nipun_Jain · August 17, 2019, 5:27am

@Pankaj7003 How did you resolve this issue? I am facing the same issue, using Community Edition 4.0.19.

Brian · August 19, 2019, 4:59pm

I recommend upgrading to the latest Java client.

system · August 25, 2019, 5:10pm

This topic was automatically closed 6 days after the last reply. New replies are no longer allowed.
