How to troubleshoot `Node not found for partition` errors


Context

This article describes how to troubleshoot a client that receives a Node not found for partition error.

For example, in a C# client application the exception looks like:

Aerospike.Client.AerospikeException+InvalidNode: Error -3: Node not found for partition testnamespace:1095

The AEROSPIKE_ERR_INVALID_NODE error means the client has no node mapped for the given partition (in the example above, partition ID 1095 in the namespace testnamespace). This error is not common. It can indicate a problem, possibly connectivity, that prevents the client from properly ‘tending’ all nodes in the cluster (tending is how a client learns which partitions each node owns). It can also indicate that the client has marked a node as inactive, because a client drops all partitions mapped to a node that becomes inactive.

When a node is genuinely inactive (it has left the cluster and no other node reports it as a peer), the remaining nodes in the cluster take ownership of the partitions it had, so under normal circumstances its departure should not cause this error on the client side.
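The behaviour described above can be pictured with a toy model. The sketch below is not the Aerospike client's real data structure; the class and method names (ToyPartitionMap, update_from_tend, mark_inactive, node_for) are hypothetical and exist only to show why a lookup fails once the mappings for an inactive node are dropped and no other node has yet been seen owning the partition.

```python
# Toy model of a client-side partition map (illustration only; all names
# here are hypothetical, not part of any Aerospike client library).

N_PARTITIONS = 4096  # each Aerospike namespace is split into 4096 partitions


class InvalidNode(Exception):
    """Stand-in for the client's INVALID_NODE error (code -3)."""


class ToyPartitionMap:
    def __init__(self):
        # (namespace, partition_id) -> node name
        self._owners = {}

    def update_from_tend(self, node, namespace, partition_ids):
        """Record the partitions a node reported owning during a tend cycle."""
        for pid in partition_ids:
            self._owners[(namespace, pid)] = node

    def mark_inactive(self, node):
        """Drop every partition mapped to a node the client considers inactive."""
        self._owners = {k: v for k, v in self._owners.items() if v != node}

    def node_for(self, namespace, pid):
        """Return the mapped node, or raise if the client has no mapping."""
        try:
            return self._owners[(namespace, pid)]
        except KeyError:
            raise InvalidNode(
                f"Error -3: Node not found for partition {namespace}:{pid}"
            ) from None


if __name__ == "__main__":
    pmap = ToyPartitionMap()
    pmap.update_from_tend("node-A", "testnamespace", range(0, 2048))
    pmap.update_from_tend("node-B", "testnamespace", range(2048, N_PARTITIONS))
    print(pmap.node_for("testnamespace", 1095))  # mapped to node-A

    pmap.mark_inactive("node-A")
    try:
        pmap.node_for("testnamespace", 1095)
    except InvalidNode as err:
        print(err)  # no mapping until another node is seen owning 1095

    # Once tending sees a remaining node claim the partition, lookups work again.
    pmap.update_from_tend("node-B", "testnamespace", [1095])
    print(pmap.node_for("testnamespace", 1095))
```

In this model the error only persists while the client cannot learn the new owner, which matches the article's point that the usual suspects are tending or connectivity problems rather than the node departure itself.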

The AEROSPIKE_ERR_INVALID_NODE error maps to error code -3. Older client versions map it to error code -8, and the C client library still maps it to -8.

In general, error codes > 0 are generated by the server and error codes < 0 are generated by the client. There are some exceptions, such as AEROSPIKE_ERR_TIMEOUT (9), which can be generated by both the client and the server. The server's PARTITION_UNAVAILABLE (11) is equivalent to AEROSPIKE_ERR_CLUSTER (11). AEROSPIKE_ERR_INVALID_NODE (-8 or -3) and AEROSPIKE_ERR_CONNECTION (-10) are generated only by the client.
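As a rough illustration of the sign convention above, the following sketch classifies a result code by its likely origin. It encodes only the codes named in this article; the function names (error_origin, describe) are hypothetical helpers, and treating 0 as "no error" is an addition not stated in the article.

```python
# Codes named in this article. This table is deliberately incomplete.
KNOWN_CODES = {
    -3: "AEROSPIKE_ERR_INVALID_NODE",
    -8: "AEROSPIKE_ERR_INVALID_NODE (older clients and the C client)",
    -10: "AEROSPIKE_ERR_CONNECTION",
    9: "AEROSPIKE_ERR_TIMEOUT",
    11: "AEROSPIKE_ERR_CLUSTER / PARTITION_UNAVAILABLE",
}

# Exceptions to the sign rule that the article names explicitly.
BOTH_SIDES = {9}


def error_origin(code):
    """Guess where a result code was generated, using the sign convention."""
    if code in BOTH_SIDES:
        return "client or server"
    if code > 0:
        return "server"
    if code < 0:
        return "client"
    return "no error"  # assumption: 0 is the success code


def describe(code):
    """Build a one-line summary suitable for a log message."""
    name = KNOWN_CODES.get(code, "unknown code")
    return f"{code} ({name}): generated by {error_origin(code)}"


if __name__ == "__main__":
    for code in (-3, -8, -10, 9, 11):
        print(describe(code))
```

A negative code such as -3 therefore points the investigation at the client and its view of the cluster rather than at a server-side failure.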

Method

  1. If this is a strong consistency namespace, follow the Node not found in Strong Consistency article.

  2. Check for any known bugs in the release notes, for example:

     • Server bug fixed in version 3.14.1.1:
       [AER-5690] - (CLUSTER) Server may return non-empty partition ownership map to client before initial rebalance, resulting in transactions getting ‘unavailable’ error.
     • As of Java client version 4.1.10, hitting the above condition (which would not occur against servers with the fix) returns error code -3 “INVALID_NODE”.
     • Java client versions before 4.2.0 attempt the transaction against a random node when a mapping is not found, so such errors are masked on those older client versions:
       CLIENT-1038 Throw InvalidNode exception instead of returning random node when master and prole nodes are unavailable for AP mode (now consistent with SC mode).
     • C# client versions prior to 4.1.0 could produce this error when a transaction caused the partition map to be read from a thread other than the tend thread:
       CLIENT-1386 Force volatile partition map reads when it occurs from a non cluster tend thread.
  3. Use the following commands to confirm the partition mappings for all the nodes in the cluster:
     asadm -e "show pmap"

     asadm -e "asinfo -v 'partition-info' -l"
  4. Use the following command to determine which nodes own which partitions:
     asadm -e "asinfo -v 'replicas-all'"
  5. Use the explain command in aql to determine which nodes own the partition a given record belongs to. For example:
     aql> explain select * from MyNamespace.MySet where PK=12345
  6. Finally, check the cluster for any node restarts or other cluster changes.
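For the known-bug check in step 2, the version thresholds quoted above can be compared mechanically. The sketch below is a minimal helper under stated assumptions: version strings are purely numeric and dot-separated, the function names (parse_version, known_issue_notes) are hypothetical, and the thresholds are only those listed in this article, not a complete list of relevant fixes.

```python
def parse_version(text):
    """Turn '3.14.1.1' into (3, 14, 1, 1) so versions compare numerically.

    Assumes a purely numeric dotted version; suffixes such as '-rc1'
    are not handled.
    """
    return tuple(int(part) for part in text.strip().split("."))


def known_issue_notes(server=None, java_client=None, csharp_client=None):
    """Return notes for versions that predate the fixes named in step 2."""
    notes = []
    if server and parse_version(server) < (3, 14, 1, 1):
        notes.append(
            "Server is older than 3.14.1.1: AER-5690 may apply."
        )
    if java_client and parse_version(java_client) < (4, 2, 0):
        notes.append(
            "Java client is older than 4.2.0: missing mappings may be "
            "masked by random-node selection (CLIENT-1038)."
        )
    if csharp_client and parse_version(csharp_client) < (4, 1, 0):
        notes.append(
            "C# client is older than 4.1.0: CLIENT-1386 may apply."
        )
    return notes


if __name__ == "__main__":
    # Example inputs are made up for illustration.
    for note in known_issue_notes(server="3.13.0.1", csharp_client="4.0.3"):
        print(note)
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as treating 3.9 as newer than 3.14.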

Keywords

ERROR CODE -3 -8 INVALID NODE NOT FOUND FOR PARTITION INVALID

Timestamp

November 2020
