Node Not Found For Partition error after a full restart of a cluster with strong-consistency enabled namespaces

Problem Description

A cluster whose namespaces are configured with strong-consistency is fully restarted. After the restart, all clients report the following error:

com.aerospike.client.AerospikeException$InvalidNode: Error -3,1,0,30000,0,2: Node not found for partition namespace1:315

There are no connectivity errors or failures within the client tend thread.

Explanation

‘Node not found for partition’ is a very specific client error and occurs in the following situations:

  1. There is no entry for the specified partition in the client partition map.

  2. The regime returned by a node is lower than the regime stored for the partition on the client, which causes the client to reject the partition ownership claim from that node.

In the first case, the most common cause is a connectivity error between the client and the nodes within the cluster when the client starts and is building the partition map from scratch. This prevents the client from tending those nodes, so it cannot retrieve the list of partitions those nodes own. The net result is a gap in the partition map, which can result in Node not found for partition when transactions for those partitions are attempted.
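The first case can be illustrated with a minimal sketch. This is a simplified model, not the actual client implementation: the function names, the node names, and the even split of the 4096 partitions across two nodes are all assumptions made for illustration.

```python
# Simplified model of case 1 (hypothetical names, not the real client code).

def build_partition_map(nodes, reachable):
    """Build {partition_id: node_name} from the nodes the client can tend."""
    partition_map = {}
    for name, owned in nodes.items():
        if name not in reachable:
            continue  # tend failed: this node's partitions are never learned
        for pid in owned:
            partition_map[pid] = name
    return partition_map


def node_for_partition(partition_map, namespace, pid):
    """Return the owning node, or fail the way the client error reads."""
    if pid not in partition_map:
        raise LookupError(f"Node not found for partition {namespace}:{pid}")
    return partition_map[pid]


# Two hypothetical nodes sharing the 4096 partitions of a namespace.
nodes = {"node-A": range(0, 2048), "node-B": range(2048, 4096)}

# The client can only reach node-A while building its map.
pmap = build_partition_map(nodes, reachable={"node-A"})

print(node_for_partition(pmap, "namespace1", 315))  # owned by node-A
try:
    node_for_partition(pmap, "namespace1", 3000)    # owned by unreachable node-B
except LookupError as err:
    print(err)
```

Only partitions owned by the unreachable node fail, which is why this case normally appears together with connectivity errors in the client tend thread.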

In the second case, connectivity is fine. The client throws Node not found for partition because the regime it holds for the partition does not match the regime the node is presenting.

The regime is shown in the server logs when a namespace rebalances:

{ns_name} rebalanced: regime 295 expected-migrations (826,826,826) expected-appeals 0 unavailable-partitions 425
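When comparing regimes before and after a restart, it can help to pull the values out of these log lines programmatically. The following is a minimal sketch that assumes the line layout is exactly as shown above; the regular expression and the `parse_rebalance` helper are hypothetical and not part of any Aerospike tool.

```python
import re

# Pattern written against the sample line in this article (an assumption:
# other server versions may format the line differently).
REBALANCE_RE = re.compile(
    r"\{(?P<ns>[^}]+)\} rebalanced: regime (?P<regime>\d+) "
    r"expected-migrations \((?P<migrations>[\d,]+)\) "
    r"expected-appeals (?P<appeals>\d+) "
    r"unavailable-partitions (?P<unavailable>\d+)"
)


def parse_rebalance(line):
    """Return the fields of a 'rebalanced' log line, or None if no match."""
    match = REBALANCE_RE.search(line)
    if match is None:
        return None
    return {
        "namespace": match.group("ns"),
        "regime": int(match.group("regime")),
        "expected_migrations": tuple(
            int(n) for n in match.group("migrations").split(",")
        ),
        "expected_appeals": int(match.group("appeals")),
        "unavailable_partitions": int(match.group("unavailable")),
    }


sample = (
    "{ns_name} rebalanced: regime 295 expected-migrations (826,826,826) "
    "expected-appeals 0 unavailable-partitions 425"
)
info = parse_rebalance(sample)
print(info["regime"], info["unavailable_partitions"])
```

A regime that is much lower after the restart than in earlier log lines is consistent with the situation described in this article.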

In the situation under consideration here, the regime was reset when the cluster restarted, because the storage was reset.

The Aerospike Smart Client holds a value for the current regime. The client has not restarted, so it holds a higher regime than the cluster is advertising. When the client tends, it will not accept an ownership claim from a node with a lower regime than the one it holds, so a gap opens in the partition map. As with a failed tend, when a transaction is attempted, no node is found for that partition in the partition map.
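The sequence described above can be modelled in a few lines. This is a toy model under stated assumptions, not the Smart Client's real data structures: the `ClientPartitionView` class, its method names, the node name, and the post-restart regime value of 1 are all hypothetical.

```python
class ClientPartitionView:
    """Toy model of what one client remembers about each partition."""

    def __init__(self):
        self.regime = {}  # partition id -> highest regime accepted so far
        self.owner = {}   # partition id -> node currently trusted as owner

    def on_ownership_claim(self, pid, node, regime):
        """Apply a claim seen during a tend; return True if accepted."""
        if regime < self.regime.get(pid, 0):
            return False  # lower regime than the one held: claim rejected
        self.regime[pid] = regime
        self.owner[pid] = node
        return True

    def on_node_removed(self, node):
        """Forget ownership by a departed node but keep the stored regime."""
        self.owner = {p: n for p, n in self.owner.items() if n != node}

    def node_for(self, namespace, pid):
        if pid not in self.owner:
            raise LookupError(f"Node not found for partition {namespace}:{pid}")
        return self.owner[pid]


# A client that stays up across the cluster restart.
long_running = ClientPartitionView()
long_running.on_ownership_claim(315, "node-A", regime=295)  # before restart
long_running.on_node_removed("node-A")                      # cluster goes down
accepted = long_running.on_ownership_claim(315, "node-A", regime=1)
print(accepted)  # the lower regime is rejected
try:
    long_running.node_for("namespace1", 315)
except LookupError as err:
    print(err)

# A view with no stored regime accepts the same claim.
fresh = ClientPartitionView()
print(fresh.on_ownership_claim(315, "node-A", regime=1))
```

In this model the rejected claim leaves partition 315 without an owner even though the node is reachable, which matches the absence of connectivity errors in the tend thread.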

It is important to understand that this is not a bug or an error. This is the Aerospike strong consistency guarantee in action. If the client accepted an ownership claim with an old regime for a partition, it could get a stale read, which would violate session-level consistency.

In this instance, allowing the transaction would be extremely dangerous, as the partitions, though empty, are all available. The client uses the regime to detect this and behave appropriately.

Solution

If the restart of the cluster is understood and the correct actions have been taken, the client can be made to reconnect by restarting the client instance. This resets the regime on the client and allows it to rebuild the partition map when it tends. It is important not to do this without examining why the cluster restarted and why its storage was reset. Ultimately, this error exists to protect consistency and should not be overridden without consideration.

Notes

  • This article discusses Node Not Found For Partition in strong consistency
  • This article discusses Node Not Found For Partition in a more general sense.
  • The client will not accept an ownership claim from a node with a lower regime; a higher regime, however, is not an issue.

Applies To

Aerospike 4.0 and later.

Keywords

STRONG CONSISTENCY NODE NOT FOUND PARTITION REGIME

Timestamp

January 2022