How do Queries on Secondary Indexes Behave while Migrations are In-Progress?

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

(As of version 3.7.0.1, qNodes are deprecated. Queries now go to the nodes that are the master for each partition, and return results to the client.)

When an Aerospike smart client issues a secondary index query, the query runs in parallel against each node in the cluster. It waits for all nodes in the cluster to return results. Each node must process the query against the partitions that it owns. Because each partition is replicated according to the replication factor, the database must ensure each partition is processed once and only once across the entire cluster. The notion of ‘qNode’ is introduced to achieve this. The qNode for a given partition is the node that holds the most records for that partition across the cluster. If all replicas of a partition hold the same number of records (which is always the case, except during migrations), the master node is the qNode for that partition. When a node processes a query, it only processes the qNode partitions.

Inconsistencies in query results may happen for a partition when its qNode flips from one node to another. The qNode flipping window is very small, but it happens every time a partition finishes migrating to another node which is master for it. For example, when an empty node joins a cluster, it is assigned partitions for which it is master. The partition is initially empty and must receive data from the other replica partitions in the cluster. When data migration is complete, and it has received all data, it becomes the qNode for that partition.

qNode flipping is not instantaneous. There are delays in node communications, and client requests do not reach all nodes simultaneously. Due to these gaps, there are small windows when nodes are migrating partitions, when a query could return duplicate records or no records from a node. There could be a small window where there are two qNodes for the same partition yielding duplicates for records matching the query for that partition, or when a partition does not have a qnode, yielding to missing records that would have matched the query for that partition.

We cannot further quantify the possibility of either multiple results from one qNode, or a partition that has no qNode and returns no results. We can only say that during migrations, there are up to 4096 (number of partitions) small windows in which a query might return duplicate records or might omit some records. The windows where this might happen are very small, and only occur in very specific circumstances during migrations.

Let’s illustrate with an example. For the sake of discussion, we will assume that we have a three-node cluster, and we are adding an empty 4th node. We will look at partition #1001.

A node joins the cluster:

  1. Node_1 is the master for partition #1001, Node_2 is empty, and Node_3 is the replica. Node_4 joins the cluster.
  2. Node_4 is designated as the master for partition #1001.
  3. The partition on Node_4 is the master, but the master is empty at this point. Node_3 becomes the qNode for partition #1001 because it contains the most data.
  4. The data for partition #1001 migrate to Node_4.
  5. When data migrations to Node_4 complete, Node_4 sends a ‘migration_end’ message to Node_3.
  6. Node_3 ceases to be qNode for partition #1001. Node_4 assumes qNode status for partition #1001 because it contains as much data as Node_3, and it is the master partition.

The following scenarios describe times when a query might return inaccurate results when a node is joining the cluster:

A query on a secondary index returns no results from a partition:

  1. Node_1 is the master, Node_2 is empty, and Node_3 is the replica. Node_4 joins the cluster.
  2. Node_4 is designated as the master for partition #1001.
  3. The partition on Node_4 is the master, but the master is empty. Node_3 becomes the qNode for partition #1001 because it contains the most data.
  4. The data for partition #1001 migrate to Node_4.
  5. When data migrations to Node_4 complete, Node_4 sends a ‘migration_end’ message to Node_3.
  6. Node_3 relinquishes its qNode status for partition #1001.
  7. A query runs and it goes to the Node_3 and Node_4 (and the other nodes) to be processed by all partitions for which they claim qNode status. For partition #1001, neither Node_3 nor Node_4 has yet declared itself the qNode, so neither node returns query results.
  8. Node_4 claims status of qNode. But the query has already been processed.

A query on a secondary index returns duplicate results from a partition:

  1. Node_1 is the master, Node_2 is empty, and Node_3 is the replica. Node_4 joins the cluster.
  2. Node_4 is designated as the master for partition #1001.
  3. Node_4 is the master for partition #1001, but the master is empty. The Node_3 becomes the qNode because the replica partition contains the most data.
  4. The data migrate to Node_4.
  5. When data migrations to Node_4 complete, Node_4 sends a ‘migration_end’ message to Node_3.
  6. A query runs and it goes to Node_3 to be processed by all partitions for which it claims qNode status. Node_3 is still the qNode for partition #1001. The query returns results from Node_3.
  7. Node_3 gives up its qNode status for partition #1001, and Node_4 takes over as qNode.
  8. The query comes to Node_4 after it has assumed qNode status for partition #1001. The query returns data for partition #1001 which was already processed in Node_3.

Can someone explain what would be the behaviour post 3.7.0.1? I had started a separate topic more than a year back, but received no response.

You can read about Secondary Index architecture here: https://www.aerospike.com/docs/architecture/secondary-index.html#index-management