Why replica count might appear to drop suddenly during migrations


This behavior is not expected for server versions 3.13 and above that are on the new clustering protocol (paxos-protocol v5) which enhances the clustering algorithm by not dropping the replica partitions until it is synchronized.

Why replica record count might appear to drop suddenly during migrations

Issue: During migration replica record count appears to change when node is viewed via:

asadm -e info


When nodes are migrating partitions, the replica record count drops, sometimes severely, an extreme example is shown below.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        Node    Namespace   Avail%   Evictions    Master    Replica     Repl     Stop    Migrates       Disk    Disk     HWM          Mem     Mem    HWM      Stop   
           .            .        .           .   Objects    Objects   Factor   Writes     (tx,rx)       Used   Used%   Disk%         Used   Used%   Mem%   Writes%   
aero-10:3000   persistent       32   206131141   1.754 G    0.000          2   false    (N/E,N/E)   1.841 TB      48      75   210.635 GB      61     65        90   
aero-11:3000   persistent       37           0   1.646 G    0.000          2   false    (N/E,N/E)   1.724 TB      45      75   197.253 GB      57     65        90   
aero-12:3000   persistent       35           0   1.821 G    0.000          2   false    (N/E,N/E)   1.767 TB      46      75   202.210 GB      58     65        90   


This is expected behaviour. The objects of desynchronised replica partitions are not counted until they become synchronised. This means that if a partition is not the master or acting master and has an inbound migration, the objects in that partition will not be counted, and so we may see a sharp drop in replica record numbers. This can be particularly obvious when a specific node has a high number of outbound migrations. In that scenario the other nodes in the cluster would have a high number of scheduled inbound migrations and the replica objects would not be counted.

This is normal behaviour. The replica record count per node will stabilise once migrations are completed.

Replicas invalidated after restart