Rebalance cluster after data deletion


#1

Hello. The database files (*.data) in /opt/aerospike/data/ were deleted on one of several servers. After starting Aerospike with the data files deleted, the cluster did not rebalance: one node is empty, while the others still hold their data. How can I fix this?


#2

Can you show the output for the following?

asadm -e info
asadm -e 'show stat like migrat'
asadm -e 'show config like migrat'

#3
asadm -e info

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Network Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                          Node               Node               Ip        Build   Cluster        Cluster     Cluster         Principal   Client        Uptime   
                             .                 Id                .            .      Size            Key   Integrity                 .    Conns             .   
10.4.0.18:3000                   *BB9B1B634671E00   10.4.0.18:3000   C-3.14.1.3         8   2B4EF4839140   True        BB9B1B634671E00     1581   7404:56:39    
10.4.0.19:3000                   BB93DAD34671E00    10.4.0.19:3000   C-3.14.1.3         8   2B4EF4839140   True        BB9B1B634671E00     1680   11100:30:51   
10.4.0.20:3000                   BB965B234671E00    10.4.0.20:3000   C-3.14.1.3         8   2B4EF4839140   True        BB9B1B634671E00     1758   11100:30:49   
10.4.0.21:3000                   BB90AB334671E00    10.4.0.21:3000   C-3.14.1.3         8   2B4EF4839140   True        BB9B1B634671E00     1736   11100:30:48   
10.4.0.22:3000                   BB997E443671E00    10.4.0.22:3000   C-3.14.1.3         8   2B4EF4839140   True        BB9B1B634671E00     1361   11100:22:30   
10.4.0.23:3000                   BB994E243671E00    10.4.0.23:3000   C-3.14.1.3         8   2B4EF4839140   True        BB9B1B634671E00     1849   11100:22:31   
10.4.0.25:3000                   BB95BAD34671E00    10.4.0.25:3000   C-3.14.1.3         8   2B4EF4839140   True        BB9B1B634671E00     1400   11100:22:35   
localhost-int.localdomain:3000   BB946B334671E00    10.4.0.24:3000   C-3.14.1.3         8   2B4EF4839140   True        BB9B1B634671E00     1250   11:43:56      
Number of rows: 8

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Namespace                             Node   Avail%   Evictions                 Master                Replica     Repl     Stop             Pending         Disk    Disk     HWM          Mem     Mem    HWM      Stop   
          .                                .        .           .   (Objects,Tombstones)   (Objects,Tombstones)   Factor   Writes            Migrates         Used   Used%   Disk%         Used   Used%   Mem%   Writes%   
          .                                .        .           .                      .                      .        .        .             (tx,rx)            .       .       .            .       .      .         .   
visitor       10.4.0.18:3000                   36         0.000     (291.530 M, 0.000)     (0.000,  0.000)        1        false    (0.000,  0.000)      75.328 GB   45      80       41.200 GB   59      80     90        
visitor       10.4.0.19:3000                   34         0.000     (296.671 M, 0.000)     (0.000,  0.000)        1        false    (0.000,  0.000)      76.655 GB   46      80       41.926 GB   60      80     90        
visitor       10.4.0.20:3000                   39         0.000     (277.788 M, 0.000)     (0.000,  0.000)        1        false    (0.000,  0.000)      71.782 GB   43      80       39.263 GB   56      80     90        
visitor       10.4.0.21:3000                   33         0.000     (303.524 M, 0.000)     (0.000,  0.000)        1        false    (0.000,  0.000)      78.426 GB   47      80       42.895 GB   61      80     90        
visitor       10.4.0.22:3000                   37         0.000     (284.076 M, 0.000)     (0.000,  0.000)        1        false    (0.000,  0.000)      73.401 GB   44      80       40.146 GB   57      80     90        
visitor       10.4.0.23:3000                   32         0.000     (307.543 M, 0.000)     (0.000,  0.000)        1        false    (0.000,  0.000)      79.468 GB   47      80       43.466 GB   62      80     90        
visitor       10.4.0.25:3000                   37         0.000     (287.522 M, 0.000)     (0.000,  0.000)        1        false    (0.000,  0.000)      74.296 GB   44      80       40.637 GB   58      80     90        
visitor       localhost-int.localdomain:3000   89         0.000     (70.626 M, 0.000)      (0.000,  0.000)        1        false    (0.000,  0.000)      17.036 GB   11      80        8.722 GB   13      80     90        
visitor                                                   0.000     (2.119 B, 0.000)       (0.000,  0.000)                          (0.000,  0.000)     546.391 GB                   298.255 GB                            



asadm -e 'show stat like migrat'

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                        :   10.4.0.18:3000   10.4.0.19:3000   10.4.0.20:3000   10.4.0.21:3000   10.4.0.22:3000   10.4.0.23:3000   10.4.0.25:3000   localhost-int.localdomain:3000   
migrate_allowed             :   true             true             true             true             true             true             true             true                             
migrate_partitions_remaining:   0                0                0                0                0                0                0                0                                

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~visitor Namespace Statistics~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                           :   10.4.0.18:3000   10.4.0.19:3000   10.4.0.20:3000   10.4.0.21:3000   10.4.0.22:3000   10.4.0.23:3000   10.4.0.25:3000   localhost-int.localdomain:3000   
migrate-order                  :   5                5                5                5                5                5                5                5                                
migrate-retransmit-ms          :   5000             5000             5000             5000             5000             5000             5000             5000                             
migrate-sleep                  :   50               50               50               50               50               50               50               50                               
migrate_record_receives        :   8007             30               19               26               22               57               29               438                              
migrate_record_retransmits     :   0                0                0                0                0                0                0                0                                
migrate_records_skipped        :   0                0                0                0                0                0                0                0                                
migrate_records_transmitted    :   645877           586021           466896           474857           557671           576725           437672           0                                
migrate_rx_instances           :   0                0                0                0                0                0                0                0                                
migrate_rx_partitions_active   :   0                0                0                0                0                0                0                0                                
migrate_rx_partitions_initial  :   0                0                0                0                0                0                0                512                              
migrate_rx_partitions_remaining:   0                0                0                0                0                0                0                0                                
migrate_signals_active         :   0                0                0                0                0                0                0                0                                
migrate_signals_remaining      :   0                0                0                0                0                0                0                0                                
migrate_tx_instances           :   0                0                0                0                0                0                0                0                                
migrate_tx_partitions_active   :   0                0                0                0                0                0                0                0                                
migrate_tx_partitions_imbalance:   0                0                0                0                0                0                0                0                                
migrate_tx_partitions_initial  :   88               80               64               65               76               79               60               0                                
migrate_tx_partitions_remaining:   0                0                0                0                0                0                0                0                                



asadm -e 'show config like migrat'

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Configuration~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                    :   10.4.0.18:3000   10.4.0.19:3000   10.4.0.20:3000   10.4.0.21:3000   10.4.0.22:3000   10.4.0.23:3000   10.4.0.25:3000   localhost-int.localdomain:3000   
migrate-max-num-incoming:   4                4                4                4                4                4                4                4                                
migrate-threads         :   4                4                4                4                4                4                4                4                                

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~visitor Namespace Configuration~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                 :   10.4.0.18:3000   10.4.0.19:3000   10.4.0.20:3000   10.4.0.21:3000   10.4.0.22:3000   10.4.0.23:3000   10.4.0.25:3000   localhost-int.localdomain:3000   
migrate-order        :   5                5                5                5                5                5                5                5                                
migrate-retransmit-ms:   5000             5000             5000             5000             5000             5000             5000             5000                             
migrate-sleep        :   50               50               50               50               50               50               50               50

#4

You are running with replication-factor 1, therefore there is only a single copy of your data in the cluster. When the data files were deleted, you lost the only copy of many partitions of data which, unless you have a backup, cannot be recovered.

The partitions are deterministically balanced as usual, but the lost partitions are nearly empty, which is why one of your nodes has a small fraction of the records the others hold.
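
To make "deterministically balanced" concrete, here is a minimal sketch in Python. It is not Aerospike's actual partition algorithm: it uses rendezvous hashing with SHA-1 as a stand-in, and the `owner` and `partition_map` helpers are hypothetical. The node IDs are the ones from the `asadm -e info` output above, and 4096 is the number of partitions per namespace. The point is only that ownership depends on node IDs and partition IDs, not on how much data each partition holds.

```python
import hashlib
from collections import Counter

N_PARTITIONS = 4096  # partitions per Aerospike namespace

# Node IDs copied from the `asadm -e info` output in this thread.
NODES = [
    "BB9B1B634671E00", "BB93DAD34671E00", "BB965B234671E00", "BB90AB334671E00",
    "BB997E443671E00", "BB994E243671E00", "BB95BAD34671E00",
    "BB946B334671E00",  # 10.4.0.24, the node whose data files were deleted
]

def owner(partition_id, node_ids):
    """Stand-in ownership rule: the node with the highest hash score wins."""
    def score(node_id):
        digest = hashlib.sha1(f"{node_id}:{partition_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(node_ids, key=score)

def partition_map(node_ids):
    """Map every partition ID to its owning node."""
    return {pid: owner(pid, node_ids) for pid in range(N_PARTITIONS)}

# The map is a pure function of the node IDs, so it is identical before and
# after the data files were deleted.
map_before_wipe = partition_map(NODES)
map_after_wipe = partition_map(NODES)

counts = Counter(map_after_wipe.values())
for node_id in NODES:
    print(node_id, counts[node_id])  # each node owns roughly 4096 / 8 partitions
```

With replication-factor 1, the wiped node comes back as the sole owner of the same partitions as before, only now they are empty. That is consistent with `migrate_rx_partitions_initial` being 512 (4096 / 8) on that node in the output above.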


#5

So there is no way to make the servers share data with the empty server so that all nodes hold the same amount, as happens when a new node is added?

Maybe a full dump, remove, and restore? Or change the server's IP and name and add it to the cluster again as a new node?


#6

Aerospike distributes records over partitions based on a hashing algorithm. A backup/restore would be ineffective in redistributing records since the records will hash to the same partitions again.
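
As a rough illustration of why backup/restore does not help, here is a small Python sketch. The `partition_id` helper is hypothetical, and SHA-1 stands in for the digest the server really uses (RIPEMD-160, as far as I know), so the partition numbers will not match a real cluster. What carries over is that the partition is a pure function of the set name and key.

```python
import hashlib

N_PARTITIONS = 4096  # partitions per Aerospike namespace

def partition_id(set_name, user_key):
    """Stand-in for the server's key digest: 12 bits of a hash pick the partition."""
    digest = hashlib.sha1(f"{set_name}:{user_key}".encode()).digest()
    return int.from_bytes(digest[:2], "little") & (N_PARTITIONS - 1)

# Hypothetical records; the set name "visitors" is made up for this example.
records = {f"user{i}": {"visits": i} for i in range(1000)}

placement_before = {key: partition_id("visitors", key) for key in records}

# "Backup" the records, wipe them, and "restore" them.
backup = dict(records)
records.clear()
records.update(backup)

placement_after = {key: partition_id("visitors", key) for key in records}

# Every restored record lands in exactly the partition it came from.
print(placement_before == placement_after)
```

Since the partition-to-node map is unchanged as well, each restored record ends up on the same node it was on before the backup.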

You can cause the partition distribution to shuffle by changing the node-id (requires Aerospike 3.16.0.1+).
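
A toy simulation of that shuffle, under loud assumptions: rendezvous hashing with SHA-1 stands in for Aerospike's real partition algorithm, the `owner` helper is hypothetical, and the new node-id `BB946B334671E01` is invented for the example. The other node IDs come from the `asadm -e info` output above.

```python
import hashlib
from collections import Counter

N_PARTITIONS = 4096  # partitions per Aerospike namespace

def owner(partition_id, node_ids):
    """Stand-in ownership rule: the node with the highest hash score wins."""
    def score(node_id):
        digest = hashlib.sha1(f"{node_id}:{partition_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(node_ids, key=score)

nodes = [
    "BB9B1B634671E00", "BB93DAD34671E00", "BB965B234671E00", "BB90AB334671E00",
    "BB997E443671E00", "BB994E243671E00", "BB95BAD34671E00", "BB946B334671E00",
]
wiped_id = "BB946B334671E00"  # 10.4.0.24, the node that lost its data files
new_id = "BB946B334671E01"    # hypothetical node-id after the change

old_map = {pid: owner(pid, nodes) for pid in range(N_PARTITIONS)}
new_nodes = [new_id if n == wiped_id else n for n in nodes]
new_map = {pid: owner(pid, new_nodes) for pid in range(N_PARTITIONS)}

# Partitions the wiped node owned (the nearly empty ones).
empty = {pid for pid, n in old_map.items() if n == wiped_id}

# Where the nearly empty partitions end up after the node-id change.
empty_now = Counter(new_map[pid] for pid in empty)

# Populated partitions the renamed node takes over from its peers.
gained_full = sum(
    1 for pid, n in new_map.items() if n == new_id and pid not in empty
)

print("empty partitions spread over", len(empty_now), "nodes")
print("populated partitions gained by the renamed node:", gained_full)
```

In this model the nearly empty partitions get spread across all nodes and the renamed node takes over populated partitions from its peers, so record counts even out once migrations finish. The lost records themselves are still gone.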


#7

To change the node-id, just change the fabric port on one of the nodes: https://www.aerospike.com/docs/reference/configuration/index.html#port. Specify a new fabric port on the node with the imbalance, and the other nodes will start communicating with it on that new port; no change is required on the other nodes. A restart of that node is required because the config item is static.