Hello. The database files (*.data) in /opt/aerospike/data/ were deleted on one of several servers. After restarting Aerospike with the data files gone, the cluster did not rebalance: one node is nearly empty while the others still hold their data. How can I fix this?
Can you show the output for the following?
asadm -e info
asadm -e 'show stat like migrat'
asadm -e 'show config like migrat'
asadm -e info
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Network Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Node Ip Build Cluster Cluster Cluster Principal Client Uptime
. Id . . Size Key Integrity . Conns .
10.4.0.18:3000 *BB9B1B634671E00 10.4.0.18:3000 C-3.14.1.3 8 2B4EF4839140 True BB9B1B634671E00 1581 7404:56:39
10.4.0.19:3000 BB93DAD34671E00 10.4.0.19:3000 C-3.14.1.3 8 2B4EF4839140 True BB9B1B634671E00 1680 11100:30:51
10.4.0.20:3000 BB965B234671E00 10.4.0.20:3000 C-3.14.1.3 8 2B4EF4839140 True BB9B1B634671E00 1758 11100:30:49
10.4.0.21:3000 BB90AB334671E00 10.4.0.21:3000 C-3.14.1.3 8 2B4EF4839140 True BB9B1B634671E00 1736 11100:30:48
10.4.0.22:3000 BB997E443671E00 10.4.0.22:3000 C-3.14.1.3 8 2B4EF4839140 True BB9B1B634671E00 1361 11100:22:30
10.4.0.23:3000 BB994E243671E00 10.4.0.23:3000 C-3.14.1.3 8 2B4EF4839140 True BB9B1B634671E00 1849 11100:22:31
10.4.0.25:3000 BB95BAD34671E00 10.4.0.25:3000 C-3.14.1.3 8 2B4EF4839140 True BB9B1B634671E00 1400 11100:22:35
localhost-int.localdomain:3000 BB946B334671E00 10.4.0.24:3000 C-3.14.1.3 8 2B4EF4839140 True BB9B1B634671E00 1250 11:43:56
Number of rows: 8
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Namespace Node Avail% Evictions Master Replica Repl Stop Pending Disk Disk HWM Mem Mem HWM Stop
. . . . (Objects,Tombstones) (Objects,Tombstones) Factor Writes Migrates Used Used% Disk% Used Used% Mem% Writes%
. . . . . . . . (tx,rx) . . . . . . .
visitor 10.4.0.18:3000 36 0.000 (291.530 M, 0.000) (0.000, 0.000) 1 false (0.000, 0.000) 75.328 GB 45 80 41.200 GB 59 80 90
visitor 10.4.0.19:3000 34 0.000 (296.671 M, 0.000) (0.000, 0.000) 1 false (0.000, 0.000) 76.655 GB 46 80 41.926 GB 60 80 90
visitor 10.4.0.20:3000 39 0.000 (277.788 M, 0.000) (0.000, 0.000) 1 false (0.000, 0.000) 71.782 GB 43 80 39.263 GB 56 80 90
visitor 10.4.0.21:3000 33 0.000 (303.524 M, 0.000) (0.000, 0.000) 1 false (0.000, 0.000) 78.426 GB 47 80 42.895 GB 61 80 90
visitor 10.4.0.22:3000 37 0.000 (284.076 M, 0.000) (0.000, 0.000) 1 false (0.000, 0.000) 73.401 GB 44 80 40.146 GB 57 80 90
visitor 10.4.0.23:3000 32 0.000 (307.543 M, 0.000) (0.000, 0.000) 1 false (0.000, 0.000) 79.468 GB 47 80 43.466 GB 62 80 90
visitor 10.4.0.25:3000 37 0.000 (287.522 M, 0.000) (0.000, 0.000) 1 false (0.000, 0.000) 74.296 GB 44 80 40.637 GB 58 80 90
visitor localhost-int.localdomain:3000 89 0.000 (70.626 M, 0.000) (0.000, 0.000) 1 false (0.000, 0.000) 17.036 GB 11 80 8.722 GB 13 80 90
visitor 0.000 (2.119 B, 0.000) (0.000, 0.000) (0.000, 0.000) 546.391 GB 298.255 GB
asadm -e 'show stat like migrat'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE : 10.4.0.18:3000 10.4.0.19:3000 10.4.0.20:3000 10.4.0.21:3000 10.4.0.22:3000 10.4.0.23:3000 10.4.0.25:3000 localhost-int.localdomain:3000
migrate_allowed : true true true true true true true true
migrate_partitions_remaining: 0 0 0 0 0 0 0 0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~visitor Namespace Statistics~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE : 10.4.0.18:3000 10.4.0.19:3000 10.4.0.20:3000 10.4.0.21:3000 10.4.0.22:3000 10.4.0.23:3000 10.4.0.25:3000 localhost-int.localdomain:3000
migrate-order : 5 5 5 5 5 5 5 5
migrate-retransmit-ms : 5000 5000 5000 5000 5000 5000 5000 5000
migrate-sleep : 50 50 50 50 50 50 50 50
migrate_record_receives : 8007 30 19 26 22 57 29 438
migrate_record_retransmits : 0 0 0 0 0 0 0 0
migrate_records_skipped : 0 0 0 0 0 0 0 0
migrate_records_transmitted : 645877 586021 466896 474857 557671 576725 437672 0
migrate_rx_instances : 0 0 0 0 0 0 0 0
migrate_rx_partitions_active : 0 0 0 0 0 0 0 0
migrate_rx_partitions_initial : 0 0 0 0 0 0 0 512
migrate_rx_partitions_remaining: 0 0 0 0 0 0 0 0
migrate_signals_active : 0 0 0 0 0 0 0 0
migrate_signals_remaining : 0 0 0 0 0 0 0 0
migrate_tx_instances : 0 0 0 0 0 0 0 0
migrate_tx_partitions_active : 0 0 0 0 0 0 0 0
migrate_tx_partitions_imbalance: 0 0 0 0 0 0 0 0
migrate_tx_partitions_initial : 88 80 64 65 76 79 60 0
migrate_tx_partitions_remaining: 0 0 0 0 0 0 0 0
asadm -e 'show config like migrat'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Configuration~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE : 10.4.0.18:3000 10.4.0.19:3000 10.4.0.20:3000 10.4.0.21:3000 10.4.0.22:3000 10.4.0.23:3000 10.4.0.25:3000 localhost-int.localdomain:3000
migrate-max-num-incoming: 4 4 4 4 4 4 4 4
migrate-threads : 4 4 4 4 4 4 4 4
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~visitor Namespace Configuration~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE : 10.4.0.18:3000 10.4.0.19:3000 10.4.0.20:3000 10.4.0.21:3000 10.4.0.22:3000 10.4.0.23:3000 10.4.0.25:3000 localhost-int.localdomain:3000
migrate-order : 5 5 5 5 5 5 5 5
migrate-retransmit-ms: 5000 5000 5000 5000 5000 5000 5000 5000
migrate-sleep : 50 50 50 50 50 50 50 50
You are running with replication-factor 1, so there is only a single copy of your data in the cluster. When the data files were deleted, you lost the only copy of many partitions of data which, unless you have a backup, cannot be recovered.
The partitions are still deterministically balanced as usual, but the lost partitions are nearly empty, which is why one of your nodes holds only a small fraction of the records the others do.
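To avoid repeating this failure mode, you could raise the replication factor so a second copy of each partition exists, at the cost of roughly doubling disk and memory usage. A hedged sketch of the relevant aerospike.conf stanza (the namespace name `visitor` comes from the output above; all sizes and the file path are illustrative, and `replication-factor` is static, so a rolling restart is needed):

```
# Excerpt from /etc/aerospike/aerospike.conf -- values are illustrative
namespace visitor {
    # With replication-factor 2 each partition has a master and one
    # replica, so losing a single node's disk no longer loses data.
    replication-factor 2
    memory-size 96G              # must accommodate roughly double the data
    storage-engine device {
        file /opt/aerospike/data/visitor.data
        filesize 200G
    }
}
```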
So there is no way to make the other servers share data with the empty server so that all nodes hold roughly the same amount, the way it happens when a new node is added?
Maybe a full dump, remove, and restore? Or change the server's IP and hostname and add it to the cluster again as a new node?
Aerospike distributes records over partitions based on a hashing algorithm. A backup/restore would be ineffective at redistributing records, since each record will hash to the same partition again.
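To illustrate why restored records land back in the same partitions: Aerospike derives a partition ID from bits of the record's key digest, and the cluster has a fixed count of 4096 partitions. Aerospike actually uses RIPEMD-160 for the digest; this sketch substitutes SHA-1 (RIPEMD-160 is not always available in Python's hashlib), but the deterministic mapping it demonstrates is the same:

```python
import hashlib

N_PARTITIONS = 4096  # Aerospike's fixed partition count

def partition_id(set_name: str, key: str) -> int:
    # Aerospike hashes the (set, key) pair and takes 12 bits of the
    # digest as the partition ID; SHA-1 stands in for RIPEMD-160 here.
    digest = hashlib.sha1(f"{set_name}:{key}".encode()).digest()
    # Low 12 bits of the first two digest bytes -> partition 0..4095.
    return (digest[0] | (digest[1] << 8)) & (N_PARTITIONS - 1)

# The mapping is deterministic: restoring the same key always yields
# the same partition, so a backup/restore cannot rebalance the cluster.
pid = partition_id("users", "user42")
assert pid == partition_id("users", "user42")
assert 0 <= pid < N_PARTITIONS
```

Because the partition map only changes when cluster membership (the set of node-ids) changes, shuffling data requires changing a node's identity, not its contents.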
You can cause the partition distribution to shuffle by changing the node-id (requires Aerospike 3.16.0.1+).
To change the node-id, change the fabric port on one of the nodes (see https://www.aerospike.com/docs/reference/configuration/index.html#port). Just specify a new fabric port on the node with the imbalance; the other nodes will start communicating with it on that new port, so no change is required on the other nodes. A restart of that node is required because the config item is static.
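A hedged sketch of what that change could look like on the imbalanced node only (the port number 3002 is an assumption; any free port different from the current fabric port of 3001 would do):

```
# /etc/aerospike/aerospike.conf on the imbalanced node only
network {
    fabric {
        port 3002    # changed from the default 3001; yields a new node-id
    }
}
```

After editing, restart the Aerospike daemon on that node and confirm with `asadm -e info` that its Node Id has changed and that migrations start, e.g. via `asadm -e 'show stat like migrat'`.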