Hi Kevin, thanks for your patient reply.
I tried dun and a rolling restart of all the nodes, then waited more than 12 hours. The migration still cannot complete; even the empty namespaces cannot finish. One node also seems unable to replicate objects and has 1000+ connections. Do you have any suggestions?
Namespace   Node              Avail%   Evictions   Master     Replica    Repl     Stop     Pending     Disk         Disk    HWM     Mem          Mem     HWM    Stop
                                                   Objects    Objects    Factor   Writes   Migrates    Used         Used%   Disk%   Used         Used%   Mem%   Writes%
                                                                                           (tx%,rx%)
product     10.17.1.11:3000   85       0.000       5.536 M    4.504 M    2        false    (35,14)     111.191 GB   14      90      665.058 MB   22      90     90
product     10.17.1.12:3000   85       0.000       5.508 M    5.410 M    2        false    (63,1)      111.503 GB   14      90      666.769 MB   22      90     90
product     10.17.1.26:3000   86       0.000       5.197 M    5.461 M    2        false    (2,3)       108.752 GB   14      90      650.477 MB   22      90     90
product     10.17.1.7:3000    85       0.000       5.532 M    4.572 M    2        false    (14,13)     113.009 GB   15      90      676.153 MB   23      90     90
product                                0.000       21.773 M   19.947 M                     (28,8)      444.455 GB                   2.596 GB
staging     10.17.1.11:3000   99       0.000       0.000      0.000      1        false    (0,42)      0.000 B      0       50      0.000 B      0       60     90
staging     10.17.1.12:3000   99       0.000       0.000      0.000      1        false    (18,0)      0.000 B      0       50      0.000 B      0       60     90
staging     10.17.1.26:3000   99       0.000       0.000      0.000      1        false    (0,39)      0.000 B      0       50      0.000 B      0       60     90
staging     10.17.1.7:3000    99       0.000       0.000      0.000      1        false    (52,44)     0.000 B      0       50      0.000 B      0       60     90
staging                                0.000       0.000      0.000                        (10,35)     0.000 B                      0.000 B
test        10.17.1.11:3000   99       0.000       0.000      0.000      2        false    (0,18)      0.000 B      0       50      0.000 B      0       60     90
test        10.17.1.12:3000   99       0.000       0.000      0.000      2        false    (79,40)     0.000 B      0       50      0.000 B      0       60     90
test        10.17.1.26:3000   99       0.000       0.000      0.000      2        false    (0,16)      0.000 B      0       50      0.000 B      0       60     90
test        10.17.1.7:3000    99       0.000       0.000      0.000      2        false    (0,17)      0.000 B      0       50      0.000 B      0       60     90
test                                   0.000       0.000      0.000                        (16,25)     0.000 B                      0.000 B
Number of rows: 15
In the logs, the same warnings are triggered, and the same happens for the empty namespace test:
Oct 12 2016 03:05:27 GMT: WARNING (partition): (partition.c:1455) {test:4090} emigrate done: failed with error and cluster key is current
Oct 12 2016 03:05:27 GMT: WARNING (migrate): (migrate.c:1002) imbalance: dest refused migrate with ACK_FAIL
Oct 12 2016 03:05:27 GMT: WARNING (partition): (partition.c:1455) {test:4093} emigrate done: failed with error and cluster key is current
Oct 12 2016 03:05:28 GMT: WARNING (migrate): (migrate.c:1002) imbalance: dest refused migrate with ACK_FAIL
Oct 12 2016 03:05:28 GMT: WARNING (partition): (partition.c:1455) {product:2315} emigrate done: failed with error and cluster key is current
Oct 12 2016 03:05:28 GMT: WARNING (migrate): (migrate.c:1002) imbalance: dest refused migrate with ACK_FAIL
Just to be sure, there aren't any warnings other than those coming from partition.c:1455 and migrate.c:1002?
There should be at least one more warning line: those two are both emigrator warnings indicating that the immigrator failed, so there will also be a warning on the immigrator's side.
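To double-check, a grep like the following (a sketch; the log path varies by install, so a stand-in file is used here, seeded from the excerpt above) pulls all migrate/partition warnings from a node's log. Run the same filter on every node, including the receiving side, so the immigrator-side warning is not missed.

```shell
# Stand-in for /var/log/aerospike/aerospike.log, seeded with sample lines
cat > /tmp/aerospike_sample.log <<'EOF'
Oct 12 2016 03:05:27 GMT: WARNING (partition): (partition.c:1455) {test:4090} emigrate done: failed with error and cluster key is current
Oct 12 2016 03:05:27 GMT: WARNING (migrate): (migrate.c:1002) imbalance: dest refused migrate with ACK_FAIL
Oct 12 2016 03:05:27 GMT: INFO (info): (thr_info.c:100) system memory ok
EOF

# Keep only migration-related WARNING lines
grep -E 'WARNING \((migrate|partition)\)' /tmp/aerospike_sample.log
```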
I have fixed this issue. After the following operations, the migration warnings went away.
Fixed a typo in aerospike.conf: in the empty namespace test, the data file name on one node differed from the others.
namespace test {
    replication-factor 2
    memory-size 1G
    default-ttl 30d # 30 days, use 0 to never expire/evict.
    storage-engine device {
        file /aerospike/test/data.bat
        filesize 100G
        data-in-memory false
    }
}
Three nodes were set to file /aerospike/test/data.bat while one node was set to data.dat; I fixed it so all four nodes use the same file name.
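A quick way to catch this class of mismatch is to compare a checksum of aerospike.conf across all nodes; any node whose sum differs has a divergent config. The two local files below merely stand in for configs fetched from two nodes (paths and contents are illustrative, not your real files):

```shell
# Stand-ins for /etc/aerospike/aerospike.conf copied from two nodes
printf 'file /aerospike/test/data.bat\n' > /tmp/node_a.conf
printf 'file /aerospike/test/data.dat\n' > /tmp/node_b.conf

# Identical configs produce identical sums; a differing sum flags the odd node out.
md5sum /tmp/node_a.conf /tmp/node_b.conf
```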
Placed all namespaces in the same order in the conf file on every node.
Set migrate-threads to 8 to speed up migration.
Changed the heartbeat port from 3002 to another port, and set the heartbeat interval to 500 and the timeout to 50.
Backed up the Aerospike raw data without secondary indexes and UDFs.
Deleted the data files, rolling-restarted all nodes, and restored from the raw backup.
Manually re-created the secondary indexes and UDFs.
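For reference, the migrate-threads and heartbeat tuning steps above correspond to stanzas like these in aerospike.conf (a sketch only: port 3012 is a made-up example for "another port than 3002", and interval 500 / timeout 50 are the values quoted above, not the defaults):

```
service {
    migrate-threads 8    # higher than the default to speed up migration
}

network {
    heartbeat {
        mode mesh
        port 3012        # moved off 3002 (example port, an assumption)
        interval 500
        timeout 50
    }
}
```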
After the above changes, only the following warnings still remain at the beginning of the restart:
Oct 13 2016 10:08:36 GMT: WARNING (hb): (hb.c:1340) rejected tip for self: 10.17.1.12:3432
Oct 13 2016 10:08:36 GMT: WARNING (hb): (hb.c:2142) couldn't add remote heartbeat service 10.17.1.12:3432 to mesh host list.
Other warnings have gone away, including "recv peer error". @Kevin, if you have any suggestions about "rejected tip for self", please let me know.
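One note on what commonly triggers this message (an assumption about your setup, not confirmed from the logs): in mesh mode it is convenient to deploy the same seed list on every node, including each node's own address, and a node refuses the tip that points at itself. A config like the sketch below, identical on all nodes, would produce one "rejected tip for self" per node at startup:

```
network {
    heartbeat {
        mode mesh
        port 3432
        # Same seed list on every node; each node rejects the entry
        # naming its own address ("rejected tip for self").
        mesh-seed-address-port 10.17.1.7  3432
        mesh-seed-address-port 10.17.1.11 3432
        mesh-seed-address-port 10.17.1.12 3432
        mesh-seed-address-port 10.17.1.26 3432
    }
}
```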