As mentioned above, these are the steps we took:
- Added node 10.0.29.212 running community build 3.13.0.10.
- Waited for migrations to finish (new cluster size 11). Incoming migrations occurred only on 10.0.29.212, as expected.
- Added two nodes, 10.0.29.190 and 10.0.29.135, simultaneously, both running community build 3.13.0.10.
- Waited for migrations to finish (new cluster size 13). Incoming migrations occurred only on these two nodes, as expected.
- Added node 10.0.29.214 a few hours later, also running community build 3.13.0.10.
- Immediately after this node was added, the total master object count in the cluster dropped, incoming migrations started on all nodes, and we started getting timeouts on the cluster.
Debugging further, I went through the logs. After step 4 (cluster size 13), we expected the cluster size to become 14 the moment we added node 10.0.29.214.
Instead, the logs of 6 nodes show the cluster size dropping from 13 to 6, then going to 11, and finally to 14, all within a second. On the remaining 8 nodes, the cluster size dropped from 13 to 11 and then went to 14.
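To check which sequence of sizes a given node went through, the `CLUSTER SIZE = N` lines can be pulled out of its log in order. This is only a minimal sketch that assumes the log format shown in this post; the function name `cluster_size_sequence` is my own and not part of any Aerospike tool.

```python
import re

# Matches the size announcement in lines such as:
#   ... INFO (paxos): (paxos.c:2656) CLUSTER SIZE = 6
SIZE_RE = re.compile(r"CLUSTER SIZE = (\d+)")


def cluster_size_sequence(lines):
    """Return the logged cluster sizes in the order they appear."""
    sizes = []
    for line in lines:
        match = SIZE_RE.search(line)
        if match:
            sizes.append(int(match.group(1)))
    return sizes


# A few lines in the format used by the logs in this post.
sample = [
    "Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2656) CLUSTER SIZE = 6",
    "Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:245) ALLOW MIGRATIONS",
    "Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2656) CLUSTER SIZE = 11",
    "Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2656) CLUSTER SIZE = 14",
]
print(cluster_size_sequence(sample))  # [6, 11, 14]
```

Running the same function over each node's log file (for example, by passing an open file object as `lines`) gives one sequence per node, which makes it easy to separate the nodes that saw 6, 11, 14 from those that saw 11, 14.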
Here are the logs from a node where the size dropped to 6, followed by 11 and then 14:
Jul 05 2019 04:53:01 GMT: INFO (hb): (hb.c:9665) node arrived bb9fef8c95b8d06
Jul 05 2019 04:53:01 GMT: INFO (fabric): (fabric.c:2421) fabric: node bb9fef8c95b8d06 arrived
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2254) Skip node arrival bb9fef8c95b8d06 cluster principal bb9e244f471cf06 pulse principal bb9fef8c95b8d06
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:259) DISALLOW MIGRATIONS
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:432) cluster key set to 5fe908f890b4
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:3139) SUCCESSION [1562302381]@bb9fef8c95b8d06*: bb9fef8c95b8d06 bb9e244f471cf06 bb93632e5e0a606 bb9347b9af89f06 bb90c21b623ec06 bb902d507623f06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:3150) node bb9fef8c95b8d06 is now principal pro tempore
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2343) Sent partition sync request to node bb9fef8c95b8d06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:3243) received partition sync message from bb9fef8c95b8d06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2656) CLUSTER SIZE = 6
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:1211) {userdata} replication factor is 2
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:2568) {userdata} re-balanced, expected migrations - (440 tx, 301 rx)
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:2572) {userdata} fresh-partitions 1471
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:1211) {user_config_data} replication factor is 2
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:2568) {user_config_data} re-balanced, expected migrations - (440 tx, 301 rx)
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:2572) {user_config_data} fresh-partitions 1471
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:245) ALLOW MIGRATIONS
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2865) {1562302381} sending prepare_ack to bb9fef8c95b8d06
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:259) DISALLOW MIGRATIONS
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:432) cluster key set to e1eb5468cca7
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:3139) SUCCESSION [1562302382]@bb9fef8c95b8d06*: bb9fef8c95b8d06 bb9e244f471cf06 bb9e03d704b2f06 bb992977e164a06 bb9843f865e7f06 bb976057c354f06 bb95098458cde06 bb93632e5e0a606 bb9347b9af89f06 bb90c21b623ec06 bb902d507623f06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:3150) node bb9fef8c95b8d06 is now principal pro tempore
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2343) Sent partition sync request to node bb9fef8c95b8d06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:3243) received partition sync message from bb9fef8c95b8d06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2656) CLUSTER SIZE = 11
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:1211) {userdata} replication factor is 2
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:2568) {userdata} re-balanced, expected migrations - (653 tx, 379 rx)
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:1211) {user_config_data} replication factor is 2
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:2568) {user_config_data} re-balanced, expected migrations - (653 tx, 379 rx)
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:245) ALLOW MIGRATIONS
Jul 05 2019 04:53:01 GMT: INFO (hb): (hb.c:2668) marking node add for paxos recovery: bb9260a743a1906
Jul 05 2019 04:53:01 GMT: INFO (hb): (hb.c:2668) marking node add for paxos recovery: bb92cff9e8f4006
Jul 05 2019 04:53:01 GMT: INFO (hb): (hb.c:2668) marking node add for paxos recovery: bb9147134689606
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2561) Cluster Integrity Check: Detected succession list discrepancy between node bb95098458cde06 and self bb9347b9af89f06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:260) Paxos List [bb9fef8c95b8d06,bb9e244f471cf06,bb9e03d704b2f06,bb992977e164a06,bb9843f865e7f06,bb976057c354f06,bb95098458cde06,bb93632e5e0a606,bb9347b9af89f06,bb90c21b623ec06,bb902d507623f06]
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:260) Node List [bb9e244f471cf06,bb9e03d704b2f06,bb992977e164a06,bb9843f865e7f06,bb976057c354f06,bb95098458cde06,bb93632e5e0a606,bb9347b9af89f06,bb92cff9e8f4006,bb9260a743a1906,bb9147134689606,bb90c21b623ec06,bb902d507623f06]
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2561) Cluster Integrity Check: Detected succession list discrepancy between node bb93632e5e0a606 and self bb9347b9af89f06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:260) Paxos List [bb9fef8c95b8d06,bb9e244f471cf06,bb9e03d704b2f06,bb992977e164a06,bb9843f865e7f06,bb976057c354f06,bb95098458cde06,bb93632e5e0a606,bb9347b9af89f06,bb90c21b623ec06,bb902d507623f06]
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:260) Node List [bb9fef8c95b8d06,bb9e244f471cf06,bb93632e5e0a606,bb9347b9af89f06,bb90c21b623ec06,bb902d507623f06]
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2561) Cluster Integrity Check: Detected succession list discrepancy between node bb902d507623f06 and self bb9347b9af89f06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:260) Paxos List [bb9fef8c95b8d06,bb9e244f471cf06,bb9e03d704b2f06,bb992977e164a06,bb9843f865e7f06,bb976057c354f06,bb95098458cde06,bb93632e5e0a606,bb9347b9af89f06,bb90c21b623ec06,bb902d507623f06]
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:260) Node List [bb9fef8c95b8d06,bb9e244f471cf06,bb93632e5e0a606,bb9347b9af89f06,bb90c21b623ec06,bb902d507623f06]
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2424) Corrective changes: 3. Integrity fault: true
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2467) Marking node add for paxos recovery: bb9fef8c95b8d06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2467) Marking node add for paxos recovery: bb9e244f471cf06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2467) Marking node add for paxos recovery: bb9e03d704b2f06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2467) Marking node add for paxos recovery: bb992977e164a06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2467) Marking node add for paxos recovery: bb9843f865e7f06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2467) Marking node add for paxos recovery: bb976057c354f06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2467) Marking node add for paxos recovery: bb95098458cde06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2467) Marking node add for paxos recovery: bb93632e5e0a606
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2467) Marking node add for paxos recovery: bb9347b9af89f06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2467) Marking node add for paxos recovery: bb90c21b623ec06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2467) Marking node add for paxos recovery: bb902d507623f06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2490) Skipping paxos recovery: bb9fef8c95b8d06 will handle the recovery
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2865) {1562302382} sending prepare_ack to bb9fef8c95b8d06
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:259) DISALLOW MIGRATIONS
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:432) cluster key set to e068a1496c6c
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:3139) SUCCESSION [1562302383]@bb9fef8c95b8d06*: bb9fef8c95b8d06 bb9e244f471cf06 bb9e03d704b2f06 bb992977e164a06 bb9843f865e7f06 bb976057c354f06 bb95098458cde06 bb93632e5e0a606 bb9347b9af89f06 bb92cff9e8f4006 bb9260a743a1906 bb9147134689606 bb90c21b623ec06 bb902d507623f06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:3150) node bb9fef8c95b8d06 is now principal pro tempore
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2343) Sent partition sync request to node bb9fef8c95b8d06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:3243) received partition sync message from bb9fef8c95b8d06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2656) CLUSTER SIZE = 14
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:1211) {userdata} replication factor is 2
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:2568) {userdata} re-balanced, expected migrations - (657 tx, 391 rx)
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:1211) {user_config_data} replication factor is 2
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:2568) {user_config_data} re-balanced, expected migrations - (661 tx, 391 rx)
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:245) ALLOW MIGRATIONS
Jul 05 2019 04:53:02 GMT: WARNING (rw): (write.c:1795) {userdata} write_master: failed as_bin_cdt_alloc_modify_from_client() :0x1de9b38038df3b41d9505990d0f43579e4708837
Here are the logs from a node where the cluster size dropped to 11 and then 14:
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:137) DISALLOW MIGRATIONS
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:153) cluster_key set to 0xe1eb5468cca7
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:3160) SUCCESSION [1562302382]@bb9fef8c95b8d06*: bb9fef8c95b8d06 bb9e244f471cf06 bb9e03d704b2f06 bb992977e164a06 bb9843f865e7f06 bb976057c354f06 bb95098458cde06 bb93632e5e0a606 bb9347b9af89f06 bb90c21b623ec06 bb902d507623f06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:3171) node bb9fef8c95b8d06 is now principal pro tempore
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2385) Sent partition sync request to node bb9fef8c95b8d06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:3261) received partition sync message from bb9fef8c95b8d06
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:343) CLUSTER SIZE = 11
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:1032) setting replication factors: cluster size 11, paxos single replica limit 1
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:1047) {userdata} replication factor is 2
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:1047) {user_config_data} replication factor is 2
Jul 05 2019 04:53:01 GMT: INFO (config): (cluster_config.c:421) rack aware is disabled
Jul 05 2019 04:53:01 GMT: INFO (partition): (cluster_config.c:380) rack aware is disabled
Jul 05 2019 04:53:01 GMT: INFO (fabric): (fabric.c:1685) fabric: node bb9fef8c95b8d06 arrived
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:476) {userdata} re-balanced, expected migrations - (554 tx, 760 rx)
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:476) {user_config_data} re-balanced, expected migrations - (554 tx, 760 rx)
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:129) ALLOW MIGRATIONS
Jul 05 2019 04:53:01 GMT: WARNING (smd): (system_metadata.c:1939) failed to get metadata operation from failed transaction response msg (err -2 ; fabric err 0)
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2886) {1562302382} sending prepare_ack to bb9fef8c95b8d06
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:137) DISALLOW MIGRATIONS
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:153) cluster_key set to 0xe068a1496c6c
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:3160) SUCCESSION [1562302383]@bb9fef8c95b8d06*: bb9fef8c95b8d06 bb9e244f471cf06 bb9e03d704b2f06 bb992977e164a06 bb9843f865e7f06 bb976057c354f06 bb95098458cde06 bb93632e5e0a606 bb9347b9af89f06 bb92cff9e8f4006 bb9260a743a1906 bb9147134689606 bb90c21b623ec06 bb902d507623f06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:3171) node bb9fef8c95b8d06 is now principal pro tempore
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:2385) Sent partition sync request to node bb9fef8c95b8d06
Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:3261) received partition sync message from bb9fef8c95b8d06
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:343) CLUSTER SIZE = 14
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:1032) setting replication factors: cluster size 14, paxos single replica limit 1
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:1047) {userdata} replication factor is 2
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:1047) {user_config_data} replication factor is 2
Jul 05 2019 04:53:01 GMT: INFO (config): (cluster_config.c:421) rack aware is disabled
Jul 05 2019 04:53:01 GMT: INFO (partition): (cluster_config.c:380) rack aware is disabled
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:476) {userdata} re-balanced, expected migrations - (530 tx, 669 rx)
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:476) {user_config_data} re-balanced, expected migrations - (530 tx, 670 rx)
Jul 05 2019 04:53:01 GMT: INFO (partition): (partition_balance.c:129) ALLOW MIGRATIONS
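To see exactly which nodes were left out of the intermediate 11-node view, the two `SUCCESSION` lines from the log above can be compared. The following is a rough sketch that assumes the line format shown in this post; `parse_succession` and `missing_nodes` are names I made up for illustration, and calling the bracketed number a "sequence" is my own label.

```python
import re

# Matches lines such as:
#   ... SUCCESSION [1562302382]@bb9fef8c95b8d06*: bb9fef8c95b8d06 bb9e244f471cf06 ...
SUCCESSION_RE = re.compile(r"SUCCESSION \[(\d+)\]@(\w+)\*?: (.+)$")


def parse_succession(line):
    """Return (sequence, principal, members), or None if the line does not match."""
    match = SUCCESSION_RE.search(line.strip())
    if match is None:
        return None
    sequence, principal, members = match.groups()
    return int(sequence), principal, members.split()


def missing_nodes(partial, full):
    """Nodes present in `full` but absent from `partial`, in `full` order."""
    present = set(partial)
    return [node for node in full if node not in present]


# The two SUCCESSION lines from the log above, split only for readability.
PREFIX = "Jul 05 2019 04:53:01 GMT: INFO (paxos): (paxos.c:3160) "
line_11 = PREFIX + (
    "SUCCESSION [1562302382]@bb9fef8c95b8d06*: "
    "bb9fef8c95b8d06 bb9e244f471cf06 bb9e03d704b2f06 bb992977e164a06 "
    "bb9843f865e7f06 bb976057c354f06 bb95098458cde06 bb93632e5e0a606 "
    "bb9347b9af89f06 bb90c21b623ec06 bb902d507623f06"
)
line_14 = PREFIX + (
    "SUCCESSION [1562302383]@bb9fef8c95b8d06*: "
    "bb9fef8c95b8d06 bb9e244f471cf06 bb9e03d704b2f06 bb992977e164a06 "
    "bb9843f865e7f06 bb976057c354f06 bb95098458cde06 bb93632e5e0a606 "
    "bb9347b9af89f06 bb92cff9e8f4006 bb9260a743a1906 bb9147134689606 "
    "bb90c21b623ec06 bb902d507623f06"
)

_, _, members_11 = parse_succession(line_11)
_, _, members_14 = parse_succession(line_14)
print(missing_nodes(members_11, members_14))
# ['bb92cff9e8f4006', 'bb9260a743a1906', 'bb9147134689606']
```

The three IDs it reports are the same ones that show up in the `marking node add for paxos recovery` heartbeat lines in the first log.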
**Node Info**
| Node | NodeId | Version |
|---|---|---|
| 10-0-23-103 | BB95098458CDE06 | E-3.9.1-158-g1e8db6e |
| 10-0-23-142 | BB9E244F471CF06 | E-3.9.1-158-g1e8db6e |
| 10-0-23-169 | BB9E03D704B2F06 | E-3.9.1-158-g1e8db6e |
| 10-0-23-171 | BB92CFF9E8F4006 | E-3.9.1-158-g1e8db6e |
| 10-0-23-185 | BB902D507623F06 | E-3.9.1-158-g1e8db6e |
| 10-0-23-62 | BB9147134689606 | E-3.9.1-158-g1e8db6e |
| 10-0-23-71 | BB9843F865E7F06 | E-3.9.1-158-g1e8db6e |
| 10-0-23-76 | BB9260A743A1906 | E-3.9.1-158-g1e8db6e |
| 10-0-23-8 | BB992977E164A06 | E-3.9.1-158-g1e8db6e |
| 10-0-29-135 | BB90C21B623EC06 | C-3.13.0.10 |
| 10-0-29-190 | BB93632E5E0A606 | C-3.13.0.10 |
| 10-0-29-212 | BB9347B9AF89F06 | C-3.13.0.10 |
| 10-0-29-214 | *BB9FEF8C95B8D06 | C-3.13.0.10 |
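Combining the node info with the succession lists shows which server versions made up each intermediate cluster view. This is just an illustrative sketch: the mapping is copied from the node info above, `versions_in` is a hypothetical helper, and the result describes the composition only, it does not by itself explain the cause. Note that the node info lists 13 nodes while the final succession list has 14 IDs; `bb976057c354f06` is not listed, so it is reported as `unknown`.

```python
from collections import Counter

E_VERSION = "E-3.9.1-158-g1e8db6e"
C_VERSION = "C-3.13.0.10"

# Node ID -> version, copied from the node info in this post.
NODE_VERSIONS = {
    "BB95098458CDE06": E_VERSION,
    "BB9E244F471CF06": E_VERSION,
    "BB9E03D704B2F06": E_VERSION,
    "BB92CFF9E8F4006": E_VERSION,
    "BB902D507623F06": E_VERSION,
    "BB9147134689606": E_VERSION,
    "BB9843F865E7F06": E_VERSION,
    "BB9260A743A1906": E_VERSION,
    "BB992977E164A06": E_VERSION,
    "BB90C21B623EC06": C_VERSION,
    "BB93632E5E0A606": C_VERSION,
    "BB9347B9AF89F06": C_VERSION,
    "BB9FEF8C95B8D06": C_VERSION,
}


def versions_in(succession):
    """Count versions among the node IDs of one succession list."""
    # Logs print IDs in lowercase, the node info in uppercase, so normalise.
    return Counter(NODE_VERSIONS.get(node.upper(), "unknown") for node in succession)


# Members of the 6-node succession list from the first log.
six_node_view = (
    "bb9fef8c95b8d06 bb9e244f471cf06 bb93632e5e0a606 "
    "bb9347b9af89f06 bb90c21b623ec06 bb902d507623f06"
).split()

counts = versions_in(six_node_view)
print(counts[C_VERSION], counts[E_VERSION])  # 4 2
```

So the transient 6-node view consisted of all four 3.13.0.10 nodes plus two of the 3.9.1 nodes.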