Node will not join cluster after upgrade

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Problem Description

When a cluster is being upgraded from Aerospike 3.13 to a version later than Aerospike 3.14, the first node upgraded starts but does not join the cluster and instead forms a single-node cluster on its own. The aerospike.log shows messages such as the following:

Aug 13 2019 08:40:27 GMT: WARNING (hb): (hb.c:4647) (repeated:9) unable to parse heartbeat message on fd 71
Aug 13 2019 08:40:27 GMT: WARNING (hb): (hb.c:4647) (repeated:4) unable to parse heartbeat message on fd 74
Aug 13 2019 08:40:27 GMT: WARNING (hb): (hb.c:4647) (repeated:24) unable to parse heartbeat message on fd 68

These messages will display even when there have been no changes to mesh config or routing.
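
To confirm how widespread the warning is, the log can be scanned for it. The following is a minimal sketch, not an Aerospike tool: the function name count_hb_parse_failures is hypothetical, the regular expression is derived only from the three sample lines above, and treating "(repeated:N)" as N occurrences of the message is an assumption.

```python
import re
from collections import Counter

# Pattern derived from the sample log lines in this article.
# The "(repeated:N)" part is treated as optional (assumption).
HB_WARNING = re.compile(
    r"WARNING \(hb\): \(hb\.c:\d+\) (?:\(repeated:(\d+)\) )?"
    r"unable to parse heartbeat message on fd (\d+)"
)


def count_hb_parse_failures(lines):
    """Return a Counter mapping file descriptor -> reported parse failures."""
    counts = Counter()
    for line in lines:
        match = HB_WARNING.search(line)
        if match is None:
            continue
        # Assumption: "(repeated:N)" means the message occurred N times.
        repeated = int(match.group(1)) if match.group(1) else 1
        counts[int(match.group(2))] += repeated
    return counts


sample = [
    "Aug 13 2019 08:40:27 GMT: WARNING (hb): (hb.c:4647) (repeated:9) unable to parse heartbeat message on fd 71",
    "Aug 13 2019 08:40:27 GMT: WARNING (hb): (hb.c:4647) (repeated:4) unable to parse heartbeat message on fd 74",
    "Aug 13 2019 08:40:27 GMT: WARNING (hb): (hb.c:4647) (repeated:24) unable to parse heartbeat message on fd 68",
]

for fd, total in sorted(count_hb_parse_failures(sample).items()):
    print(f"fd {fd}: {total} unparsable heartbeat messages")
```

In practice the lines would be read from the aerospike.log file rather than from the inline sample. Warnings on several file descriptors at once suggest that the node cannot parse heartbeats from multiple peers, which is consistent with a protocol mismatch rather than a single faulty connection.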

Explanation

The inability to parse heartbeat messages indicates that the upgraded node and the remaining nodes are using different cluster protocols. Aerospike 3.13 introduced a change to the cluster protocol that allowed huge improvements to cluster performance; however, the new protocol is not compatible with previous versions. This change does not happen automatically as part of the Aerospike 3.13 upgrade but must instead be made post-upgrade by running a script. If, for some reason, the script has not been run, the cluster protocol has not been changed and the Aerospike 3.13 nodes are still using the old protocol. Versions of Aerospike later than 3.14 must use the new protocol, so the upgraded node and the remaining nodes cannot parse each other's heartbeats.

Old nodes will show the old cluster protocol as follows:

Admin> show config like protocol
~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Configuration~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE              :   172.17.0.3:3000   172.17.0.4:3000   86804bde1c48:3000   
heartbeat.protocol:   v2                v2                v2                  
paxos-protocol    :   v3                v3                v3                  

~~~~~~~~~~~~~~~~~~~~~~~~~~~Network Configuration~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE              :   172.17.0.3:3000   172.17.0.4:3000   86804bde1c48:3000   
heartbeat.protocol:   v2                v2                v2                  

Admin> 

The upgraded node will show:

Admin> show config like protocol
~~~~~~~~~~~~~~Service Configuration (2019-08-13 17:16:02 UTC)~~~~~~~~~~~~~~
NODE              :   172.17.0.6:3000      
heartbeat.protocol:   v3                           

~~~~~~~~~~~~~~Network Configuration (2019-08-13 17:16:02 UTC)~~~~~~~~~~~~~~
NODE              :   172.17.0.6:3000     
heartbeat.protocol:   v3                             

Admin> 

The heartbeat protocol is different (v3 on the upgraded node, v2 on the old nodes), and the upgraded node does not show paxos-protocol, as this is now deprecated.
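
The comparison of the two outputs can also be done programmatically. The following is a minimal sketch that parses the text of the asadm output shown above and reports the heartbeat protocol per node. The function name heartbeat_protocols and the variable names are hypothetical, and the parsing assumes the column layout seen in the two samples in this article; other asadm versions may format the output differently.

```python
def heartbeat_protocols(asadm_output):
    """Map node -> heartbeat.protocol from 'show config like protocol' text."""
    nodes = []
    protocols = {}
    for raw_line in asadm_output.splitlines():
        line = raw_line.strip()
        # Skip blank lines, section banners and the asadm prompt.
        if not line or line.startswith("~") or line.startswith("Admin>"):
            continue
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        values = rest.split()
        if name.strip() == "NODE":
            nodes = values
        elif name.strip() == "heartbeat.protocol":
            protocols.update(zip(nodes, values))
    return protocols


# Abbreviated copies of the two outputs shown in this article.
OLD_NODES = """\
~~~~~~~~~~~~~~~~~~~~~~~~~~~Network Configuration~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE              :   172.17.0.3:3000   172.17.0.4:3000   86804bde1c48:3000
heartbeat.protocol:   v2                v2                v2
"""

UPGRADED_NODE = """\
~~~~~~~~~~~~~~Network Configuration (2019-08-13 17:16:02 UTC)~~~~~~~~~~~~~~
NODE              :   172.17.0.6:3000
heartbeat.protocol:   v3
"""

all_nodes = {**heartbeat_protocols(OLD_NODES), **heartbeat_protocols(UPGRADED_NODE)}
versions_in_use = set(all_nodes.values())
if len(versions_in_use) > 1:
    print("Heartbeat protocol mismatch:", all_nodes)
```

More than one distinct value across the nodes indicates the mismatch described in this article.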

Solution

The simplest solution is to run the cluster protocol script on the remaining nodes within the cluster and then continue with the upgrades. Steps are as follows:

  • Shut down the upgraded node.
  • Allow migrations to finish in the remaining cluster.
  • When migrations are finished, run the script as described in the Aerospike 3.13 Special Upgrade Instructions.
  • When the cluster protocol has been changed and checked, restart the upgraded node.
  • Continue with the upgrade as planned.

Notes

  • All Aerospike special upgrade instructions can be found on the Special Upgrades documentation page.
  • The current cluster protocol version can be checked prior to the upgrade using the asadm command line tool, for example with the show config like protocol command.

Keywords

UNABLE TO PARSE HEARTBEAT FD PAXOS PROTOCOL UPGRADE CLUSTER

Timestamp

August 2019