Unsupported proto version errors after maintenance on Mesh cluster


Problem Description

When an Aerospike cluster configured for Mesh heartbeats is brought back up after routine maintenance, the restarted nodes do not rejoin the cluster, and their aerospike.log files show warnings such as the following:

May 10 2016 19:25:44 GMT: WARNING (demarshal): (thr_demarshal.c:710) proto input from 192.168.1.100:54779: unsupported proto version 0
May 10 2016 19:25:45 GMT: WARNING (demarshal): (thr_demarshal.c:710) proto input from 192.168.1.101:52078: unsupported proto version 0

Explanation

This warning means that a node is receiving messages it does not recognize on its main service port (usually port 3000). The IP addresses in the warning lines belong to other nodes in the cluster. The cause is that those nodes were misconfigured during maintenance so that they send Mesh heartbeats to the service port instead of the heartbeat port. Their heartbeat stanza will look similar to the following:

heartbeat {
    mode mesh                   # Send heartbeats using Mesh (Unicast) protocol
    address 192.168.1.100       # (Optional) (Default: any) IP of the NIC on
                                # which this node is listening to heartbeat
    port 3002                   # port on which this node is listening to
                                # heartbeat
    mesh-seed-address-port 192.168.1.100 3000 # IP address for seed node in the cluster
                                              # This IP happens to be the local node
    mesh-seed-address-port 192.168.1.101 3000 # IP address for seed node in the cluster
    mesh-seed-address-port 192.168.1.102 3000 # IP address for seed node in the cluster
    mesh-seed-address-port 192.168.1.103 3000 # IP address for seed node in the cluster

    interval 150                # Number of milliseconds between heartbeats
    timeout 10                  # Number of heartbeat intervals to wait before
                                # timing out a node
}

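This is incorrect: the mesh-seed-address-port entries point at port 3000, the service port that clients connect to, whereas Mesh heartbeats are exchanged on the heartbeat port, 3002.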
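
To see which nodes are sending heartbeats to the service port, the source addresses can be pulled out of the warning lines. The Python sketch below assumes the log line format shown above (the wording may differ between server versions); offending_hosts is a hypothetical helper, not part of any Aerospike tool. Run it with a log path as its argument; with no argument it parses the two sample lines from this article.

```python
import re
import sys
from collections import Counter

# Matches warnings like the ones in this article, e.g.
# "proto input from 192.168.1.100:54779: unsupported proto version 0".
# Assumption: the message wording matches this server version's log format.
PATTERN = re.compile(
    r"proto input from (\d{1,3}(?:\.\d{1,3}){3}):\d+: unsupported proto version"
)

# The two warning lines quoted in this article.
SAMPLE = [
    "May 10 2016 19:25:44 GMT: WARNING (demarshal): (thr_demarshal.c:710) "
    "proto input from 192.168.1.100:54779: unsupported proto version 0",
    "May 10 2016 19:25:45 GMT: WARNING (demarshal): (thr_demarshal.c:710) "
    "proto input from 192.168.1.101:52078: unsupported proto version 0",
]


def offending_hosts(lines):
    """Count unsupported-proto warnings per source IP (the source port is ignored)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Hypothetical usage: python offending_hosts.py /path/to/aerospike.log
        with open(sys.argv[1]) as log:
            counts = offending_hosts(log)
    else:
        counts = offending_hosts(SAMPLE)
    for ip, n in counts.most_common():
        print(f"{ip}\t{n} warning(s)")
```

Each reported IP is a node whose mesh-seed-address-port entries should be checked.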

Solution

To fix this, check the /etc/aerospike/aerospike.conf file on every node and make sure that each mesh-seed-address-port entry uses the heartbeat port, 3002. The corrected stanza looks as follows:

heartbeat {
    mode mesh                   # Send heartbeats using Mesh (Unicast) protocol
    address 192.168.1.100       # (Optional) (Default: any) IP of the NIC on
                                # which this node is listening to heartbeat
    port 3002                   # port on which this node is listening to
                                # heartbeat
    mesh-seed-address-port 192.168.1.100 3002 # IP address for seed node in the cluster
                                              # This IP happens to be the local node
    mesh-seed-address-port 192.168.1.101 3002 # IP address for seed node in the cluster
    mesh-seed-address-port 192.168.1.102 3002 # IP address for seed node in the cluster
    mesh-seed-address-port 192.168.1.103 3002 # IP address for seed node in the cluster

    interval 150                # Number of milliseconds between heartbeats
    timeout 10                  # Number of heartbeat intervals to wait before
                                # timing out a node
}

These parameters are static, so each affected node must be restarted for the change to take effect.
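
On a larger cluster the check can be scripted. The Python sketch below is a minimal line-based parser, not a full aerospike.conf parser: it assumes the brace-delimited layout used in this article, with "heartbeat {" opened on a single line, and that every node listens for heartbeats on the same port. It reports mesh-seed-address-port entries whose port differs from the node's own heartbeat port. check_heartbeat and report are hypothetical helper names. Pass one or more config paths as arguments; with no arguments it checks a sample copy of the misconfigured stanza.

```python
import sys

# Sample of the misconfigured stanza from this article (seeds point at 3000).
SAMPLE = """network {
    heartbeat {
        mode mesh
        port 3002
        mesh-seed-address-port 192.168.1.100 3000
        mesh-seed-address-port 192.168.1.101 3000
    }
}"""


def check_heartbeat(conf_text):
    """Return (heartbeat_port, seeds_whose_port_differs).

    Assumes a simple brace-delimited layout with 'heartbeat {' on one line.
    """
    depth = 0        # current brace nesting depth
    hb_depth = None  # depth just outside the heartbeat block, while inside it
    hb_port = None
    seeds = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        tokens = line.split()
        if tokens[0] == "heartbeat" and line.endswith("{"):
            hb_depth = depth
        elif hb_depth is not None and depth == hb_depth + 1:
            # Only look at settings directly inside the heartbeat block.
            if tokens[0] == "port" and len(tokens) == 2:
                hb_port = int(tokens[1])
            elif tokens[0] == "mesh-seed-address-port" and len(tokens) == 3:
                seeds.append((tokens[1], int(tokens[2])))
        depth += line.count("{") - line.count("}")
        if hb_depth is not None and depth <= hb_depth:
            hb_depth = None  # the heartbeat block has been closed
    mismatched = [(ip, port) for ip, port in seeds if port != hb_port]
    return hb_port, mismatched


def report(name, conf_text):
    """Print any mismatched seeds; return True if the stanza looks consistent."""
    hb_port, mismatched = check_heartbeat(conf_text)
    if hb_port is None:
        print(f"{name}: no heartbeat port found")
    for ip, port in mismatched:
        print(f"{name}: seed {ip} uses port {port}, heartbeat port is {hb_port}")
    return not mismatched


if __name__ == "__main__":
    if len(sys.argv) > 1:
        for path in sys.argv[1:]:
            with open(path) as conf:
                report(path, conf.read())
    else:
        report("sample", SAMPLE)
```

Running it against each node's configuration before the restart helps ensure that every node comes back with consistent seed ports.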

Notes

Full documentation is available at the link below:

http://www.aerospike.com/docs/operations/configure/network/heartbeat/

Keywords

MESH UNSUPPORTED PROTO 3000 3002 PROTO INPUT

Timestamp

5/11/16