Cluster broke apart


#1

I’m using mesh mode to form a cluster. It worked fine until, some time ago, the cluster broke apart. All nodes are running Aerospike, but they no longer form a single cluster. Aerospike runs under Docker in host network mode. Here is my config:

# Aerospike database configuration file.

# This stanza must come first.
service {
    user root
    group root
#    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    pidfile /var/run/aerospike/asd.pid
    service-threads 16
    transaction-queues 4
    transaction-threads-per-queue 4
    proto-fd-max 100000
    proto-fd-idle-ms 10000
    paxos-recovery-policy auto-reset-master
    paxos-max-cluster-size 60
    query-in-transaction-thread true
    allow-inline-transactions false
}

logging {

    # Log file must be an absolute path.
    file /var/log/aerospike/aerospike.log {
        context any info
    }

    # Send log messages to stdout
    console {
        context any critical
    }
}

network {
    service {
        address any
        port 3000

        # Set the `access-address` parameter to the IP address of the Docker
        # host. This allows the server to correctly publish the address that
        # applications and other nodes in the cluster should use when
        # addressing this node.
        access-address 10.x.x.30
    }

    heartbeat {
        #mode multicast
        #address 239.1.99.2
        #port 9918

        # mesh is used for environments that do not support multicast
        mode mesh
        port 3002

        mesh-seed-address-port node.dc01.domain 3002
        mesh-seed-address-port node.dc02.domain 3002
        mesh-seed-address-port node.dc03.domain 3002
        mesh-seed-address-port node.dc04.domain 3002
        mesh-seed-address-port node.dc05.domain 3002
        mesh-seed-address-port node.dc06.domain 3002
        mesh-seed-address-port node.dc07.domain 3002
        mesh-seed-address-port node.dc08.domain 3002
        mesh-seed-address-port node.dc09.domain 3002
        mesh-seed-address-port node.dc10.domain 3002
        mesh-seed-address-port node.dc11.domain 3002
        mesh-seed-address-port node.dc12.domain 3002
        mesh-seed-address-port node.dc13.domain 3002
        mesh-seed-address-port node.dc14.domain 3002
        mesh-seed-address-port node.dc15.domain 3002
        mesh-seed-address-port node.dc16.domain 3002
        mesh-seed-address-port node.dc17.domain 3002
        mesh-seed-address-port node.dc18.domain 3002
        mesh-seed-address-port node.dc19.domain 3002
        mesh-seed-address-port node.dc20.domain 3002
        mesh-seed-address-port node.dc21.domain 3002
        mesh-seed-address-port node.dc22.domain 3002
        mesh-seed-address-port node.dc23.domain 3002
        mesh-seed-address-port node.dc24.domain 3002
        mesh-seed-address-port node.dc25.domain 3002
        mesh-seed-address-port node.dc26.domain 3002
        mesh-seed-address-port node.dc27.domain 3002
        mesh-seed-address-port node.dc28.domain 3002
        mesh-seed-address-port node.dc29.domain 3002
        mesh-seed-address-port node.dc30.domain 3002
        mesh-seed-address-port node.dc31.domain 3002
        mesh-seed-address-port node.dc32.domain 3002
        mesh-seed-address-port node.dc33.domain 3002

        # use asinfo -v 'tip:host=<ADDR>;port=3002' to inform cluster of
        # other mesh nodes

        interval 150
        timeout 20
    }

    fabric {
        port 3001
    }

    info {
        port 3003
    }
}

namespace test {
    replication-factor 2
    memory-size 5G
    default-ttl 1d # 1 day; use 0 to never expire/evict.

    storage-engine memory

    # To use file storage backing, comment out the line above and use the
    # following lines instead.
    #storage-engine device {
    #    file /opt/aerospike/data/test.dat
    #    filesize 4G
    #    data-in-memory true # Store data in memory in addition to file.
    #}
}

namespace incProfiles {
    replication-factor 2
    memory-size 90G
    default-ttl 6d

    storage-engine memory
    high-water-memory-pct 90
    stop-writes-pct 95
    #high-water-disk-pct 50
    #conflict-resolution-policy generation
}

Here is network status of the cluster in bad state:

Admin> info network
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Network Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node                    Node Id            Ip               Build     Cluster Size   Cluster Key        Cluster Integrity   Principal          Client Conns   Uptime
10.x.x.10:3000                     000000000000000    10.x.x.10:3000   N/E           N/E   N/E                N/E         N/E                  N/E   N/E
10.x.x.17:3000                     000000000000000    10.x.x.17:3000   N/E           N/E   N/E                N/E         N/E                  N/E   N/E
10.x.x.19:3000                     000000000000000    10.x.x.19:3000   N/E           N/E   N/E                N/E         N/E                  N/E   N/E
10.x.x.40:3000                     000000000000000    10.x.x.40:3000   N/E           N/E   N/E                N/E         N/E                  N/E   N/E
10.x.x.42:3000                     000000000000000    10.x.x.42:3000   N/E           N/E   N/E                N/E         N/E                  N/E   N/E
node.dc02.domain:3000   BB970ED6C3ACAB8    10.x.x.11:3000   C-3.9.1         1   C8CD59622BFE51BF   True        BB970ED6C3ACAB8        5   618:25:27
node.dc03.domain:3000   BB9000A6E3ACAB8    10.x.x.12:3000   C-3.9.1         1   FD1C975115230888   True        BB9000A6E3ACAB8        6   618:25:27
node.dc04.domain:3000   BB9B0EA6C3ACAB8    10.x.x.13:3000   C-3.9.1         2   C2E9C37D348C8DAF   False       BB9B0EA6C3ACAB8        9   618:25:26
node.dc06.domain:3000   BB988EC6C3ACAB8    10.x.x.15:3000   C-3.9.1         1   9B4764CE266D06F1   True        BB988EC6C3ACAB8        6   618:25:26
node.dc07.domain:3000   BB9D0F96C3ACAB8    10.x.x.16:3000   C-3.9.1         1   9D14A9D9CC435FBB   True        BB9D0F96C3ACAB8        6   618:25:26
node.dc09.domain:3000   BB958F66C3ACAB8    10.x.x.18:3000   C-3.9.1         1   CF6FA645E615F23E   True        BB958F66C3ACAB8        5   618:25:26
node.dc11.domain:3000   BB950EF6C3ACAB8    10.x.x.20:3000   C-3.9.1         1   9681209EDEE43B08   True        BB950EF6C3ACAB8        6   618:25:26
node.dc12.domain:3000   BB920F56C3ACAB8    10.x.x.21:3000   C-3.9.1         1   DF3C2415BF19C57D   True        BB920F56C3ACAB8        6   618:25:26
node.dc13.domain:3000   *BB9F0ED6C3ACAB8   10.x.x.22:3000   C-3.9.1         1   A262B11D8B30F641   True        BB9F0ED6C3ACAB8      178   618:25:26
node.dc14.domain:3000   BB9B8F56C3ACAB8    10.x.x.23:3000   C-3.9.1         2   7B07432253E20B4A   False       BB9B8F56C3ACAB8        9   618:25:26
node.dc15.domain:3000   BB900FD6C3ACAB8    10.x.x.24:3000   C-3.9.1         2   7B07432253E20B4A   False       BB9B8F56C3ACAB8        9   618:25:26
node.dc16.domain:3000   BB9B0F26C3ACAB8    10.x.x.25:3000   C-3.9.1         1   57E1D8F60D5C3C71   True        BB9B0F26C3ACAB8        6   618:25:26
node.dc17.domain:3000   BB948086E3ACAB8    10.x.x.26:3000   C-3.9.1         1   13D50C00688A09D2   True        BB948086E3ACAB8        5   618:25:26
node.dc18.domain:3000   BB930046D3ACAB8    10.x.x.27:3000   C-3.9.1         1   B1A1F16BAEE15751   True        BB930046D3ACAB8        6   618:25:26
node.dc19.domain:3000   BB9A0F76C3ACAB8    10.x.x.28:3000   C-3.9.1         1   6C3707DDDB0106D6   True        BB9A0F76C3ACAB8        5   618:25:26
node.dc20.domain:3000   BB9C0106E3ACAB8    10.x.x.29:3000   C-3.9.1         1   C249ACA190A7C653   True        BB9C0106E3ACAB8      410   618:25:26
node.dc21.domain:3000   BB9A8F76C3ACAB8    10.x.x.30:3000   C-3.9.1         2   B9ADA580898A8ED2   False       BB9D0EE6C3ACAB8        9   618:25:26
node.dc22.domain:3000   BB978F36D3ACAB8    10.x.x.31:3000   C-3.9.1         1   972A9B8E5F10269B   True        BB978F36D3ACAB8        6   618:25:26
node.dc23.domain:3000   BB9E0106E3ACAB8    10.x.x.32:3000   C-3.9.1         1   ED943079A4BEBDB6   True        BB9E0106E3ACAB8        6   618:25:26
node.dc24.domain:3000   BB9C0086E3ACAB8    10.x.x.33:3000   C-3.9.1         1   DD55D01A8C0729D7   True        BB9C0086E3ACAB8        5   618:25:26
node.dc25.domain:3000   BB9C0F06D3ACAB8    10.x.x.34:3000   C-3.9.1         1   4445F126014122D4   True        BB9C0F06D3ACAB8        6   618:25:26
node.dc26.domain:3000   BB9A8BF6F3ACAB8    10.x.x.35:3000   C-3.9.1         1   2AF01ED007DE00B0   True        BB9A8BF6F3ACAB8        8   618:25:26
node.dc27.domain:3000   BB9D8056E3ACAB8    10.x.x.36:3000   C-3.9.1         1   E956B97BC70D5C9D   True        BB9D8056E3ACAB8        2   618:25:26
node.dc28.domain:3000   BB9E0EE6C3ACAB8    10.x.x.37:3000   C-3.9.1         1   BBF8236C102B8BC1   True        BB9E0EE6C3ACAB8        6   618:25:26
node.dc29.domain:3000   BB9380D6E3ACAB8    10.x.x.38:3000   C-3.9.1         1   ED29A5BF080900CE   True        BB9380D6E3ACAB8        6   618:25:26
node.dc30.domain:3000   BB9A0BD6F3ACAB8    10.x.x.39:3000   C-3.9.1         2   C2E9C37D348C8DAF   False       BB9B0EA6C3ACAB8        9   618:25:26
node.dc32.domain:3000   BB9D0EE6C3ACAB8    10.x.x.41:3000   C-3.9.1         2   B9ADA580898A8ED2   False       BB9D0EE6C3ACAB8        9   618:25:26
Number of rows: 32

Configs are almost the same (except for the replication factor and the access address):

Admin> show config diff
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Network Configuration~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                  :   10.x.x.10:3000   10.x.x.17:3000   10.x.x.19:3000   10.x.x.40:3000   10.x.x.42:3000   node.dc02.domain:3000   node.dc03.domain:3000   node.dc04.domain:3000   node.dc06.domain:3000   node.dc07.domain:3000   node.dc09.domain:3000   node.dc11.domain:3000   node.dc12.domain:3000   node.dc13.domain:3000   node.dc14.domain:3000   node.dc15.domain:3000   node.dc16.domain:3000   node.dc17.domain:3000   node.dc18.domain:3000   node.dc19.domain:3000   node.dc20.domain:3000   node.dc21.domain:3000   node.dc22.domain:3000   node.dc23.domain:3000   node.dc24.domain:3000   node.dc25.domain:3000   node.dc26.domain:3000   node.dc27.domain:3000   node.dc28.domain:3000   node.dc29.domain:3000   node.dc30.domain:3000   node.dc32.domain:3000
service.access-address:   N/E              N/E              N/E              N/E              N/E              10.x.x.11                          N/E                                N/E                                N/E                                N/E                                10.x.x.18                          N/E                                N/E                                10.x.x.22                          N/E                                N/E                                N/E                                10.x.x.26                          N/E                                10.x.x.28                          N/E                                N/E                                N/E                                N/E                                10.x.x.33                          N/E                                10.x.x.35                          10.x.x.36                          N/E                                N/E                                N/E                                N/E

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~test Namespace Configuration~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE       :   node.dc02.domain:3000   node.dc03.domain:3000   node.dc04.domain:3000   node.dc06.domain:3000   node.dc07.domain:3000   node.dc09.domain:3000   node.dc11.domain:3000   node.dc12.domain:3000   node.dc13.domain:3000   node.dc14.domain:3000   node.dc15.domain:3000   node.dc16.domain:3000   node.dc17.domain:3000   node.dc18.domain:3000   node.dc19.domain:3000   node.dc20.domain:3000   node.dc21.domain:3000   node.dc22.domain:3000   node.dc23.domain:3000   node.dc24.domain:3000   node.dc25.domain:3000   node.dc26.domain:3000   node.dc27.domain:3000   node.dc28.domain:3000   node.dc29.domain:3000   node.dc30.domain:3000   node.dc32.domain:3000
repl-factor:   1                                  1                                  2                                  1                                  1                                  1                                  1                                  1                                  1                                  2                                  2                                  1                                  1                                  1                                  1                                  1                                  2                                  1                                  1                                  1                                  1                                  1                                  1                                  1                                  1                                  2                                  2

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~incProfiles Namespace Configuration~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE       :   node.dc02.domain:3000   node.dc03.domain:3000   node.dc04.domain:3000   node.dc06.domain:3000   node.dc07.domain:3000   node.dc09.domain:3000   node.dc11.domain:3000   node.dc12.domain:3000   node.dc13.domain:3000   node.dc14.domain:3000   node.dc15.domain:3000   node.dc16.domain:3000   node.dc17.domain:3000   node.dc18.domain:3000   node.dc19.domain:3000   node.dc20.domain:3000   node.dc21.domain:3000   node.dc22.domain:3000   node.dc23.domain:3000   node.dc24.domain:3000   node.dc25.domain:3000   node.dc26.domain:3000   node.dc27.domain:3000   node.dc28.domain:3000   node.dc29.domain:3000   node.dc30.domain:3000   node.dc32.domain:3000
repl-factor:   1                                  1                                  2                                  1                                  1                                  1                                  1                                  1                                  1                                  2                                  2                                  1                                  1                                  1                                  1                                  1                                  2                                  1                                  1                                  1                                  1                                  1                                  1                                  1                                  1                                  2                                  2

I tried the tip command to rejoin the nodes, but that didn’t help: asinfo -h node.dc02.domain -v 'tip:host=node.dc05.domain;port=3002'

So the question is: why did this happen, and how can it be fixed? Also, an off-topic question: is there a way to turn rebalancing off?


#2

How many network interfaces does each host have?

In version 3.9.1 (and below), the following two settings help assign an interface and bind an IP for heartbeats:

http://www.aerospike.com/docs/reference/configuration#network-interface-name

http://www.aerospike.com/docs/reference/configuration#interface-address

You may need to specify them in aerospike.conf to ensure the right interface is used for cluster formation.
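A sketch of where they go — assuming the heartbeat NIC is `eth0` and its IP is the `access-address` from the question; substitute your own interface and address:

```
service {
    # ... existing service settings ...
    network-interface-name eth0       # NIC whose IP identifies this node
}

network {
    heartbeat {
        mode mesh
        port 3002
        interface-address 10.x.x.30   # bind heartbeats to this NIC's address
        # ... mesh-seed-address-port entries ...
    }
}
```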

These have been deprecated in version 3.10.x but should still work in your version.

You could temporarily stop migrations (re-balancing) by setting migrate-threads to zero, but that is not recommended.
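A sketch of doing that dynamically with `asinfo` (hostnames are the ones from the question; repeat for every node, and restore a non-zero value afterwards):

```shell
# Pause migrations by setting migrate-threads to 0 on each node.
cmd='set-config:context=service;migrate-threads=0'
for node in node.dc01.domain node.dc02.domain; do   # ...list all nodes here
    asinfo -h "$node" -v "$cmd" || true
done
```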


#3

Thank you for the answer. The problem was incorrect hostnames in mesh-seed-address-port: only one node was specified correctly, and when it went down the cluster broke apart. I hope that was the cause. Now the cluster works well.
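For anyone hitting the same issue, a quick sketch to catch bad seed hostnames up front: check that every `mesh-seed-address-port` entry in the config actually resolves (the config path is the usual default; adjust for your setup):

```shell
conf=${CONF:-/etc/aerospike/aerospike.conf}
# Pull the hostname field out of each mesh-seed-address-port line...
awk '$1 == "mesh-seed-address-port" { print $2 }' "$conf" 2>/dev/null |
while read -r host; do
    # ...and verify that it resolves.
    if getent hosts "$host" > /dev/null; then
        echo "OK   $host"
    else
        echo "BAD  $host does not resolve"
    fi
done
```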