Aerospike's strange behaviour when a link between nodes goes down

I am seeing strange behaviour when the link between two nodes goes down.

Network

In a cluster of 3 nodes with a replication factor of 5, all nodes were interconnected and each held 198066 records (each node was master for 1/3 of the records and replica for the other 2/3).

Node1 -- Node2
  |      /
  |    /
Node3

Issue

When the link between Node1 and Node3 went down:

  • Host1 reported cluster size 3 (it can only reach Host2)
  • Host2 reported cluster size 3 (it can reach both)
  • Host3 reported cluster size 2 (it can only reach Host2)

Fetching all records using a scan with the fail_on_cluster_change = false policy, or using a query with a where bin=1 clause (a counting sketch follows the results below):

  • Before the link went down → Host1: 198066, Host2: 198066, Host3: 198066
  • Straight after the link went down → Host1: 132231, Host2: 198066, Host3: 134178
  • A second after the link went down, and from then on → Host1: 164552, Host2: 198066, Host3: 198066
  • Straight after the link came back → Host1: 198066, Host2: 198066, Host3: 198066

Fetching all records using a scan with the fail_on_cluster_change = true policy:

  • Straight after the link went down → Host1: Error 7, Host2: 198066, Host3: 198066
  • A second after the link went down, and from then on → Host1: 164552, Host2: 198066, Host3: 198066
  • For a few minutes after the link was restored → Host1: Error 7, Host2: Error 7, Host3: Error 7
  • After this → Host1: 198066, Host2: 198066, Host3: 198066
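
For reference, here is a minimal sketch of how such a per-node count can be gathered with the Aerospike Java client. The host address, namespace, and class name are placeholders, and the failOnClusterChange field is assumed to be available on ScanPolicy, as it was in clients contemporary with server 3.x (newer clients have removed it). With the flag set to true, the scan is rejected with error 7 (CLUSTER_KEY_MISMATCH) if the cluster changes before or during the scan; with it set to false, the scan returns whatever records the node can reach.

import java.util.concurrent.atomic.AtomicLong;

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.policy.ScanPolicy;

public class CountVisibleRecords {
    public static void main(String[] args) {
        // Placeholder host; point it at the node whose view you want to test.
        try (AerospikeClient client = new AerospikeClient("192.168.1.1", 3000)) {
            ScanPolicy scanPolicy = new ScanPolicy();
            // true  -> abort with error 7 if the cluster changes mid-scan
            // false -> return whatever records this node can currently reach
            scanPolicy.failOnClusterChange = true;

            AtomicLong count = new AtomicLong();
            // Scan the whole "test" namespace (null set name = all sets)
            // and count every record delivered to the callback.
            client.scanAll(scanPolicy, "test", null,
                    (key, record) -> count.incrementAndGet());

            System.out.println("Records visible from this node: " + count.get());
        }
    }
}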

Questions

  • When the link goes down, why does Host1 only ever have access to 164552 records, while Host3 has access to all of them?
  • Why does Host1 report cluster size 3, while Host3 reports cluster size 2?
  • Why, when the link is down, does a scan on Host1 (with the fail_on_cluster_change = true policy) always fail, while on Host3 it succeeds?

Environment

All nodes are running Aerospike Server Community Edition 3.11.0.1 (the same results occur with 3.10.0.1.1).

aerospike.conf:

service {
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    service-threads 4
    transaction-queues 4
    transaction-threads-per-queue 4
    proto-fd-max 1024
}

logging {
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}

network {
    service {
        address any
        access-address 192.168.1.1
        alternate-access-address 192.168.2.1
        port 3000
    }

    heartbeat {
        mode mesh
        port 3002 # Heartbeat port for this node.

        # List one or more other nodes, one ip-address & port per line:
        mesh-seed-address-port 192.168.2.1 3002
        mesh-seed-address-port 192.168.1.2 3002
        mesh-seed-address-port 192.168.2.2 3002
    }

    fabric {
        address any
        port 3001
    }

    info {
        port 3003
    }
}

namespace test {
    replication-factor 5
    memory-size 4G
    default-ttl 30d # 30 days, use 0 to never expire/evict.

    storage-engine device {
        file /opt/aerospike.dat
        filesize 4G
        data-in-memory true # Store data in memory in addition to file.
    }
}
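
As an aside, given the access-address / alternate-access-address split above, a client sitting on the 192.168.2.x network would need to opt into the alternate addresses. Below is a minimal sketch with the Java client, assuming the seed host from the config above; the useServicesAlternate flag tells cluster tending to use alternate-access-address.

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Host;
import com.aerospike.client.policy.ClientPolicy;

public class ConnectViaAlternateAddress {
    public static void main(String[] args) {
        ClientPolicy clientPolicy = new ClientPolicy();
        // Discover peers via alternate-access-address (192.168.2.x)
        // instead of access-address (192.168.1.x).
        clientPolicy.useServicesAlternate = true;

        // Seed host taken from the config above; 3000 is the service port.
        try (AerospikeClient client = new AerospikeClient(clientPolicy, new Host("192.168.2.1", 3000))) {
            System.out.println("Connected: " + client.isConnected());
        }
    }
}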

A couple of questions. 1 - what is your actual topology? I am guessing something like this:

{router}
  \    \------------\
{router1}          {router2}
   \               \       \---------\
{node1}         {node2}            {node3}

When you show a ring topology, how exactly are you linking the nodes, and how are you bringing a link down?

2 - why replication factor of 5 with only 3 nodes in the cluster?

1 - what is your actual topology? I am guessing something like this:

There are no routers in between; it's a test network and the nodes are attached directly. Technically it looks flat, i.e. Node2 just forwards requests between Node1 and Node3: Node1 <-> Node2 <-> Node3. I blocked that forwarding to see what it would do.

2 - why replication factor of 5 with only 3 nodes in the cluster?

This is part of a reliability check to see how Aerospike behaves in different situations. Interestingly, in this case Node1 does not have access to roughly 1/5th of the records.

http://www.aerospike.com/docs/architecture/data-distribution.html says you cannot have a cluster with a replication factor greater than the number of nodes. If you started a 3-node cluster with replication-factor 3, each node would hold all the data - an exact copy. At the very least you should start your test with a valid configuration, and once the cluster is formed, bring one node down. The idea behind the replication factor is to not lose master data should one node be lost. Most folks set it to 2, some set it to 3. Every once in a while, folks using Aerospike primarily as a read-only database set replication factor == number of nodes and use each node as a very fast lookup table server.

So ( replication factor > nodes ) is known to cause issues?

When I read it, I anticipated behaviour similar to Cassandra's.

  1. So what happens in a cluster of 5 with a replication factor of 5 when 1-2 nodes or a connection disappears? Is that not supported, hence the strange behaviour?
  2. Another use case for this is rolling out additional nodes with no downtime while having all nodes replicate everything. Is that an unsupported case? Are there any workarounds for such a use case?

Thanks -vytas

You cannot expect to have 5 copies of data on a 3 node cluster.

If replication-factor is set greater than the number of nodes available, the server will treat replication-factor == number of nodes (unless paxos-single-replica-limit kicks in). Also, the replication-factor of a namespace is static and must be unanimous across the cluster, i.e. it requires a cluster-wide restart to change and cannot be changed dynamically on a running cluster. Hope that answers both 1 and 2: RF 5 with 5 nodes and 1 goes down, RF is treated as 4; RF 5 with 3 nodes, RF is treated as 3.

It is hard to understand exactly what you are doing to break the cluster, step by step. I trust you are using Community Edition for your testing. You may be seeing migrations when the cluster changes; wait for migrations to complete and then look at the records. You can monitor the state of migrations in AMC. Try with Enterprise Edition - you will get the Rapid Rebalance feature.

For this mode of operation (replication factor > number of nodes, i.e. all data on all nodes for fast lookup), the Aerospike Java client can read from replicas at random instead of the default of always going to the master.

See Replica - aerospike-client 7.2.0 javadoc (Look at RANDOM)
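
A minimal sketch of what that looks like with the Java client (the host, namespace, set, and key below are placeholders, not taken from the original post): setting the read policy's replica selection to Replica.RANDOM spreads single-record reads across all nodes holding the partition instead of always routing them to the master.

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.Policy;
import com.aerospike.client.policy.Replica;

public class RandomReplicaRead {
    public static void main(String[] args) {
        // Placeholder seed host; any node in the cluster will do.
        try (AerospikeClient client = new AerospikeClient("192.168.1.1", 3000)) {
            Policy readPolicy = new Policy();
            // Read from a randomly chosen node that owns the partition
            // (master or any replica) rather than always from the master.
            readPolicy.replica = Replica.RANDOM;

            // Placeholder key in the "test" namespace from the config above.
            Key key = new Key("test", "demo", "user-1");
            Record record = client.get(readPolicy, key);
            System.out.println(record);
        }
    }
}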

You cannot expect to have 5 copies of data on a 3 node cluster.

If replication-factor is set greater than the number of nodes available, the server will treat replication-factor == number of nodes (unless paxos-single-replica-limit kicks in). Also, the replication-factor of a namespace is static and must be unanimous across the cluster, i.e. it requires a cluster-wide restart to change and cannot be changed dynamically on a running cluster. Hope that answers both 1 and 2: RF 5 with 5 nodes and 1 goes down, RF is treated as 4; RF 5 with 3 nodes, RF is treated as 3.

Great, so this is what I expected and what I want: 5 nodes, 2 go down, the replication factor drops to 3; the 2 nodes come back, and the replication factor goes back to 5.

It is hard to understand exactly what you are doing to break the cluster, step by step. I trust you are using Community Edition for your testing. You may be seeing migrations when the cluster changes; wait for migrations to complete and then look at the records. You can monitor the state of migrations in AMC. Try with Enterprise Edition - you will get the Rapid Rebalance feature.

All nodes are running Aerospike Server Community Edition 3.11.0.1 (the same results occur with 3.10.0.1.1). I am simulating an edge case where some failure occurs and some nodes can't reach others, while other nodes can reach everything. I am fine if it is not supported; I am just trying to understand how it behaves in different scenarios.

Furthermore, I have tested 3 nodes with the replication factor set to 3, and it behaves exactly as described above (1 node has less data available and can't write, while the other 2 nodes have the correct amount of data and can write). I waited a couple of hours for migrations to happen, but it is still broken. The broken node gets into the following loop:

Jan 16 2017 10:21:15 GMT: WARNING (cf:socket): (socket.c:760) Error while connecting FD 69: 111 (Connection refused)
Jan 16 2017 10:21:15 GMT: WARNING (cf:socket): (socket.c:819) Error while connecting socket to 192.168.2.2:3002
Jan 16 2017 10:21:16 GMT: WARNING (cf:socket): (socket.c:760) Error while connecting FD 69: 111 (Connection refused)
Jan 16 2017 10:21:16 GMT: WARNING (cf:socket): (socket.c:819) Error while connecting socket to 192.168.2.2:3002
Jan 16 2017 10:21:16 GMT: INFO (hb): (hb.c:2638) Marking node removal for paxos recovery: bb9d12c7cbae290
Jan 16 2017 10:21:16 GMT: INFO (paxos): (paxos.c:2614) Cluster Integrity Check: Detected succession list discrepancy between node bb9d12c7cbae290 and self bb9755a597ac40c
Jan 16 2017 10:21:16 GMT: INFO (paxos): (paxos.c:268) Paxos List [bb9d12c7cbae290,bb9a0c114ede000,bb9755a597ac40c]
Jan 16 2017 10:21:16 GMT: INFO (paxos): (paxos.c:268) Node List []
Jan 16 2017 10:21:16 GMT: INFO (paxos): (paxos.c:2614) Cluster Integrity Check: Detected succession list discrepancy between node bb9a0c114ede000 and self bb9755a597ac40c
Jan 16 2017 10:21:16 GMT: INFO (paxos): (paxos.c:268) Paxos List [bb9d12c7cbae290,bb9a0c114ede000,bb9755a597ac40c]
Jan 16 2017 10:21:16 GMT: INFO (paxos): (paxos.c:268) Node List [bb9d12c7cbae290,bb9a0c114ede000]
Jan 16 2017 10:21:16 GMT: INFO (paxos): (paxos.c:2477) Corrective changes: 1. Integrity fault: true
Jan 16 2017 10:21:16 GMT: INFO (paxos): (paxos.c:2520) Marking node add for paxos recovery: bb9a0c114ede000
Jan 16 2017 10:21:16 GMT: INFO (paxos): (paxos.c:2520) Marking node add for paxos recovery: bb9755a597ac40c
Jan 16 2017 10:21:16 GMT: INFO (paxos): (paxos.c:2543) Skipping paxos recovery: bb9a0c114ede000 will handle the recovery
Jan 16 2017 10:21:17 GMT: WARNING (cf:socket): (socket.c:750) Timeout while connecting FD 69
Jan 16 2017 10:21:17 GMT: WARNING (cf:socket): (socket.c:819) Error while connecting socket to 192.168.2.2:3002
Jan 16 2017 10:21:17 GMT: INFO (paxos): (paxos.c:2614) Cluster Integrity Check: Detected succession list discrepancy between node bb9d12c7cbae290 and self bb9755a597ac40c
Jan 16 2017 10:21:17 GMT: INFO (paxos): (paxos.c:268) Paxos List [bb9d12c7cbae290,bb9a0c114ede000,bb9755a597ac40c]
Jan 16 2017 10:21:17 GMT: INFO (paxos): (paxos.c:268) Node List []
Jan 16 2017 10:21:17 GMT: INFO (paxos): (paxos.c:2614) Cluster Integrity Check: Detected succession list discrepancy between node bb9a0c114ede000 and self bb9755a597ac40c
Jan 16 2017 10:21:17 GMT: INFO (paxos): (paxos.c:268) Paxos List [bb9d12c7cbae290,bb9a0c114ede000,bb9755a597ac40c]
Jan 16 2017 10:21:17 GMT: INFO (paxos): (paxos.c:268) Node List [bb9d12c7cbae290,bb9a0c114ede000]
Jan 16 2017 10:21:17 GMT: WARNING (cf:socket): (socket.c:750) Timeout while connecting FD 69
Jan 16 2017 10:21:17 GMT: WARNING (cf:socket): (socket.c:819) Error while connecting socket to 192.168.2.2:3002
Jan 16 2017 10:21:18 GMT: WARNING (cf:socket): (socket.c:750) Timeout while connecting FD 69
Jan 16 2017 10:21:18 GMT: WARNING (cf:socket): (socket.c:819) Error while connecting socket to 192.168.2.2:3002
Jan 16 2017 10:21:18 GMT: INFO (hb): (hb.c:2638) Marking node removal for paxos recovery: bb9d12c7cbae290
Jan 16 2017 10:21:18 GMT: INFO (paxos): (paxos.c:2614) Cluster Integrity Check: Detected succession list discrepancy between node bb9d12c7cbae290 and self bb9755a597ac40c
Jan 16 2017 10:21:18 GMT: INFO (paxos): (paxos.c:268) Paxos List [bb9d12c7cbae290,bb9a0c114ede000,bb9755a597ac40c]
Jan 16 2017 10:21:18 GMT: INFO (paxos): (paxos.c:268) Node List []
Jan 16 2017 10:21:18 GMT: INFO (paxos): (paxos.c:2614) Cluster Integrity Check: Detected succession list discrepancy between node bb9a0c114ede000 and self bb9755a597ac40c
Jan 16 2017 10:21:18 GMT: INFO (paxos): (paxos.c:268) Paxos List [bb9d12c7cbae290,bb9a0c114ede000,bb9755a597ac40c]
Jan 16 2017 10:21:18 GMT: INFO (paxos): (paxos.c:268) Node List [bb9d12c7cbae290,bb9a0c114ede000]
Jan 16 2017 10:21:18 GMT: INFO (paxos): (paxos.c:2477) Corrective changes: 1. Integrity fault: true
Jan 16 2017 10:21:18 GMT: INFO (paxos): (paxos.c:2520) Marking node add for paxos recovery: bb9a0c114ede000
Jan 16 2017 10:21:18 GMT: INFO (paxos): (paxos.c:2520) Marking node add for paxos recovery: bb9755a597ac40c
Jan 16 2017 10:21:18 GMT: INFO (paxos): (paxos.c:2543) Skipping paxos recovery: bb9a0c114ede000 will handle the recovery
Jan 16 2017 10:21:18 GMT: WARNING (cf:socket): (socket.c:760) Error while connecting FD 69: 111 (Connection refused)
Jan 16 2017 10:21:18 GMT: WARNING (cf:socket): (socket.c:819) Error while connecting socket to 192.168.2.2:3002
Jan 16 2017 10:21:19 GMT: WARNING (cf:socket): (socket.c:760) Error while connecting FD 69: 111 (Connection refused)
Jan 16 2017 10:21:19 GMT: WARNING (cf:socket): (socket.c:819) Error while connecting socket to 192.168.2.2:3002

As mentioned above, I appreciate that it's an edge case, and I am fine with the behaviour if it is an unsupported failure mode.

Thanks, -Vytas

My guess is you are causing a very unusual scenario for an operating cluster, where only port 3002 is blocked after having been open earlier. The server tries to form a cluster and keep it together based on various signals, heartbeat being one of them. If a node sees a successful write to a replica where it is the master, it will conclude that the far node is alive and that perhaps the heartbeat is only temporarily dead. The server makes its very best effort, using all intra-node communication markers, before declaring a node dead, because that initiates a rebalance, which in a typical use case of RF = 2 and a multi-node (>2) cluster is not to be indulged in unless absolutely essential.

My guess is you are constantly forcing the cluster to break itself up and re-form, because it is unable to continue: in the config file you say you will provide heartbeat on port 3002, and then you block that communication. It knows through other protocols that there is one more node and wants to get heartbeat going, but the connection is blocked.

For the depths you are exploring, I think you may like reading this paper: http://www.vldb.org/pvldb/vol9/p1389-srinivasan.pdf (See section 2.1)

For RF==num nodes, no migrations will occur when you lose a node. But when you add a node, migrations will be triggered.