Cluster (Error: (1) unstable-cluster)

Hello!

We have 3 servers with replication factor 3. One of the 3 nodes went down, so we went to that node and restarted it. Now we get an error on requests to the servers and in aql: Error: (1) unstable-cluster

So, questions:

  1. Why do we get this error?
  2. We want the servers to stay online even if one node goes down and then comes back into the cluster. What should we do to achieve that?

Are you using Enterprise Edition with Strong Consistency mode? What server version?

We are using Community Edition, with data saved to SSD.

4.5.1.5, 4.5.1.5, 4.5.1.6

namespace name {
    replication-factor 3
    memory-size 4G
    default-ttl 0
    storage-engine device {
        file /path/backup.dat
        filesize 25G
        write-block-size 1M
        data-in-memory false
    }
}

My best guess is that you have a bad network issue where a node is going in and out of the cluster due to a heartbeat channel problem. Are you using mesh heartbeat or multicast? You have full data on each node - 3 nodes, replication factor 3 - and in AP mode the effective replication factor automatically drops to 2 when one node goes down. Does the problem happen when you bring the node back in? If the node just drops out, you should be fine unless there is a heartbeat issue between the two remaining nodes.
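One quick way to confirm that is to watch cluster_size and cluster_integrity on every node while the problem is happening - a rough sketch, assuming the aerospike-tools package (asinfo) is installed on the nodes:

$ asinfo -v statistics -l | grep -E 'cluster_size|cluster_integrity'

On a healthy 3-node cluster each node should keep reporting cluster_size 3 and cluster_integrity true; values that bounce around point at the heartbeat channel.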

node1

heartbeat {
    mode mesh
    address ip_node1
    port 3002
    mesh-seed-address-port ip_node2 3002
    mesh-seed-address-port ip_node3 3002

    interval 150
    timeout 20
}

node2

heartbeat {
    mode mesh
    address ip_node2
    port 3002
    mesh-seed-address-port ip_node1 3002
    mesh-seed-address-port ip_node3 3002

    interval 150
    timeout 20
}

node3

heartbeat {
    mode mesh
    address ip_node3
    port 3002
    mesh-seed-address-port ip_node1 3002
    mesh-seed-address-port ip_node2 3002

    interval 150
    timeout 20
}

When a node goes down, I have problems for 1-2 seconds. When the node comes back, I have problems for 1-2 hours - another time it was 5-6 hours - and then everything is okay again. The 3 servers are in different cities.

What's the problem? How do I solve it?

My suggestion would be to scan through the log files and see if there is a network issue that stands out. The fact that your servers are in three different cities may be the root cause.

I believe the issue is the “FAIL_ON_CLUSTER_CHANGE” policy for scans and/or secondary index queries.

See KB for details.

The current implementation of scans and queries cannot ensure that all available data matching the request will be returned, nor can it ensure that the returned data is returned only once while migrations are ongoing. So the 'FAIL_ON_CLUSTER_CHANGE' policy prevents these anomalies by not allowing these APIs to run during cluster changes. You can change the policy so that it does a best effort during these events.
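For example, with the Java client of that era it would look roughly like this (a sketch only - I'm assuming a client version that still exposes ScanPolicy.failOnClusterChange, and the host, namespace and set names are placeholders):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.policy.ScanPolicy;

public class BestEffortScan {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("node1_ip", 3000);

        ScanPolicy policy = new ScanPolicy();
        // Let the scan run best-effort during cluster changes instead of
        // failing with "unstable-cluster"; results may be incomplete or
        // duplicated while migrations are in progress.
        policy.failOnClusterChange = false;

        client.scanAll(policy, "ns", "set",
            (key, record) -> System.out.println(key + " -> " + record));

        client.close();
    }
}

As far as I recall, the other clients expose a similar flag on their scan and query policies.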

Replacement server APIs are currently being planned for scans and queries that will be capable of ensuring non-duplicated delivery of all available data during disruptions/migrations.

I believe the issue is the “FAIL_ON_CLUSTER_CHANGE” policy for scans and/or secondary index queries.

So why do I have problems with AQL if it is a problem with the API? I have looked through the logs - I get a lot of errors like:

WARNING (socket): (socket.c:891) Timeout while connecting
May 22 2019 00:55:57 GMT: WARNING (socket): (socket.c:959) Error while connecting socket to nodeip 3002

May 22 2019 00:55:57 GMT: WARNING (socket): (socket.c:959) Error while connecting socket to ip:3002
May 22 2019 00:55:57 GMT: WARNING (hb): (hb.c:4882) could not create heartbeat connection to node {ip:3002}
May 22 2019 00:56:02 GMT: WARNING (socket): (socket.c:959) (repeated:27) Error while connecting socket to ip:3002
May 22 2019 00:56:02 GMT: WARNING (socket): (socket.c:959) (repeated:26) Error while connecting socket to ip:3002

May 22 2019 00:56:02 GMT: WARNING (hb): (hb.c:4882) (repeated:27) could not create heartbeat connection to node {ip:3002}
May 22 2019 00:56:02 GMT: WARNING (hb): (hb.c:4882) (repeated:26) could not create heartbeat connection to node {ip:3002}

I'm using iptables, but port 3002 is open for UDP and TCP connections from both other nodes on each node. Also, there are no iptables drops for these servers in the logs.

AQL is just an application that uses the C Client API underneath. What commands are you issuing with AQL that give the error? (Separately, it looks like you have some network issue on your heartbeat connection.)

Commands like:

show sets
select * from ns.set
select id from ns.set where pk='123'

select * will launch a scan.

select id from ns.set where pk='123' - that one specifically - does that give you an error?
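(For context: that pk lookup is a single-record read in the client, which the scan/query FAIL_ON_CLUSTER_CHANGE policy does not apply to - a minimal Java sketch, with placeholder names:)

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;

public class PkRead {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("node1_ip", 3000);

        // Single-record get by primary key. The client routes it to the node
        // that currently owns the partition, so it normally keeps working
        // while the cluster is re-forming or migrating.
        Record record = client.get(null, new Key("ns", "set", "123"));
        System.out.println(record);

        client.close();
    }
}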

Anyway, the discussion regarding AQL is somewhat tangential. You really need to troubleshoot your network. With three cluster nodes in three different cities, if your heartbeat connections keep dropping out for longer than about 3 seconds with your settings, you will end up in the situation you are seeing.

It also gives an error.

What should I do? A lot of people have 2-3 servers in different locations, so they must be solving these problems somehow. What should we do? On the main Aerospike page there is information saying that Aerospike keeps working well even with 2 of 3 nodes down. But in fact that is not true, is it?

My best guess is that you have a flaky connection where a node keeps joining and leaving the cluster. This should also show up in the logs. grep for “CLUSTERING” and see how often that is happening.

$ grep clustering /var/log/aerospike/aerospike.log

also, grep for exchange
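For example, mirroring the command above:

$ grep exchange /var/log/aerospike/aerospike.log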

No, this isn't a common deployment pattern. Typically each data-center hosts a single cluster and XDR is used to asynchronously propagate updates to the other data-centers. A deployment pattern similar to yours is using multiple 'Availability Zones' within a DC, but in that case the latency and networking between the Availability Zones are typically very good.

When a node unexpectedly leaves the cluster (i.e. not quiesced first), it takes about (heartbeat.timeout * heartbeat.interval) milliseconds for the other nodes to discover this - during that time, writes targeting the departed node will time out and reads (depending on policy) will be retried on one of the remaining nodes. In EE you could use the client's rack-aware features to ensure clients in each city read only from their local server.
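With the settings posted above that detection window is 20 × 150 ms ≈ 3 seconds. Purely as an illustration of how those two knobs interact (not a tuning recommendation - the numbers below are made up), a larger timeout widens the window at the cost of slower failure detection:

heartbeat {
    mode mesh
    # address / port / mesh-seed-address-port entries as in your config
    interval 150    # ms between heartbeats
    timeout 40      # missed intervals before a node is declared gone: 40 x 150 ms = 6 s
}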

select * will launch a scan.

select id from ns.set where pk=‘123’ – that one specifically - does that give you an error?

Checked again: select * is not available (unstable-cluster), but

select id from ns.set where pk='123' is working.

grep clustering /var/log/aerospike/aerospike.log

May 23 2019 03:34:06 GMT: DETAIL (clustering): (clustering.c:2620) Fault:node arrived event_ts:95036856
May 23 2019 03:34:06 GMT: DETAIL (clustering): (clustering.c:2620) Fault:node departed event_ts:0
May 23 2019 03:34:06 GMT: DETAIL (clustering): (clustering.c:2620) Fault:principal departed event_ts:0
May 23 2019 03:34:06 GMT: DETAIL (clustering): (clustering.c:2620) Fault:peer adjacency changed event_ts:0
May 23 2019 03:34:06 GMT: DETAIL (clustering): (clustering.c:2620) Fault:join request accepted event_ts:0
May 23 2019 03:34:06 GMT: DETAIL (clustering): (clustering.c:2620) Fault:merge candidate seen event_ts:0
May 23 2019 03:34:06 GMT: DETAIL (clustering): (clustering.c:2620) Fault:member orphaned event_ts:0
May 23 2019 03:34:06 GMT: DETAIL (clustering): (clustering.c:2624) Last Quantum interval:95036695

I also have errors with these messages:

May 23 2019 03:34:09 GMT: DETAIL (hlc): (hlc.c:224) changed HLC value from 102143290562119180 to 102143290562119181
May 23 2019 03:34:09 GMT: DETAIL (clustering): (clustering.c:6021) re-sending cluster join request to node 3
May 23 2019 03:34:09 GMT: DEBUG (clustering): (clustering.c:6023) re-sent cluster join request to 3
May 23 2019 03:34:09 GMT: DETAIL (clustering): (clustering.c:6267) join request to principal 3 pending - not attempting new join request
May 23 2019 03:34:09 GMT: DETAIL (hlc): (hlc.c:303) message received from node 3 with HLC 102143290564544016 - changed HLC value from 102143290562119181$
May 23 2019 03:34:09 GMT: DEBUG (clustering): (clustering.c:4960) received paxos prepare from node 3
May 23 2019 03:34:09 GMT: DETAIL (hlc): (hlc.c:224) changed HLC value from 102143290564609553 to 102143290564609554
May 23 2019 03:34:09 GMT: DEBUG (clustering): (clustering.c:4843) paxos promise message sent to node 3 with proposal id (3:102143290564544015)
May 23 2019 03:34:09 GMT: DEBUG (socket): (socket.c:1165) Error while receiving on FD 55: 11 (Resource temporarily unavailable)
May 23 2019 03:34:09 GMT: DETAIL (hlc): (hlc.c:303) message received from node 3 with HLC 102143290564609556 - changed HLC value from 102143290564609554$
May 23 2019 03:34:09 GMT: DEBUG (clustering): (clustering.c:5048) received paxos accept from node 3
May 23 2019 03:34:09 GMT: DETAIL (hlc): (hlc.c:224) changed HLC value from 102143290564675093 to 102143290564675094
May 23 2019 03:34:09 GMT: DEBUG (clustering): (clustering.c:4885) paxos accepted message sent to node 3 with proposal id (3:102143290564544015)
May 23 2019 03:34:09 GMT: DEBUG (socket): (socket.c:1165) Error while receiving on FD 43: 11 (Resource temporarily unavailable)

So what should I do to resolve my problem? Is there any way?

So select * - which scans all records - is hitting the unstable cluster. We need to figure out why your cluster is unstable. Can you share ping latencies between your nodes?

Ping? Okay, let's see:

Ping from Node1 to Node2

PING node2_ip (node2_ip) 56(84) bytes of data.
64 bytes from node2_ip: icmp_seq=1 ttl=61 time=8.98 ms
64 bytes from node2_ip: icmp_seq=2 ttl=61 time=9.01 ms
64 bytes from node2_ip: icmp_seq=3 ttl=61 time=9.07 ms
64 bytes from node2_ip: icmp_seq=4 ttl=61 time=9.06 ms

Ping from Node1 to Node3
PING node3_ip (node3_ip) 56(84) bytes of data.
64 bytes from node3_ip: icmp_seq=1 ttl=61 time=9.04 ms
64 bytes from node3_ip: icmp_seq=2 ttl=61 time=9.09 ms
64 bytes from node3_ip: icmp_seq=3 ttl=61 time=8.99 ms
64 bytes from node3_ip: icmp_seq=4 ttl=61 time=9.17 ms

Ping from Node2 to Node1
PING node1_ip (node1_ip) 56(84) bytes of data.
64 bytes from node1_ip: icmp_seq=1 ttl=60 time=8.97 ms
64 bytes from node1_ip: icmp_seq=2 ttl=60 time=9.02 ms
64 bytes from node1_ip: icmp_seq=3 ttl=60 time=9.05 ms
64 bytes from node1_ip: icmp_seq=4 ttl=60 time=9.10 ms

Ping from Node2 to Node3
PING node3_ip (node3_ip) 56(84) bytes of data.
64 bytes from node3_ip: icmp_seq=1 ttl=64 time=0.245 ms
64 bytes from node3_ip: icmp_seq=2 ttl=64 time=0.227 ms
64 bytes from node3_ip: icmp_seq=3 ttl=64 time=0.244 ms
64 bytes from node3_ip: icmp_seq=4 ttl=64 time=0.254 ms

Ping from Node3 to Node1
PING node1_ip (node1_ip) 56(84) bytes of data.
64 bytes from node1_ip: icmp_seq=1 ttl=60 time=9.04 ms
64 bytes from node1_ip: icmp_seq=2 ttl=60 time=9.06 ms
64 bytes from node1_ip: icmp_seq=3 ttl=60 time=9.06 ms
64 bytes from node1_ip: icmp_seq=4 ttl=60 time=9.04 ms

Ping from Node3 to Node2
PING node2_ip (node2_ip) 56(84) bytes of data.
64 bytes from node2_ip: icmp_seq=1 ttl=64 time=0.259 ms
64 bytes from node2_ip: icmp_seq=2 ttl=64 time=0.249 ms
64 bytes from node2_ip: icmp_seq=3 ttl=64 time=0.229 ms
64 bytes from node2_ip: icmp_seq=4 ttl=64 time=0.249 ms

Thanks - I am running some tests. Will get back.

Ping between nodes is low, so I think we shouldn't need XDR or other enterprise functionality. I thought the basic functionality offered a simple way to keep the DB online without problems! Is it a problem with FAIL_ON_CLUSTER_CHANGE, or not?