Asadm health causes node dropout and prevents ping

Hi,

I’m experiencing very strange behaviour. A simple cluster (3 nodes, very small default bar and test namespaces) runs fine on GCE, but when I run asadm and the health command on a node, that node drops out of the cluster and I can no longer ping between it and the other nodes. A completely different VM on the same network can ping both the outcast node and the nodes in the remaining cluster. Restarting Aerospike has no effect; restarting the VMs seems to be the only way to reset it.

Although I can ssh into any node from any of the other nodes (before running health), running the health command throws these errors:

Aerospike Interactive Shell, version 0.1.11

Found 3 nodes
Online:  10.132.0.3:3000, 10.132.0.1:3000, 10.132.0.2:3000
Admin> health -v
INFO: Collecting 3 collectinfo snapshot. Use -n to set number of snapshots.
ERROR: Couldn't make SSH login to remote server 10.132.0.3:22, please provide correct credentials.
ERROR: Couldn't make SSH login to remote server 10.132.0.2:22, please provide correct credentials.
INFO: Snapshot 1
ERROR: Couldn't make SSH login to remote server 10.132.0.2:22, please provide correct credentials.
ERROR: Couldn't make SSH login to remote server 10.132.0.3:22, please provide correct credentials.

I found this while investigating a larger issue of node dropout and painfully slow pending migrates when we have the larger namespaces attached; I’m hoping it’s related (perhaps a node drops out and migrates halt).

All nodes are identical on GCE: 36 GB RAM, 4 CPUs, 1 TB disk, Aerospike C-3.14.1.3, same network/region, identical aerospike.conf (mesh heartbeat).
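The heartbeat section of the aerospike.conf is a plain mesh setup, roughly the sketch below (reconstructed from memory; the seed addresses are the node IPs above, and the interval/timeout shown are just the defaults rather than necessarily our exact values):

network {
    heartbeat {
        mode mesh
        port 3002
        mesh-seed-address-port 10.132.0.1 3002
        mesh-seed-address-port 10.132.0.2 3002
        mesh-seed-address-port 10.132.0.3 3002
        interval 150
        timeout 10
    }
}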

I’m totally stumped so any ideas or pointers would be very much appreciated - thanks!

Are you saying asadm causes the asd process to crash? Is there a stack trace in the log, or can you share the last few log lines as the node goes down? What shows up in the log when you try to restart Aerospike? Have you tried running some simple ‘asinfo -v’ commands against each node individually?
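For example, something along these lines against each node (using the node addresses from your output; adjust as needed):

asinfo -h 10.132.0.1 -v status
asinfo -h 10.132.0.1 -v service
asinfo -h 10.132.0.1 -v namespaces

If those still respond on every node while the cluster is split, that would at least confirm asd itself is up and serving info requests.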

Hi,

No, sorry, asd didn’t seem to crash at all; it just seemed to sever all network connectivity with the other nodes in the cluster (I know it sounds very odd).

Unfortunately we’re very close to a project deadline, so I’ve had to reinstate the large namespaces and the system is still coming back up. Since the reboot, journalctl no longer has the previous logs, so I can only go by memory:

Beforehand:

CLUSTER-SIZE 3, all nodes have same config & namespaces

After: Node 1:

CLUSTER-SIZE 1
Error while connecting socket to 10.132.0.2
could not create heartbeat connection to node 10.132.0.2
Error while connecting socket to 10.132.0.3
could not create heartbeat connection to node 10.132.0.3

Shows the test & bar namespaces on Node 1 only

Nodes 2 & 3:

CLUSTER-SIZE 2
Error while connecting socket to 10.132.0.1
could not create heartbeat connection to node 10.132.0.1

Shows test & bar namespaces distributed over Nodes 2 & 3

The nodes in the severed clusters keep repeating their connection errors every 10s after this.

I’ll try to recreate the problem today with new VMs from the same image and update this post with actual logs.
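When I do, I’ll also check at the OS level whether the heartbeat port is even reachable from the severed node, with something like the following (address taken from the original cluster, so just illustrative):

nc -vz 10.132.0.2 3002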

Thanks

Ok, here you go:

BEFORE:

Sep 29 2017 06:28:48 GMT: INFO (info): (ticker.c:167) NODE-ID bb90e00840a0142 CLUSTER-SIZE 3
Sep 29 2017 06:28:47 GMT: INFO (info): (ticker.c:167) NODE-ID bb91100840a0142 CLUSTER-SIZE 3
Sep 29 2017 06:28:48 GMT: INFO (info): (ticker.c:167) NODE-ID bb91200840a0142 CLUSTER-SIZE 3

root@astest1:~# asinfo -v namespaces
test;bar

root@astest2:~# asinfo -v namespaces
test;bar

ASADM HEALTH:

root@astest1:~# asadm
Aerospike Interactive Shell, version 0.1.11

Found 3 nodes
Online:  10.132.0.14:3000, 10.132.0.17:3000, 10.132.0.18:3000

Admin> health -v
INFO: Collecting 3 collectinfo snapshot. Use -n to set number of snapshots.
ERROR: Couldn't make SSH login to remote server 10.132.0.18:22, please provide correct credentials.
ERROR: Couldn't make SSH login to remote server 10.132.0.17:22, please provide correct credentials.
INFO: Snapshot 1
ERROR: Couldn't make SSH login to remote server 10.132.0.17:22, please provide correct credentials.
ERROR: Couldn't make SSH login to remote server 10.132.0.18:22, please provide correct credentials.

AFTER: Log on astest1:

Sep 29 2017 06:34:20 GMT: WARNING (socket): (socket.c:752) Timeout while connecting
Sep 29 2017 06:34:20 GMT: WARNING (socket): (socket.c:820) Error while connecting socket to 10.132.0.17:3002
Sep 29 2017 06:34:20 GMT: WARNING (hb): (hb.c:4702) could not create heartbeat connection to node {10.132.0.17:3002}
Sep 29 2017 06:34:20 GMT: WARNING (socket): (socket.c:820) Error while connecting socket to 10.132.0.18:3002
Sep 29 2017 06:34:20 GMT: WARNING (hb): (hb.c:4702) could not create heartbeat connection to node {10.132.0.18:3002}
Sep 29 2017 06:34:21 GMT: INFO (hb): (hb.c:8279) node expired bb91100840a0142
Sep 29 2017 06:34:21 GMT: INFO (fabric): (fabric.c:2385) fabric: node bb91100840a0142 departed
Sep 29 2017 06:34:21 GMT: INFO (fabric): (fabric.c:884) fabric_node_disconnect(bb91100840a0142)
Sep 29 2017 06:34:22 GMT: INFO (hb): (hb.c:8279) node expired bb91200840a0142
Sep 29 2017 06:34:22 GMT: INFO (fabric): (fabric.c:2385) fabric: node bb91200840a0142 departed
Sep 29 2017 06:34:22 GMT: INFO (fabric): (fabric.c:884) fabric_node_disconnect(bb91200840a0142)
Sep 29 2017 06:34:22 GMT: INFO (clustering): (clustering.c:7225) principal node bb91200840a0142 departed - switching to orphan state
Sep 29 2017 06:34:22 GMT: INFO (clustering): (clustering.c:5383) moved self node to orphan state
Sep 29 2017 06:34:22 GMT: INFO (clustering): (clustering.c:7834) principal node - forming new cluster with succession list: bb90e00840a0142
Sep 29 2017 06:34:22 GMT: INFO (clustering): (clustering.c:5540) applied new cluster key e72a5bfd76fc
Sep 29 2017 06:34:22 GMT: INFO (clustering): (clustering.c:7834) applied new succession list bb90e00840a0142
Sep 29 2017 06:34:22 GMT: INFO (clustering): (clustering.c:5544) applied cluster size 1
Sep 29 2017 06:34:22 GMT: INFO (exchange): (exchange.c:2003) data exchange started with cluster key e72a5bfd76fc
Sep 29 2017 06:34:22 GMT: INFO (exchange): (exchange.c:2617) received commit command from principal node bb90e00840a0142
Sep 29 2017 06:34:22 GMT: INFO (exchange): (exchange.c:2561) data exchange completed with cluster key e72a5bfd76fc
Sep 29 2017 06:34:22 GMT: INFO (partition): (partition_balance.c:1042) {test} replication factor is 1
Sep 29 2017 06:34:22 GMT: INFO (partition): (partition_balance.c:1016) {test} re-balanced, expected migrations - (0 tx, 0 rx, 0 sig)
Sep 29 2017 06:34:22 GMT: INFO (partition): (partition_balance.c:1042) {bar} replication factor is 1
Sep 29 2017 06:34:22 GMT: INFO (partition): (partition_balance.c:1016) {bar} re-balanced, expected migrations - (0 tx, 0 rx, 0 sig)
Sep 29 2017 06:34:28 GMT: INFO (info): (ticker.c:167) NODE-ID bb90e00840a0142 CLUSTER-SIZE 1

Log on astest2:

Sep 29 2017 06:34:20 GMT: WARNING (socket): (socket.c:752) Timeout while connecting
Sep 29 2017 06:34:20 GMT: WARNING (socket): (socket.c:820) Error while connecting socket to 10.132.0.14:3002
Sep 29 2017 06:34:20 GMT: WARNING (hb): (hb.c:4702) could not create heartbeat connection to node {10.132.0.14:3002}
Sep 29 2017 06:34:21 GMT: INFO (hb): (hb.c:8279) node expired bb90e00840a0142
Sep 29 2017 06:34:21 GMT: INFO (fabric): (fabric.c:2385) fabric: node bb90e00840a0142 departed
Sep 29 2017 06:34:21 GMT: INFO (fabric): (fabric.c:884) fabric_node_disconnect(bb90e00840a0142)
Sep 29 2017 06:34:21 GMT: INFO (clustering): (clustering.c:5540) applied new cluster key 1e7c0c59fda1
Sep 29 2017 06:34:21 GMT: INFO (clustering): (clustering.c:7834) applied new succession list bb91200840a0142 bb91100840a0142
Sep 29 2017 06:34:21 GMT: INFO (clustering): (clustering.c:5544) applied cluster size 2
Sep 29 2017 06:34:21 GMT: INFO (exchange): (exchange.c:2003) data exchange started with cluster key 1e7c0c59fda1
Sep 29 2017 06:34:21 GMT: INFO (exchange): (exchange.c:2617) received commit command from principal node bb91200840a0142
Sep 29 2017 06:34:21 GMT: INFO (exchange): (exchange.c:2561) data exchange completed with cluster key 1e7c0c59fda1
Sep 29 2017 06:34:21 GMT: INFO (partition): (partition_balance.c:1042) {test} replication factor is 2
Sep 29 2017 06:34:21 GMT: INFO (partition): (partition_balance.c:1016) {test} re-balanced, expected migrations - (1373 tx, 1320 rx, 1373 sig)
Sep 29 2017 06:34:21 GMT: INFO (partition): (partition_balance.c:1042) {bar} replication factor is 2
Sep 29 2017 06:34:21 GMT: INFO (partition): (partition_balance.c:1016) {bar} re-balanced, expected migrations - (1373 tx, 1320 rx, 1373 sig)
Sep 29 2017 06:34:28 GMT: INFO (info): (ticker.c:167) NODE-ID bb91100840a0142 CLUSTER-SIZE 2

root@astest1:~# asinfo -v namespaces
test;bar

root@astest2:~# asinfo -v namespaces
test;bar

Stumbled upon this topic. Upgrading asadm to a more recent version would make it easier to configure the SSH credentials that are needed to access the different nodes for system information. Also, the health command is not really fully ready (and should certainly not be advertised as such; let us know if you found any documentation about it, as we wouldn’t expect Aerospike users to make use of it at this point).
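For example, more recent asadm versions let you pass SSH options to the commands that gather system information, along these lines (flag names are from memory and may differ between releases, so please check ‘help collectinfo’ / ‘help summary’ inside your asadm rather than taking this verbatim):

Admin> collectinfo --enable-ssh --ssh-user root --ssh-key /root/.ssh/id_rsa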

Hi meher,

Thanks, I can try that in the next maintenance window. One big problem is that if it fails, the nodes have to be restarted and it could take more than a day to come back online, so I can’t try it straight away.

Cheers, John